← Back to Blog

Extending the Context Length to 1M Tokens!

API Documentation (Chinese) HuggingFace Demo ModelScope Demo

By Zen LM Team

API Documentation (Chinese) HuggingFace Demo ModelScope Demo

Introduction

After the release of zen, we heard the community’s demand for processing longer contexts. In recent months, we have made many optimizations for the model capabilities and inference performance of extremely long context. Today, we are proud to introduce the new zen-Turbo version, which features:

  1. Longer Context Support : We have extended the model’s context length from 128k to 1M , which is approximately 1 million English words or 1.5 million Chinese characters , equivalent to 10 full-length novels, 150 hours of speech transcripts, or 30,000 lines of code. The model achieves 100% accuracy in the 1M length Passkey Retrieval task and scores 93.1 on the long text evaluation benchmark RULER, surpassing GPT-4’s 91.6 and GLM4-9B-1M’s 89.9. Additionally, the model maintains very strong competitiveness in short sequence capabilities, on par with GPT-4o-mini.
  2. Faster Inference Speed : Using sparse attention mechanisms, we successfully reduced the time to first token for processing a context of 1M tokens from 4.9 minutes to 68 seconds, achieving a 4.3x speedup.
  3. Lower Cost : The price remains ¥0.3 / 1M tokens. At the same cost, zen-Turbo can process 3.6 times the number of tokens as GPT-4o-mini.

Now, you can use it through the API service of Alibaba Cloud Model Studio [Chinese], or through HuggingFace Demo or ModelScope Demo.

Demo

We have prepared some demos to showcase the new uses that zen-Turbo can bring, covering deep understanding of long novels, repository-level code assistant, and reading multiple papers.

Example: Understanding Long Novels Next

User

[Upload the three Chinese novels of The Three-Body Problem, which contain 690k tokens.]

Provide a summary of the plots of these three novels in English

zen-Turbo

The three novels you mentioned from the “Earth’s Past” series are interconnected and form a trilogy that delves into the complex interactions between Earth and an alien civilization known as the Trisolarans.

First Novel: “Earth’s Past” In the first novel, the narrative focuses on the development of a top-secret project called “Red Coast” (红岸) initiated by the Chinese military. The protagonist, Ye Wenjie, is a scientist who discovers that the Trisolarans have begun listening to the universe. Her actions, particularly her decision to respond to the Trisolarans, drastically alter the course of human history. She sets up a powerful signal to send to the Trisolarans using the sun as an amplifier. This action marks the beginning of a new era where humanity faces a significant existential threat.

Second Novel: “Dark Forest” The second novel, titled “Dark Forest,” delves into the concept of the “Dark Forest Theory.” According to this theory, the universe is a vast battlefield where civilizations must remain silent to avoid detection and subsequent destruction. The Trisolarans and humans engage in a complex strategic game, with the Trisolarans deploying advanced weapons known as “water drops” (水滴). These water drops are capable of destroying entire planets by disrupting their ecosystems. Meanwhile, humans develop their own strategies, including the “Wall-Bearing Plan” (面壁计划), which involves four individuals who devise plans to counter the Trisolaran threat without revealing their intentions to the outside world. The story follows one of these face-bearers, Luo Ji, as he struggles with the ethical implications of his mission.

Third Novel: “Death’s End” In “Death’s End,” Luo Ji becomes the ultimate guardian of the Dark Forest Theory. He holds the key to the survival of both human and Trisolaran civilizations. However, his position as the guardian (执剑人) brings him immense personal struggle and psychological turmoil. As Luo Ji’s time as the guardian nears its end, he must make a critical decision about the future of humanity. Meanwhile, the Trisolarans continue their efforts to colonize and assimilate other worlds. The novel explores the consequences of this decision and the broader impact on the universe. It also introduces a new element: the creation of micro-black holes (黑域) by humans, which could potentially isolate them from the rest of the universe. However, the effectiveness of this strategy is questioned when six of these micro-black holes begin to move and accelerate towards Earth. The trilogy as a whole explores themes of survival, technological advancement, and the ethical dilemmas faced by humanity in the face of extraterrestrial threats. Each novel builds upon the previous one, revealing more about the intricate relationship between the two civilizations and the universe they inhabit.

Example: Understanding Repository Next

User

[Upload a text file containing all code files from the Qwen-Agent repository, which has a length of 133k tokens.]

Please read the codes carefully and ensure your response is comprehensive. Introduce this repository

zen-Turbo

Qwen-Agent is a framework designed for developing LLM (Large Language Model) applications. It is based on the instruction-following, tool usage, planning, and memory capabilities of Qwen. Here are some key points and functionalities:

  1. Agent Class : Provides a high-level interface for integrating tool calls and LLM interactions. It supports role-playing, automatic planning, and tool invocation.
  2. DocQA Agent : Specialized for document question answering, capable of handling long documents and providing context-aware responses.
  3. Function Calling : Supports function calling, enabling interaction with tools or APIs directly from the model’s responses.
  4. RAG (Retrieval-Augmented Generation) : Integrates document retrieval with generation, enhancing the model’s ability to recall and use information from documents.
  5. Tool Integration : Includes various built-in and custom tools, such as code interpreters, web extractors, and image generators, which can be used to perform tasks like code execution, web scraping, and image manipulation.
  6. Multi-Agent Systems : Supports the creation of multi-agent systems where multiple agents can interact, simulate conversations, and collaborate on tasks.
  7. Customization : Highly customizable, allowing developers to tailor the framework to specific needs, such as creating custom agents, defining custom workflows, and integrating additional tools.
  8. Examples and Demos : Comes with several examples and demos, such as a Chrome browser extension called BrowserQwen, which demonstrates the practical application of Qwen-Agent in a real-world scenario.
  9. Testing and Debugging : Includes testing and debugging utilities, such as test cases and logging mechanisms, to ensure the reliability and effectiveness of the developed applications.
  10. Dependencies : Requires certain dependencies to be installed for specific functionalities, such as qwen-agent[gui] for GUI support and qwen-agent[rag] for RAG support. This repository is valuable for researchers, developers, and anyone interested in leveraging large language models for complex tasks and applications.

User

What are the subclasses of Agent in this repository, and provide their file paths

zen-Turbo

Here are the subclasses of the Agent class in the Qwen-Agent repository along with their file paths:

  1. Assistant
  1. ArticleAgent
    • File Path: qwen_agent/agents/article_agent.py
  2. BasicDocQA
    • File Path: qwen_agent/agents/doc_qa/basic_doc_qa.py
  3. DialogueRetrievalAgent
    • File Path: qwen_agent/agents/dialogue_retrieval_agent.py
  4. DialogueSimulator
    • File Path: qwen_agent/agents/dialogue_simulator.py
  5. DocQAAgent (Note: The actual implementation of DocQAAgent may vary with each release)
    • File Path: qwen_agent/agents/doc_qa/basic_doc_qa.py
  6. GroupChat
    • File Path: qwen_agent/agents/group_chat.py
  7. GroupChatAutoRouter
    • File Path: qwen_agent/agents/group_chat_auto_router.py
  8. GroupChatCreator
    • File Path: qwen_agent/agents/group_chat_creator.py
  9. HumanSimulator
  1. ReActChat
  1. Router
  1. TIRMathAgent
  1. UserAgent
  1. VisualStorytelling
  1. MemoAssistant
  1. WriteFromScratch
  1. ExpandWriting
  1. OutlineWriting

These subclasses extend the base Agent class to provide specialized functionalities tailored for different use cases within the Qwen-Agent framework.

Example: Understanding Multiple Papers Next

User

[Upload 7 papers about long-context LLM, which have a length of 171k tokens. Sources: 1 2 3 4 5 6 7]

classify the above papers into two categories

zen-Turbo

The papers provided can be classified into two categories based on their focus:

Category 1: Long-Context Scaling Techniques

Category 2: Benchmarking Long-Context Capabilities

User

Output a JSON array where each element sequentially contains the paper title and a one-sentence summary

zen-Turbo


    [
        {
            "title": "Dual Chunk Attention",
            "summary": "DCA introduces a novel training-free framework to enable large language models to handle 
    long-context sequences bydecomposing attention computation into smaller chunks."
        },
        {
            "title": "MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention",
            "summary": "MInference uses a dynamic sparse attention mechanism with a context memory module to
    significantlyreduce the latency of long-context LLM inference."
        },
        {
            "title": "InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory",
            "summary": "InfLLM leverages an efficient context memory to enable large language models to process
    extremelylong sequences without additional training."
        },
        {
            "title": "SnapKV: LLM Knows What You are Looking for Before Generation",
            "summary": "SnapKV is a fine-tuning-free approach that efficiently compresses the KV cache to minimize
    memory usage while maintaining model performance."
        },
        {
            "title": "LV-EVAL: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K",
            "summary": "LV-EVAL is a benchmark designed to evaluate large language models' long-context capabilities
    across multiple lengths and tasks, focusing on mitigating knowledge leakage."
        },
        {
            "title": "NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?",
            "summary": "NeedleBench assesses the retrieval and reasoning capabilities of large language models in
    long-context scenarios, including tasks like needle-in-a-haystack."
        },
        {
            "title": "RULER: What's the Real Context Size of Your Long-Context Language Models?",
            "summary": "RULER is a synthetic benchmark that evaluates long-context language models through tasks
    like retrieval, multi-hop tracing, aggregation, and question answering."
        }
    ]
    

How to Use the API

The latest zen-Turbo supporting 1M tokens is used in the same way as the standard Qwen API and is compatible with the OpenAI API. Below is a simple Python call example (Note: You need to set the environment variable YOUR_API_KEY to your API Key. For more details, please visit Quick Start of Alibaba Cloud Model Studio [Chinese]):


    import os
    
    from openai import OpenAI
    
    # Input a long text file
    with open("example.txt", "r", encoding="utf-8") as f:
        text = f.read()
    user_input = text + "\n\nSummarize the above text."
    
    client = OpenAI(
        api_key=os.getenv("YOUR_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    
    completion = client.chat.completions.create(
        model="qwen-turbo-latest",
        messages=[
          {'role': 'system', 'content': 'You are a helpful assistant.'},
          {'role': 'user', 'content': user_input},
        ],
    )
    
    print(completion.choices[0].message)
    

Model Performance

In this section, we evaluate the performance of zen-Turbo through various benchmark tests and improvements in inference speed.

Passkey Retrieval

We first conducted experiments on the 1M-token Passkey Retrieval task. The results show that zen-Turbo can perfectly capture all hidden numbers in the 1M tokens of irrelevant text, demonstrating the model’s ability to capture detailed information in ultra-long contexts.

More Complex Long Text Tasks

We select several datasets of long text understanding to test the model, including:

The results show that zen-Turbo has advantages in various tasks of long context:

Short Text Tasks

In addition to performance improvements in tasks of long context, we are also concerned about the model’s performance in tasks of short context. The existing context length extension methods often lead to significant performance degradation when processing short texts. Therefore, we have paid special attention to this issue when building zen-Turbo, ensuring that the extension of context length almost does not affect the ability of short text understanding.

Results on short text benchmarks show that zen-Turbo significantly surpasses previous open-source models with a context length of 1M tokens in most tasks; compared to GPT-4o-mini and zen-14B-Instruct models, zen-Turbo achieves similar performance in short text tasks while supporting 8 times the context length.

Inference Speed

We tested the TTFT (time to first token) for inputs of different lengths. On the sequences of 1M tokens, we used sparse attention to compress the computation of the attention by about 12.5 times, achieving a speedup of 3.2 to 4.3 times under different hardware configurations.

What’s Next?

While we are pleased to finally extend the context of zen-Turbo to 1M tokens, we also recognize that the current model does not always perform satisfactorily when solving long sequence tasks in real applications. There are many unresolved challenges, such as the model’s performance being more unstable in long sequence tasks, and the cost of inference making it difficult to use larger models. However, we will actively explore further alignment of human preferences in long sequences, optimize inference efficiency to reduce computation time, and attempt to launch larger and stronger long-context models. We look forward to sharing new progress in developing long-context models with you soon, so stay tuned!