
Generalizing an LLM from 8k to 1M Context using Qwen-Agent

By Qwen Team


TL;DR: We’ve created an agent using Qwen models with an 8k context size to understand documents with 1M tokens, surpassing RAG and native long-context models. This agent was also used to generate data for training new long-context Qwen models.

Introduction

Recently, there has been a surge of interest in LLMs that can natively process sequences of millions of tokens. Most work has focused on sophisticated mathematical tweaks, such as RoPE-based extrapolation, or architectural overhauls, such as non-Transformer LLMs. However, preparing fine-tuning data that is sufficiently long is a less discussed but equally important topic.

We adopt the following approach:

  1. We use a weak 8k-context chat model to build a relatively strong agent capable of handling 1M-token contexts.
  2. Subsequently, we synthesize fine-tuning data using the agent and apply automated filtering to ensure quality.
  3. Finally, we use the synthetic data to fine-tune a pretrained model, resulting in a strong 1M-context chat model.

This blog primarily focuses on Step 1, with details of the subsequent steps to be revealed in the coming weeks or months.

Building the Agent

The agent we are building consists of three levels of complexity, each building upon the previous one.

Level 1: Retrieval-Augmented Generation

A naive approach to processing a 1M-token context is to simply use retrieval-augmented generation (RAG). RAG divides the context into shorter chunks, each no longer than 512 tokens for example, and then retains only the most relevant chunks within an 8k-token context.

The challenge lies in how to pinpoint the chunks that are the most relevant. After several trials, we have come up with a keyword-based solution:

Dataflows of retrieval-augmented generation

We have also experimented with vector-based retrieval. However, in most cases, it does not offer a significant enough improvement to outweigh the additional complexity that arises from the necessity of deploying a separate embedding model.

RAG Code
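As a minimal sketch of the keyword-based retrieval described above (not Qwen-Agent’s actual implementation; the function names, the whitespace tokenizer, and the overlap-count scoring rule are illustrative assumptions):

```python
from collections import Counter

def chunk_text(text, max_tokens=512):
    """Split a document into chunks of at most max_tokens whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def keyword_score(query, chunk):
    """Score a chunk by how often the query's keywords occur in it."""
    query_words = set(query.lower().split())
    chunk_counts = Counter(chunk.lower().split())
    return sum(chunk_counts[w] for w in query_words)

def retrieve(query, text, max_tokens=512, budget_tokens=8000):
    """Keep the highest-scoring chunks that fit within the 8k-token budget."""
    chunks = chunk_text(text, max_tokens)
    ranked = sorted(chunks, key=lambda c: keyword_score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())
        if used + n > budget_tokens:
            break
        selected.append(chunk)
        used += n
    return selected
```

A production version would use a real tokenizer and smarter keyword extraction (e.g., dropping stop words), but the dataflow is the same: chunk, score against the query, keep the best chunks that fit the context window.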

Level 2: Chunk-by-Chunk Reading

The aforementioned RAG approach is fast but often fails when the relevant chunks share too little keyword overlap with the user query: such chunks are never retrieved and thus never shown to the model. Although vector retrieval can in theory mitigate this issue, in practice it frequently does not.

To address this limitation, we employ a brute-force strategy to reduce the chance of missing relevant context:

Dataflows of chunk-by-chunk reading

Agent Code
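The brute-force pass can be sketched as follows; `llm_call` is a stand-in for any 8k-context chat-model call, and the prompt wording and the `NONE` convention are illustrative assumptions, not Qwen-Agent’s actual prompts:

```python
def read_chunk_by_chunk(query, chunks, llm_call):
    """Ask the model about every chunk, collect relevant notes, then answer.

    `llm_call(prompt) -> str` is a stand-in for an 8k-context chat model.
    """
    notes = []
    for chunk in chunks:
        prompt = (
            f"Context:\n{chunk}\n\n"
            f"Question: {query}\n"
            "If the context is relevant to the question, extract the relevant "
            "facts; otherwise reply with exactly NONE."
        )
        answer = llm_call(prompt)
        if answer.strip() != "NONE":
            notes.append(answer)
    # A final call synthesizes the answer from the collected notes.
    summary_prompt = "Notes:\n" + "\n".join(notes) + f"\n\nQuestion: {query}\nAnswer:"
    return llm_call(summary_prompt)
```

This trades latency for recall: every chunk is read, so no relevant passage can be missed at retrieval time, at the cost of one model call per chunk.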

Level 3: Step-by-Step Reasoning

A classic challenge in document-based question-answering is multi-hop reasoning. For example, consider answering the question “What vehicle was invented in the same century as the Fifth Symphony was composed?” when given a long document containing relevant facts. The model needs to first determine the answer to the sub-question “In which century was the Fifth Symphony composed?” which is the 19th century. Then, it can realize that a chunk containing “Bicycles were invented in the 19th century” is actually relevant to the original question.

Tool-calling (also known as function-calling) agents or ReAct agents are classic solutions that have built-in capabilities for question decomposition and step-by-step reasoning. We therefore wrap the aforementioned Level-2 agent as a tool to be called by a tool-calling agent. The tool-calling agent conducts multi-hop reasoning as follows:


    Ask the Lv3-Agent a question.
    while (the Lv3-Agent cannot answer the question based on its memory) {
        The Lv3-Agent proposes a new sub-question to be answered.
        The Lv3-Agent asks the Lv2-Agent the sub-question.
        Add the Lv2-Agent's response to the Lv3-Agent's memory.
    }
    The Lv3-Agent provides the final answer to the original question.
    

Dataflows of step-by-step reasoning

For example, the Lv3-Agent initially poses a sub-question to the Lv2-Agent: “In which century was Beethoven’s Fifth Symphony composed?” Upon receiving the response, “the 19th century,” the Lv3-Agent formulates a subsequent sub-question: “What vehicle was invented during the 19th century?” By consolidating all the feedback from the Lv2-Agent, the Lv3-Agent can then answer the original question: “What vehicle was invented in the same century that the Fifth Symphony was composed?”
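The loop above can be sketched in Python. Here `llm_call` and `lv2_agent` are stand-ins for a chat-model call and the Level-2 agent, and the `ANSWER:`/`ASK:` text protocol is an illustrative simplification of real tool-calling:

```python
def multi_hop_answer(question, lv2_agent, llm_call, max_hops=5):
    """Lv3 loop: decompose the question into sub-questions for the Lv2 agent,
    accumulating answers in memory until the original question is answerable.

    `lv2_agent(sub_question) -> str` and `llm_call(prompt) -> str` are stand-ins.
    """
    memory = []
    for _ in range(max_hops):
        state = "\n".join(memory) or "(empty)"
        decision = llm_call(
            f"Memory:\n{state}\n\nQuestion: {question}\n"
            "If the memory is sufficient, reply ANSWER: <answer>. "
            "Otherwise reply ASK: <sub-question>."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        sub_question = decision[len("ASK:"):].strip()
        memory.append(f"Q: {sub_question}\nA: {lv2_agent(sub_question)}")
    return "Unable to answer within the hop budget."
```

The `max_hops` cap prevents an unanswerable question from looping forever; a real tool-calling agent would express the same decision via structured function calls rather than string prefixes.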

Experiments

We conducted experiments on two benchmarks designed for 256k contexts:

We compared the following methods:

The empirical data reveals:

Overall, the 32k-Model should ideally outperform all the other methods if it receives proper training. In practice, however, it is under-trained and underperforms the 4k-Agent.

Finally, we have also tested the agent on a 1-million-token pressure test (finding a single needle in a haystack of 1 million tokens) and found that it functioned properly. However, we still lack a more reliable quantitative benchmark for evaluating its performance in handling contexts of 1 million tokens in real-world applications.

Conclusion

In this blog, we have introduced how to build an agent capable of handling 1M-token contexts using an 8k-context model. Once the agent is ready, synthesizing the data becomes straightforward. For instance, we could enlist volunteers to interact with the agent and record the outcomes to construct a fine-tuning dataset. Additionally, we can use the agent to cross-validate data generated by other methods to ensure its quality. Moreover, the general idea of distilling an agent into a model applies to other settings as well, such as enhancing a model’s ability to solve long-horizon tasks.

What’s More

Qwen-Agent, our open-source RAG and agent framework, which began as internal utility code to facilitate model development, has recently undergone rapid development. We have released an implementation of the aforementioned long-context agent in the framework.

We hope to provide you with models that have improved capabilities for handling long contexts, as well as a more user-friendly infrastructure framework in the near future.

Citation


    @misc{qwen-agent-2405,
        title = {Generalizing an LLM from 8k to 1M Context using Qwen-Agent},
        url = {https://qwenlm.github.io/blog/qwen-agent-2405/},
        author = {Qwen Team},
        month = {May},
        year = {2024}
    }