Managing Long Context

As LLM-based agents often engage in multi-turn, prolonged, or complex workflows, a key challenge is that their context can grow extremely long, which significantly raises computational costs, slows down inference, and often reduces accuracy due to models' bias toward recent inputs. Naively using maximal-length contexts is therefore impractical for real-world agents. To address this, there are three primary techniques: Context Compression, Context Reuse, and Hierarchical Context Management.

Context Compression

The first strategy is context compression, which reduces the prompt to the minimum information necessary for the current step. In Retrieval-Augmented Generation (RAG), a large knowledge source is segmented and indexed so that, at query time, only the top-k relevant chunks are inserted into the prompt instead of the whole corpus. In addition, agents can summarize or merge prior turns to keep the running transcript short, and they can offload long-term information to external memory (e.g., databases or vector stores) and re-inject it only on demand, so the working prompt stays small by default.
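The sketch below illustrates these ideas in plain Python. It is a minimal illustration, not a production pipeline: the helper names (retrieve_top_k, compress_history, build_prompt) are hypothetical, retrieval is approximated by word overlap rather than embeddings, and the "summary" of older turns is a simple truncation standing in for an LLM-generated summary.

```python
# Minimal sketch of context compression (assumed helper names, no real retriever).

def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Score each chunk by naive word overlap with the query and keep the top k.
    A real RAG system would use embeddings and a vector store instead."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:k]

def compress_history(turns: list[str], keep_last: int = 4) -> list[str]:
    """Keep the most recent turns verbatim; collapse older ones into one summary line.
    In practice, the summary would come from an LLM call rather than truncation."""
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = "Summary of earlier turns: " + " | ".join(t[:60] for t in older)
    return [summary] + recent

def build_prompt(system: str, history: list[str], query: str, corpus: list[str]) -> str:
    """Assemble a small working prompt: system text, compressed history, retrieved chunks, query."""
    context = "\n".join(retrieve_top_k(query, corpus))
    transcript = "\n".join(compress_history(history))
    return f"{system}\n\n[Retrieved context]\n{context}\n\n[Conversation]\n{transcript}\n\nUser: {query}"

# Example: only the chunks relevant to the query enter the prompt.
corpus = ["Paris is the capital of France.", "The Nile is a river in Africa.",
          "Python 3.12 added a new REPL.", "France uses the euro as currency."]
print(build_prompt("You are a helpful agent.", ["User: hi", "Agent: hello"],
                   "What currency does France use?", corpus))
```

The key design point is that the prompt is rebuilt from small, selected pieces on every turn instead of accumulating the full transcript and corpus.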

Context Reuse

The second strategy is context reuse. Intuitively, prompt caching lets the system treat an unchanged prefix as "already processed", so it does not need to be recomputed on every turn. Concretely, servers reuse precomputed key–value attention states to skip redundant prefix computation, reducing latency and cost. To make caching effective, maintain a stable prefix—system role, shared tools, policies—so cache hits remain consistent across multi-turn conversations. When tool availability must change, avoid editing the prompt; keep all tool definitions in the prefix and switch tools on or off during decoding (mask, don't remove) to preserve a byte-identical, cache-friendly prefix while constraining actions.
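The following sketch shows what a cache-friendly prompt layout might look like. It assumes a serving stack that supports prefix caching and some decode-time constraint mechanism (a tool allow-list or logit mask); the names STABLE_PREFIX, allowed_tools, and tool_allow_list are illustrative, not any specific provider's API.

```python
# Minimal sketch of a cache-friendly prefix with decode-time tool masking
# (assumed field names; the enforcement mechanism is provided by the server).

TOOLS = {
    "search":  "search(query: str) -> results",
    "browse":  "browse(url: str) -> page text",
    "execute": "execute(code: str) -> stdout",
}

# All tool definitions live in the prefix, in a fixed order, so the prefix bytes
# never change between turns and cached key-value states can be reused.
STABLE_PREFIX = (
    "System: You are a careful research agent.\n"
    "Policies: cite sources; never run untrusted code.\n"
    "Tools:\n" + "\n".join(f"- {name}: {sig}" for name, sig in sorted(TOOLS.items()))
)

def allowed_tools(turn_state: dict) -> set[str]:
    """Decide which tools may be called this turn. The prompt is NOT edited;
    the serving layer enforces this set during decoding (mask, don't remove)."""
    if turn_state.get("sandbox_only"):
        return {"search"}  # disable browsing/execution without touching the prefix
    return set(TOOLS)

def build_request(turn_state: dict, new_messages: str) -> dict:
    """Same prefix every turn (cache hit); only the suffix and the tool mask vary."""
    return {
        "prompt": STABLE_PREFIX + "\n\n" + new_messages,
        "tool_allow_list": sorted(allowed_tools(turn_state)),
    }

print(build_request({"sandbox_only": True}, "User: find recent papers on prompt caching"))
```

Because the prefix stays byte-identical across turns, the server can keep reusing its cached attention states, while the allow-list still constrains which tools the agent may invoke.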

Hierarchical Context Management

The third strategy is hierarchical context management: rather than continually expanding a single transcript, we structure contexts according to role and scope. An orchestrator manages global goals and constraints, delegating detailed execution to specialized sub-agents, each operating with concise, role-specific prompts. Between these agents, structured summaries, such as key inputs, assumptions, and outputs, are passed instead of lengthy transcripts, ensuring each stage uses a minimal, cache-friendly context. Additionally, capabilities are organized into modular and reusable skill bundles, loaded into context only as needed, further limiting prompt size and complexity. Such hierarchical designs are increasingly common in frameworks and community practices focused on subagent orchestration and modular skill management.
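The sketch below shows one way this structure might look in code. It is a simplified, single-process illustration with hypothetical names (Orchestrator, SubAgent, StageSummary) and a stub call_llm; a real system would call an LLM per agent and load skill bundles dynamically.

```python
# Minimal sketch of hierarchical context management (assumed class names, stub model call).

from dataclasses import dataclass

@dataclass
class StageSummary:
    """Structured hand-off between agents: key inputs, assumptions, and outputs only."""
    task: str
    assumptions: list[str]
    outputs: str

def call_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:50]}...]"  # stand-in for a real model call

class SubAgent:
    def __init__(self, role: str, skills: list[str]):
        self.role, self.skills = role, skills  # skills loaded only for this role

    def run(self, summary: StageSummary) -> StageSummary:
        # The sub-agent sees a short, role-specific prompt, not the full transcript.
        prompt = (f"Role: {self.role}\nSkills: {', '.join(self.skills)}\n"
                  f"Task: {summary.task}\nAssumptions: {'; '.join(summary.assumptions)}")
        return StageSummary(summary.task, summary.assumptions, call_llm(prompt))

class Orchestrator:
    """Holds global goals and constraints; delegates detailed execution to sub-agents."""
    def __init__(self, goal: str):
        self.goal = goal
        self.agents = {
            "research": SubAgent("research", skills=["web_search"]),
            "writing":  SubAgent("writing",  skills=["summarize", "cite"]),
        }

    def run(self) -> StageSummary:
        summary = StageSummary(task=self.goal, assumptions=["audience: general"], outputs="")
        for stage in ("research", "writing"):  # each stage receives a compact hand-off
            summary = self.agents[stage].run(summary)
        return summary

print(Orchestrator("Write a short brief on prompt caching").run())
```

Each stage works from a compact StageSummary rather than the full conversation, so no single prompt has to grow with the length of the overall workflow.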

If you find this work helpful, please consider citing our paper:

@article{hu2025hands,
  title={Hands-on LLM-based Agents: A Tutorial for General Audiences},
  author={Hu, Shuyue and Ren, Siyue and Chen, Yang and Mu, Chunjiang and Liu, Jinyi and Cui, Zhiyao and Zhang, Yiqun and Li, Hao and Zhou, Dongzhan and Xu, Jia and others},
  journal={Hands-on},
  volume={21},
  pages={6},
  year={2025}
}