# Context Management
Strategies for working within context window limits: summarization, selective loading, and memory patterns for agent skills.
You’re three steps into a codebase refactor when the agent starts repeating itself. It asks you a question you already answered two messages ago. It re-reads a file it summarized five minutes earlier. Then it loses track of which files it already modified and starts proposing changes that conflict with its own earlier work. The conversation hasn’t gotten confusing; the agent just ran out of room to think.
This is what happens when context management goes wrong. Every agent operates within a context window, a fixed amount of text it can consider at once, and when that fills up, the agent’s reasoning degrades fast. These strategies apply whether you’re designing individual skills or orchestrating multi-step workflows.
## Understanding the context budget
Before you can optimize, you need to know where your context is going. A typical agent session breaks down roughly like this:
| Component | Typical size | Notes |
|---|---|---|
| System prompt | 500-2,000 tokens | Instructions, personality, constraints |
| Tool definitions | 1,000-5,000 tokens | Scales with number of available tools |
| Conversation history | 2,000-50,000 tokens | Grows with each turn |
| Tool results | 500-20,000+ tokens per call | Largest variable; a single file read can be huge |
| Agent reasoning | 1,000-5,000 tokens per turn | Chain-of-thought, planning |
The biggest offender is almost always tool results. A single file read can consume thousands of tokens. A search across a codebase might return hundreds of matches. Without careful management, a few tool calls can eat your entire budget.
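Actual token counts depend on the model's tokenizer, but even a rough accounting pass shows where the budget goes. A minimal sketch in Python, assuming the common rule of thumb of roughly 4 characters per token (the component names and window size are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text and code."""
    return max(1, len(text) // 4)


def budget_report(components: dict[str, str], window: int = 128_000) -> dict:
    """Estimate how much of the context window each component consumes."""
    usage = {name: estimate_tokens(text) for name, text in components.items()}
    total = sum(usage.values())
    return {
        "per_component": usage,
        "total": total,
        "remaining": window - total,
    }
```

Running a report like this after each tool call makes the "largest variable" row in the table above visible in practice: tool results usually dwarf everything else.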
## Summarization and compression strategies
The core idea is simple: don’t keep raw data in context when a summary will do.
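Before any smarter strategy, even naive clipping beats letting raw output flow into context unbounded. A minimal sketch, assuming roughly 4 characters per token:

```python
def clip_for_context(text: str, max_tokens: int = 500) -> str:
    """Truncate a tool result to a token budget, flagging what was dropped."""
    max_chars = max_tokens * 4  # rough ~4 chars/token assumption
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return text[:max_chars] + f"\n[... truncated, {dropped} characters omitted]"
```

The explicit truncation marker matters: it tells the agent that more data exists, so it can ask for the rest instead of treating the clipped text as complete.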
### Summarize tool results immediately
When a skill returns large results, summarize them before they enter the agent’s working memory. This can happen at the skill level (the skill itself returns a summary) or at the orchestration level (a post-processing step compresses the output).
// Bad: returning raw file contents into context
async function readFile(path: string): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  return { content }; // Could be 10,000+ tokens
}

// Better: return with metadata that helps the agent decide what to keep
import { promises as fs } from "node:fs";

type ToolResult = Record<string, unknown>;

interface ReadOptions {
  summaryOnly?: boolean;
  offset?: number; // 0-based line offset for paging through large files
}

async function readFile(
  path: string,
  options?: ReadOptions,
): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  const lines = content.split("\n");

  if (options?.summaryOnly) {
    return {
      path,
      lineCount: lines.length,
      language: detectLanguage(path),
      exports: extractExports(content),
      imports: extractImports(content),
      summary: `${lines.length} lines of ${detectLanguage(path)}. Key exports: ${extractExports(content).join(", ")}`,
    };
  }

  // If full content is requested but the file is large, truncate with
  // guidance, honoring the offset so the agent can page through sections
  const offset = options?.offset ?? 0;
  if (lines.length - offset > 200) {
    return {
      path,
      content: lines.slice(offset, offset + 200).join("\n"),
      truncated: true,
      totalLines: lines.length,
      message:
        "File truncated at 200 lines. Use the offset parameter to read later sections.",
    };
  }

  return { path, content: lines.slice(offset).join("\n"), truncated: false };
}
### Progressive detail loading
Start with high-level summaries and drill down only where needed. This is the single most effective strategy for context management.
async def explore_codebase(path: str) -> dict:
    """Level 1: Directory structure overview."""
    tree = await invoke("list_directory", path=path, recursive=True, depth=2)
    return {
        "structure": tree,
        "file_count": count_files(tree),
        "languages": detect_languages(tree),
        "hint": "Use read_file_summary for details on specific files.",
    }


async def read_file_summary(path: str) -> dict:
    """Level 2: File-level summary without full content."""
    content = await read_file(path)
    return {
        "path": path,
        "line_count": len(content.splitlines()),
        "functions": extract_function_signatures(content),
        "classes": extract_class_names(content),
        "imports": extract_imports(content),
        "hint": "Use read_file_section to read specific functions or line ranges.",
    }


async def read_file_section(path: str, start: int, end: int) -> dict:
    """Level 3: A specific section of a file (1-indexed, inclusive range)."""
    lines = (await read_file(path)).splitlines()
    return {
        "path": path,
        "range": f"lines {start}-{end} of {len(lines)}",
        "content": "\n".join(lines[start - 1 : end]),
    }
This three-level approach (overview, summary, detail) lets the agent navigate a large codebase while keeping context usage proportional to what it actually needs.
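To see the escalation end to end, here is a self-contained toy version over an in-memory "repo" (the file names, contents, and helper names are all invented for illustration). Answering "where is charge defined?" touches full file text exactly once:

```python
# Toy drill-down: a dict stands in for the filesystem.
REPO = {
    "src/auth.py": "def login(user):\n    return True\n",
    "src/billing.py": "def charge(amount):\n    return amount * 1.1\n",
}


def overview() -> list[str]:
    """Level 1: just the file list."""
    return sorted(REPO)


def file_summary(path: str) -> list[str]:
    """Level 2: signatures only, not bodies."""
    return [line for line in REPO[path].splitlines() if line.startswith("def ")]


def read_section(path: str, start: int, end: int) -> str:
    """Level 3: full text for a 1-indexed, inclusive line range."""
    lines = REPO[path].splitlines()
    return "\n".join(lines[start - 1 : end])


# Escalate only where the cheaper levels point:
target = next(p for p in overview() if "charge(" in "".join(file_summary(p)))
```

The cheap levels narrow the search to one file; only then does the agent pay for full content.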
## Selective context loading
Not everything needs to be in context at once. Skills should load information on demand rather than preloading everything upfront.
### Pattern: lazy loading with caching
class ContextManager {
  private cache = new Map<string, { data: unknown; accessedAt: Date }>();
  private maxCacheSize: number;

  constructor(maxCacheSize = 20) {
    this.maxCacheSize = maxCacheSize;
  }

  async get(key: string, loader: () => Promise<unknown>): Promise<unknown> {
    if (this.cache.has(key)) {
      const entry = this.cache.get(key)!;
      entry.accessedAt = new Date();
      return entry.data;
    }

    // Evict least recently accessed if at capacity
    if (this.cache.size >= this.maxCacheSize) {
      this.evictLeastRecent();
    }

    const data = await loader();
    this.cache.set(key, { data, accessedAt: new Date() });
    return data;
  }

  private evictLeastRecent(): void {
    let oldestKey = "";
    let oldestTime = new Date();
    for (const [key, entry] of this.cache) {
      if (entry.accessedAt < oldestTime) {
        oldestTime = entry.accessedAt;
        oldestKey = key;
      }
    }
    if (oldestKey) this.cache.delete(oldestKey);
  }
}
### Pattern: relevance-based filtering
When a search returns many results, filter by relevance before adding them to context. This is especially important for search skills that might match hundreds of files.
def filter_search_results(
    results: list[SearchResult],
    query_context: str,
    max_results: int = 10,
) -> list[SearchResult]:
    """Filter search results to the most relevant subset."""
    scored = []
    for result in results:
        score = 0
        # Paths that mention the query term score highest
        if query_context.lower() in result.path.lower():
            score += 10
        # Results in src/ are usually more relevant than node_modules/
        if "/src/" in result.path:
            score += 5
        if "node_modules" in result.path or "vendor" in result.path:
            score -= 20
        # More recently modified files are often more relevant
        if result.modified_days_ago < 7:
            score += 3
        scored.append((score, result))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [result for _, result in scored[:max_results]]
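Even a filtered list can be too long when matches cluster in a few places. A complementary trick, sketched here with invented names, is to replace the list with a per-directory digest the agent can scan before requesting specific paths:

```python
from collections import Counter
from pathlib import PurePosixPath


def digest_matches(paths: list[str], top: int = 5) -> str:
    """Collapse a long match list into a short per-directory digest."""
    counts = Counter(str(PurePosixPath(p).parent) for p in paths)
    top_dirs = counts.most_common(top)
    lines = [f"{d}/: {n} matches" for d, n in top_dirs]
    hidden = len(counts) - len(top_dirs)
    if hidden > 0:
        lines.append(f"...and {hidden} more directories")
    return f"{len(paths)} matches across {len(counts)} directories:\n" + "\n".join(lines)
```

A digest like this costs a few dozen tokens regardless of how many matches there were, and the agent can follow up on the directories that look relevant.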
## Memory patterns: short-term vs. long-term
Agent skills need different memory strategies depending on how long the information needs to stick around. For a non-technical introduction to these concepts, Agent Memory Patterns explains how agents remember and forget in plain language.
### Short-term memory: conversation context
Short-term memory lives in the current conversation. It’s fast and directly accessible, but ephemeral and limited by the context window. Most workflow state lives here.
Best practices for short-term memory:
- Summarize completed steps rather than keeping full results
- Drop intermediate results once downstream steps have consumed them
- Use structured summaries the agent can quickly scan
// Instead of keeping all raw results:
const rawResults = {
  step1: { /* 2000 tokens of data */ },
  step2: { /* 3000 tokens of data */ },
  step3: { /* 1500 tokens of data */ },
};

// Maintain a running summary:
const workingSummary = {
  completedSteps: ["fetch_data", "validate_schema", "transform"],
  keyFindings: [
    "Schema has 3 breaking changes in users table",
    "47 records failed date format validation",
    "Transform produced 10,234 clean records",
  ],
  nextStep: "load_to_destination",
  blockers: [],
};
### Long-term memory: persistent storage
For information that persists across conversations (user preferences, project context, learned patterns), use external storage accessed through skills. This avoids loading everything into context upfront.
import json
from datetime import datetime
from pathlib import Path


class ProjectMemory:
    """Persistent memory for project-specific context."""

    def __init__(self, storage_path: str):
        self.storage_path = storage_path

    async def remember(self, key: str, value: str, category: str = "general") -> None:
        """Store a fact for later retrieval."""
        memories = await self._load()
        memories[key] = {
            "value": value,
            "category": category,
            "stored_at": datetime.now().isoformat(),
        }
        await self._save(memories)

    async def recall(self, category: str | None = None, query: str | None = None) -> list[dict]:
        """Retrieve relevant memories, optionally filtered."""
        memories = await self._load()
        results = []
        for key, entry in memories.items():
            if category and entry["category"] != category:
                continue
            if query and query.lower() not in entry["value"].lower():
                continue
            results.append({"key": key, **entry})
        return results

    async def _load(self) -> dict:
        """Read the JSON store; an empty dict if it doesn't exist yet."""
        path = Path(self.storage_path)
        return json.loads(path.read_text()) if path.exists() else {}

    async def _save(self, memories: dict) -> None:
        """Write the JSON store back to disk."""
        Path(self.storage_path).write_text(json.dumps(memories, indent=2))
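Recalled entries still have to enter context compactly at session start. A sketch of rendering them into a short prompt block (the field names match the ProjectMemory entries above; the output format itself is an assumption):

```python
def render_memories(memories: list[dict], max_entries: int = 10) -> str:
    """Render recalled memories as a compact block for the system prompt."""
    if not memories:
        return ""
    # Most recently stored facts first; ISO timestamps sort lexically
    recent = sorted(memories, key=lambda m: m["stored_at"], reverse=True)
    lines = [
        f"- [{m['category']}] {m['key']}: {m['value']}"
        for m in recent[:max_entries]
    ]
    return "Known project facts:\n" + "\n".join(lines)
```

Capping the entry count keeps long-term memory from quietly becoming another context hog.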
## Choosing the right memory strategy
| What you need | Strategy | Example |
|---|---|---|
| Current task state | Short-term (context) | Workflow progress, intermediate results |
| File contents being edited | Short-term with eviction | Keep only the files currently being modified |
| Project structure | Long-term, loaded on demand | Directory layout, tech stack, conventions |
| User preferences | Long-term, loaded at start | Coding style, preferred tools, common paths |
| Previous conversation outcomes | Long-term, searched when relevant | Past decisions, resolved issues |
## Context window recovery
When context is running low mid-task, skills need strategies to keep going.
### Pattern: context compression checkpoint
When a workflow detects it’s approaching context limits, it should compress its state before continuing.
function compressWorkflowState(ctx: WorkflowContext): WorkflowContext {
  // Replace detailed step results with summaries
  for (const [step, result] of ctx.stepResults) {
    if (typeof result === "object" && result !== null) {
      const r = result as { summary?: string };
      ctx.stepResults.set(step, {
        summary: r.summary ?? `Step ${step} completed successfully`,
        keyOutputs: extractKeyOutputs(result),
        // Drop raw data, keep only what downstream steps need
      });
    }
  }
  return ctx;
}
This is where context management connects directly to error handling. Running out of context mid-workflow is a failure mode your skills should anticipate and handle gracefully, not one they should silently degrade through.
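The trigger side can be sketched in Python as well: estimate usage each turn and swap raw step payloads for summaries once a threshold is crossed. The ~4 characters per token ratio, the field names, and the threshold are all assumptions:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token."""
    return max(1, len(text) // 4)


def maybe_compress(step_results: dict[str, dict], window: int = 128_000,
                   threshold: float = 0.8) -> dict[str, dict]:
    """Replace raw step payloads with summaries when usage nears the window."""
    used = sum(estimate_tokens(str(r)) for r in step_results.values())
    if used < window * threshold:
        return step_results  # plenty of room, keep the raw data
    return {
        step: {"summary": r.get("summary", f"step {step} completed")}
        for step, r in step_results.items()
    }
```

Calling a check like this between workflow steps turns "context ran out" from a surprise into a planned checkpoint.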
Context is the most constrained resource your agent has, and most skills waste it without realizing it. The teams that build agents capable of handling large, multi-file tasks are the ones that treat every token like it costs something. Return summaries instead of raw data, load details only when you need them, and assume that context will run low before the job is done. The agent that plans for a tight budget outperforms the one that assumes infinite room every time.