
Context Management

Strategies for working within context window limits: summarization, selective loading, and memory patterns for agent skills.

You’re three steps into a codebase refactor when the agent starts repeating itself. It asks you a question you already answered two messages ago. It re-reads a file it summarized five minutes earlier. Then it loses track of which files it already modified and starts proposing changes that conflict with its own earlier work. The conversation hasn’t gotten confusing; the agent just ran out of room to think.

This is what happens when context management goes wrong. Every agent operates within a context window, a fixed amount of text it can consider at once, and when that fills up, the agent’s reasoning degrades fast. These strategies apply whether you’re designing individual skills or orchestrating multi-step workflows.

Understanding the context budget

Before you can optimize, you need to know where your context is going. A typical agent session breaks down roughly like this:

| Component | Typical size | Notes |
| --- | --- | --- |
| System prompt | 500-2,000 tokens | Instructions, personality, constraints |
| Tool definitions | 1,000-5,000 tokens | Scales with the number of available tools |
| Conversation history | 2,000-50,000 tokens | Grows with each turn |
| Tool results | 500-20,000+ tokens per call | Largest variable; a single file read can be huge |
| Agent reasoning | 1,000-5,000 tokens per turn | Chain-of-thought, planning |

The biggest offender is almost always tool results. A single file read can consume thousands of tokens. A search across a codebase might return hundreds of matches. Without careful management, a few tool calls can eat your entire budget.
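A quick way to keep tabs on that budget is a rough token estimate. The 4-characters-per-token figure below is a common rule of thumb for English text, not an exact count; real tokenizers vary, so treat this as a budgeting heuristic:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def fits_in_budget(tool_results: list[str], budget: int = 8_000) -> bool:
    """Check whether a batch of tool results fits within a token budget."""
    return sum(estimate_tokens(r) for r in tool_results) <= budget
```

Even a crude check like this lets a skill decide up front whether to return raw results or fall back to a summary.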

Summarization and compression strategies

The core idea is simple: don’t keep raw data in context when a summary will do.

Summarize tool results immediately

When a skill returns large results, summarize them before they enter the agent’s working memory. This can happen at the skill level (the skill itself returns a summary) or at the orchestration level (a post-processing step compresses the output).

import { promises as fs } from "node:fs";

// ToolResult, ReadOptions, detectLanguage, extractExports, and
// extractImports are assumed to be defined elsewhere in the skill.

// Bad: returning raw file contents into context
async function readFile(path: string): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  return { content }; // Could be 10,000+ tokens
}

// Better: return with metadata that helps the agent decide what to keep
async function readFile(
  path: string,
  options?: ReadOptions,
): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  const lines = content.split("\n");

  if (options?.summaryOnly) {
    return {
      path,
      lineCount: lines.length,
      language: detectLanguage(path),
      exports: extractExports(content),
      imports: extractImports(content),
      summary: `${lines.length} lines of ${detectLanguage(path)}. Key exports: ${extractExports(content).join(", ")}`,
    };
  }

  // If full content requested but file is large, truncate with guidance
  if (lines.length > 200) {
    return {
      path,
      content: lines.slice(0, 200).join("\n"),
      truncated: true,
      totalLines: lines.length,
      message:
        "File truncated at 200 lines. Use offset parameter to read specific sections.",
    };
  }

  return { path, content, truncated: false };
}

Progressive detail loading

Start with high-level summaries and drill down only where needed. This is the single most effective strategy for context management.

async def explore_codebase(path: str) -> dict:
    """Level 1: Directory structure overview."""
    tree = await invoke("list_directory", path=path, recursive=True, depth=2)
    return {
        "structure": tree,
        "file_count": count_files(tree),
        "languages": detect_languages(tree),
        "hint": "Use read_file_summary for details on specific files.",
    }

async def read_file_summary(path: str) -> dict:
    """Level 2: File-level summary without full content."""
    content = await read_file(path)
    return {
        "path": path,
        "line_count": len(content.splitlines()),
        "functions": extract_function_signatures(content),
        "classes": extract_class_names(content),
        "imports": extract_imports(content),
        "hint": "Use read_file_section to read specific functions or line ranges.",
    }

async def read_file_section(path: str, start: int, end: int) -> dict:
    """Level 3: Specific section of a file (1-indexed, inclusive)."""
    lines = (await read_file(path)).splitlines()
    return {
        "path": path,
        "range": f"lines {start}-{end} of {len(lines)}",
        "content": "\n".join(lines[start - 1 : end]),
    }

This three-level approach (overview, summary, detail) lets the agent navigate a large codebase while keeping context usage proportional to what it actually needs.
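A back-of-envelope comparison shows why. The file size and the 4-characters-per-token ratio below are illustrative assumptions, not measurements:

```python
def estimate_tokens(text: str) -> int:
    # ~4 characters per token: a common budgeting heuristic.
    return max(1, len(text) // 4)

# Hypothetical 2,000-line file, ~40 characters per line.
full_file = ("x" * 40 + "\n") * 2000
summary = "2000 lines of python. Key exports: load, save, merge"
section = ("x" * 40 + "\n") * 40  # one 40-line function of interest

full_cost = estimate_tokens(full_file)
progressive_cost = estimate_tokens(summary) + estimate_tokens(section)
# Reading a summary plus one section costs a small fraction of a full read.
```

Under these assumptions the progressive path uses well under a tenth of the tokens of a full read, and the gap widens as files grow.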

Selective context loading

Not everything needs to be in context at once. Skills should load information on demand rather than preloading everything.

Pattern: lazy loading with caching

class ContextManager {
  private cache = new Map<string, { data: unknown; accessedAt: Date }>();
  private maxCacheSize: number;

  constructor(maxCacheSize = 20) {
    this.maxCacheSize = maxCacheSize;
  }

  async get(key: string, loader: () => Promise<unknown>): Promise<unknown> {
    if (this.cache.has(key)) {
      const entry = this.cache.get(key)!;
      entry.accessedAt = new Date();
      return entry.data;
    }

    // Evict least recently accessed if at capacity
    if (this.cache.size >= this.maxCacheSize) {
      this.evictLeastRecent();
    }

    const data = await loader();
    this.cache.set(key, { data, accessedAt: new Date() });
    return data;
  }

  private evictLeastRecent(): void {
    let oldestKey = "";
    let oldestTime = new Date();
    for (const [key, entry] of this.cache) {
      if (entry.accessedAt < oldestTime) {
        oldestTime = entry.accessedAt;
        oldestKey = key;
      }
    }
    if (oldestKey) this.cache.delete(oldestKey);
  }
}

Pattern: relevance-based filtering

When a search returns many results, filter by relevance before adding them to context. This is especially important for search skills that might match hundreds of files.

from dataclasses import dataclass


@dataclass
class SearchResult:
    path: str
    modified_days_ago: int


def filter_search_results(
    results: list[SearchResult],
    query_context: str,
    max_results: int = 10,
) -> list[SearchResult]:
    """Filter search results to the most relevant subset."""
    scored = []
    for result in results:
        score = 0
        # Exact filename match scores highest
        if query_context.lower() in result.path.lower():
            score += 10
        # Results in src/ are usually more relevant than node_modules/
        if "/src/" in result.path:
            score += 5
        if "node_modules" in result.path or "vendor" in result.path:
            score -= 20
        # More recent files are often more relevant
        if result.modified_days_ago < 7:
            score += 3

        scored.append((score, result))

    scored.sort(key=lambda x: x[0], reverse=True)
    return [result for _, result in scored[:max_results]]

Memory patterns: short-term vs. long-term

Agent skills need different memory strategies depending on how long the information needs to stick around. For a non-technical introduction to these concepts, Agent Memory Patterns explains how agents remember and forget in plain language.

Short-term memory: conversation context

Short-term memory lives in the current conversation. It’s fast and directly accessible, but ephemeral and limited by the context window. Most workflow state lives here.

Best practices for short-term memory:

  • Summarize completed steps rather than keeping full results
  • Drop intermediate results once downstream steps have consumed them
  • Use structured summaries the agent can quickly scan

// Instead of keeping all raw results:
const rawResults = {
  step1: {
    /* 2000 tokens of data */
  },
  step2: {
    /* 3000 tokens of data */
  },
  step3: {
    /* 1500 tokens of data */
  },
};

// Maintain a running summary:
const workingSummary = {
  completedSteps: ["fetch_data", "validate_schema", "transform"],
  keyFindings: [
    "Schema has 3 breaking changes in users table",
    "47 records failed date format validation",
    "Transform produced 10,234 clean records",
  ],
  nextStep: "load_to_destination",
  blockers: [],
};

Long-term memory: persistent storage

For information that persists across conversations (user preferences, project context, learned patterns), use external storage accessed through skills. This avoids loading everything into context upfront.

import json
import os
from datetime import datetime


class ProjectMemory:
    """Persistent memory for project-specific context."""

    def __init__(self, storage_path: str):
        self.storage_path = storage_path

    async def remember(self, key: str, value: str, category: str = "general") -> None:
        """Store a fact for later retrieval."""
        memories = await self._load()
        memories[key] = {
            "value": value,
            "category": category,
            "stored_at": datetime.now().isoformat(),
        }
        await self._save(memories)

    async def recall(self, category: str | None = None, query: str | None = None) -> list[dict]:
        """Retrieve relevant memories, optionally filtered."""
        memories = await self._load()
        results = []
        for key, entry in memories.items():
            if category and entry["category"] != category:
                continue
            if query and query.lower() not in entry["value"].lower():
                continue
            results.append({"key": key, **entry})
        return results

    async def _load(self) -> dict:
        """Read the memory store; a plain JSON file keeps this portable."""
        if not os.path.exists(self.storage_path):
            return {}
        with open(self.storage_path, "r", encoding="utf-8") as f:
            return json.load(f)

    async def _save(self, memories: dict) -> None:
        with open(self.storage_path, "w", encoding="utf-8") as f:
            json.dump(memories, f, indent=2)

Choosing the right memory strategy

| What you need | Strategy | Example |
| --- | --- | --- |
| Current task state | Short-term (context) | Workflow progress, intermediate results |
| File contents being edited | Short-term with eviction | Keep only the files currently being modified |
| Project structure | Long-term, loaded on demand | Directory layout, tech stack, conventions |
| User preferences | Long-term, loaded at start | Coding style, preferred tools, common paths |
| Previous conversation outcomes | Long-term, searched when relevant | Past decisions, resolved issues |
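The same decision can live in code so skills apply it consistently. A minimal sketch; the category keys and strategy names are hypothetical, not a real API:

```python
# Maps an information category to a storage strategy; names are
# illustrative and would vary by agent framework.
STRATEGY = {
    "task_state": "short_term",
    "open_files": "short_term_with_eviction",
    "project_structure": "long_term_on_demand",
    "user_preferences": "long_term_at_start",
    "past_outcomes": "long_term_searched",
}


def choose_strategy(need: str) -> str:
    """Default to short-term context for anything unclassified."""
    return STRATEGY.get(need, "short_term")
```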

Context window recovery

When context is running low mid-task, skills need strategies to keep going.

Pattern: context compression checkpoint

When a workflow detects it’s approaching context limits, it should compress its state before continuing.

function compressWorkflowState(ctx: WorkflowContext): WorkflowContext {
  // Replace detailed step results with summaries
  for (const [step, result] of ctx.stepResults) {
    if (typeof result === "object" && result !== null) {
      const r = result as { summary?: string };
      ctx.stepResults.set(step, {
        summary: r.summary ?? `Step ${step} completed successfully`,
        keyOutputs: extractKeyOutputs(result),
        // Drop raw data, keep only what downstream steps need
      });
    }
  }

  return ctx;
}

This is where context management connects directly to error handling. Running out of context mid-workflow is a failure mode your skills should anticipate and handle gracefully, not silently degrade through.
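One way to anticipate it is to check usage against a threshold before each step and compress proactively rather than waiting for a hard failure. A minimal Python sketch, assuming the caller tracks token usage and that state is a dict of step-result dicts (both hypothetical):

```python
def maybe_compress(
    used_tokens: int, limit: int, state: dict, threshold: float = 0.8
) -> dict:
    """Compress step results once usage crosses a fraction of the limit."""
    if used_tokens < threshold * limit:
        return state  # plenty of room; keep full results
    # Keep only a one-line summary per step, dropping raw data.
    return {
        step: {"summary": result.get("summary", f"{step} completed")}
        for step, result in state.items()
    }
```

The 80% default is a judgment call: compressing too early throws away detail downstream steps might need, while compressing too late leaves no room for the compression step itself to run.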

Context is the most constrained resource your agent has, and most skills waste it without realizing it. The teams that build agents capable of handling large, multi-file tasks are the ones that treat every token like it costs something. Return summaries instead of raw data, load details only when you need them, and assume that context will run low before the job is done. The agent that plans for a tight budget outperforms the one that assumes infinite room every time.