Observability for agents

Your agent did something weird and you have no idea why. Logging, tracing, and debugging patterns that make agent behavior understandable.


Your agent was supposed to search the docs and answer a question. Instead it searched the web, hallucinated an answer, and confidently told the user something wrong. The logs say “request completed successfully.” Now what?

This is the core problem with agents in production. Traditional software is deterministic: given the same input, you get the same output, and you can trace every step. Agents make decisions internally. Which tool to call, what parameters to use, how to interpret results, whether to try again or give up. Most of that decision-making is invisible by default. Your HTTP logs show a 200 OK, and somewhere behind that 200, your agent went off the rails.

I’ve debugged enough agent failures to have a strong opinion about this: if you can’t see what your agent decided at each step and why, you don’t have a production system. You have a demo with a prayer attached.

What you actually need to log

Most teams start by logging the final input and output. That’s like debugging a function by only looking at the arguments and the return value. It works for simple cases, but agents aren’t simple cases.

Here’s what I log for every agent step, and I mean every step:

Tool selection. Which tool did the model pick, and what was its reasoning? If your framework exposes the model’s chain-of-thought or tool-selection rationale, capture it. When your agent calls web_search instead of doc_search, you need to know why it made that choice.

Tool inputs. The exact parameters sent to the tool. Not a summary. The actual payload. When someone reports “it searched for the wrong thing,” you want to see the exact query string.

Tool outputs. What came back from the tool. For large responses, store the full output but log a truncated version (first 500 characters plus total length). You’ll need the full version during replay, but your log dashboard doesn’t need 50KB per entry.

Conversation context. The full message history at each step. This is expensive to store but critical for understanding why the model made a particular decision. The model’s behavior at step 5 depends entirely on what happened at steps 1 through 4.

Timing. How long each step took, broken into model inference time and tool execution time. A step that usually takes 200ms but suddenly takes 8 seconds is a signal, even if the output looks correct.

Token counts. Input and output tokens per step. A sudden spike in token usage usually means the context got bloated, often because a tool returned an unexpectedly large response that the model then had to process.

Structured logging format

Plain text logs are useless for agent debugging. You need structured data so you can filter, aggregate, and query. Here’s what a single agent step looks like in my logging format:

{
  "trace_id": "tr_8f2a1b3c",
  "span_id": "sp_004",
  "parent_span_id": "sp_003",
  "timestamp": "2026-03-15T14:32:01.847Z",
  "step_index": 3,
  "event": "tool_call",
  "tool": "doc_search",
  "input": {
    "query": "authentication setup guide",
    "collection": "product_docs",
    "limit": 5
  },
  "output_preview": "Found 5 results. Top result: 'Setting up SSO with SAML 2.0' (score: 0.89)...",
  "output_bytes": 12847,
  "duration_ms": 342,
  "model_reasoning": "User asked about authentication. Searching product docs first before trying web search.",
  "tokens": {
    "input": 2150,
    "output": 84
  },
  "context_messages": 7
}

Every field is queryable. When something goes wrong, you can filter by trace_id to see the full run, sort by duration_ms to find slow steps, or search model_reasoning to understand decision patterns across many runs.

The model_reasoning field deserves special attention. Not every framework gives you this, but if yours does (or if you can extract it from the model’s response before tool calls), it’s the single most valuable debugging signal you have. It tells you what the model was thinking, not just what it did.
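Because every entry is one JSON object, you can answer most debugging questions with a few lines of filtering before reaching for a log platform. Here's a minimal sketch, assuming the logs are written as newline-delimited JSON to a file (the path is illustrative; the field names match the format above):

import { readFileSync } from "node:fs";

// Hypothetical: logs written as one JSON object per line (NDJSON).
const lines = readFileSync("logs/agent.ndjson", "utf8").trim().split("\n");
const entries = lines.map((line) => JSON.parse(line));

// All steps from one run, in order.
const run = entries
  .filter((e) => e.trace_id === "tr_8f2a1b3c")
  .sort((a, b) => a.step_index - b.step_index);

// The slowest steps across every run.
const slowest = [...entries]
  .sort((a, b) => b.duration_ms - a.duration_ms)
  .slice(0, 10);

// What the model was thinking whenever it chose web_search over doc_search.
const webSearchReasoning = entries
  .filter((e) => e.tool === "web_search")
  .map((e) => e.model_reasoning);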

Tracing across multi-step runs

A single agent run might involve 3 steps or 30. You need to connect them into a coherent story. This is where distributed tracing patterns from the microservices world apply directly.

Each agent invocation gets a trace_id. Each step within that invocation gets a span_id. If a step triggers sub-steps (like a planning step that spawns multiple tool calls), the sub-steps reference a parent_span_id. This is just OpenTelemetry applied to agent flows.

If you’re already using OpenTelemetry in your stack, you can instrument your agent framework with it directly. Several agent frameworks have OTel plugins or built-in support. If you’re building from scratch, the pattern is straightforward:

import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-service");

async function runAgentStep(
  traceId: string,
  stepIndex: number,
  tool: string,
  input: Record<string, unknown>,
) {
  return tracer.startActiveSpan(
    `agent.step.${tool}`,
    { kind: SpanKind.INTERNAL },
    async (span) => {
      span.setAttribute("agent.trace_id", traceId);
      span.setAttribute("agent.step_index", stepIndex);
      span.setAttribute("agent.tool", tool);
      span.setAttribute("agent.tool_input", JSON.stringify(input));

      const startTime = performance.now();
      try {
        // executeTool is your own tool dispatcher; swap in whatever your
        // framework uses to run a tool call.
        const result = await executeTool(tool, input);

        span.setAttribute("agent.duration_ms", performance.now() - startTime);
        span.setAttribute(
          "agent.output_preview",
          JSON.stringify(result).slice(0, 500),
        );
        return result;
      } catch (err) {
        // Record the failure so the trace shows exactly which step broke.
        span.recordException(err as Error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    },
  );
}

The payoff comes when you visualize these traces. Tools like Jaeger or Grafana Tempo will show you a waterfall view of every step in an agent run, with timing, inputs, and outputs. You can spot the exact moment things went wrong instead of guessing.
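Getting spans into those tools is mostly SDK wiring. Here's a minimal sketch, assuming the Node SDK with the OTLP/HTTP exporter pointed at a local collector; the endpoint and service name are illustrative, and the exact configuration options depend on your OpenTelemetry SDK version:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Exports every span (including the agent.step.* spans above) to a local
// OpenTelemetry collector, which can forward them to Jaeger or Tempo.
const sdk = new NodeSDK({
  serviceName: "agent-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
});

sdk.start();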

Replay: reproducing failures without calling the LLM

Here’s a pattern that has saved me hours of debugging. If you log enough data, you can replay a failing agent interaction without making any LLM API calls.

The idea is simple: during the original run, you captured the model’s output at every step (which tool it chose, what parameters it used) and every tool’s response. To replay, you feed the same user input, stub out the model with the recorded decisions, and stub out the tools with the recorded responses. The agent walks through the exact same path it took in production.

This gives you a deterministic reproduction. You can set breakpoints, add more logging, inspect intermediate state. No flaky behavior from the model, no network calls, no API costs.

interface RecordedStep {
  tool: string;
  input: Record<string, unknown>;
  output: unknown;
  modelReasoning: string;
}

class ReplayAgent {
  private steps: RecordedStep[];
  private currentStep = 0;

  constructor(recordedTrace: RecordedStep[]) {
    this.steps = recordedTrace;
  }

  // Stands in for the model: returns the tool choice recorded at this step
  // instead of calling the LLM.
  async selectTool(): Promise<{
    tool: string;
    input: Record<string, unknown>;
  }> {
    const step = this.requireStep();
    return { tool: step.tool, input: step.input };
  }

  // Stands in for the tool: returns the recorded response and advances to
  // the next step instead of hitting the real service.
  async executeTool(
    tool: string,
    input: Record<string, unknown>,
  ): Promise<unknown> {
    const step = this.requireStep();
    this.currentStep++;
    return step.output;
  }

  private requireStep(): RecordedStep {
    const step = this.steps[this.currentStep];
    if (!step) {
      throw new Error(`Replay ran past the recorded trace at step ${this.currentStep}`);
    }
    return step;
  }
}
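Driving a replay is just a matter of loading the recorded steps and handing them to the stub. A minimal usage sketch, assuming you've exported one trace's steps as a JSON file from your log store (the file path and readFileSync call are illustrative):

import { readFileSync } from "node:fs";

// Hypothetical: one recorded trace exported from your log store.
const recorded: RecordedStep[] = JSON.parse(
  readFileSync("traces/tr_8f2a1b3c.json", "utf8"),
);

const replay = new ReplayAgent(recorded);

// Drive your normal agent loop, with the replay stubs standing in for the
// model and the real tools. No API calls, no network, fully deterministic.
const { tool, input } = await replay.selectTool();
const output = await replay.executeTool(tool, input);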

Store your recorded traces for at least 30 days. When a user reports “the agent gave me a wrong answer last Tuesday,” you can pull up the exact trace and replay it locally. Without replay data, you’re stuck trying to reproduce a non-deterministic bug with different model state and different tool outputs. Good luck with that.

Alerting: what to watch in production

Logging is useless if nobody looks at it. You need alerts that fire when agent behavior drifts outside normal bounds. Here’s what I monitor:

Tool failure rates. If your doc_search tool usually fails 2% of the time and suddenly it’s at 15%, something is broken. This is standard service monitoring, but I’ve seen teams forget to apply it to their tool layer.

Average step count per run. Track this as a rolling average. If your agent normally completes in 4-6 steps and starts averaging 12, it’s probably stuck in a retry loop or going down unproductive paths. A sudden increase in step count is one of the earliest signals that something has changed: maybe a tool’s output format shifted, or the system prompt was edited.

Latency per step. Both model inference time and tool execution time. Model latency spikes might mean your context is too large. Tool latency spikes point to downstream service issues.

Token usage trends. Plot total tokens per run over time. A gradual upward trend means your prompts or contexts are growing. A sudden spike means something dumped a huge payload into the context. Either way, you’ll want to investigate before your API bill surprises you.

Fallback rates. How often does the agent fall back to a secondary tool or an “I don’t know” response? A rising fallback rate means your primary tools are becoming less effective, possibly because the data they search over has changed or grown.

Set alerts on all of these with thresholds based on your baseline. I prefer anomaly detection over fixed thresholds because agent behavior is inherently variable. A 50% increase over the 7-day rolling average is a good starting point.
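That rule is simple enough to sketch directly. The metric names here are illustrative; the same check works for step counts, token totals, or fallback rates:

// Flag a metric when today's value is more than 50% above its 7-day rolling
// average (factor 1.5). One aggregated value per day, oldest first.
function isAnomalous(
  last7Days: number[],
  today: number,
  factor = 1.5,
): boolean {
  if (last7Days.length === 0) return false;
  const avg = last7Days.reduce((sum, v) => sum + v, 0) / last7Days.length;
  return today > avg * factor;
}

// e.g. isAnomalous(avgStepsPerRunByDay, todaysAvgStepsPerRun) -> fire an alert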

A minimal logging wrapper

Here’s a practical TypeScript wrapper that captures all of this without requiring you to rewrite your agent framework:

interface AgentLog {
  traceId: string;
  spanId: string;
  stepIndex: number;
  timestamp: string;
  tool: string;
  input: Record<string, unknown>;
  outputPreview: string;
  outputBytes: number;
  durationMs: number;
  tokens: { input: number; output: number };
  modelReasoning?: string;
  error?: string;
}

function createAgentLogger(traceId: string) {
  let stepIndex = 0;

  return {
    // Wrap a single tool call: log input, output preview, timing, and tokens
    // as one structured JSON line, whether the call succeeds or fails.
    async wrapToolCall<T>(
      tool: string,
      input: Record<string, unknown>,
      fn: () => Promise<T>,
      metadata?: {
        reasoning?: string;
        tokens?: { input: number; output: number };
      },
    ): Promise<T> {
      const spanId = `sp_${String(stepIndex).padStart(3, "0")}`;
      const start = performance.now();

      try {
        const result = await fn();
        const output = JSON.stringify(result);

        const log: AgentLog = {
          traceId,
          spanId,
          stepIndex: stepIndex++,
          timestamp: new Date().toISOString(),
          tool,
          input,
          outputPreview: output.slice(0, 500),
          outputBytes: output.length,
          durationMs: Math.round(performance.now() - start),
          tokens: metadata?.tokens ?? { input: 0, output: 0 },
          modelReasoning: metadata?.reasoning,
        };

        console.log(JSON.stringify(log));
        return result;
      } catch (err) {
        // Log the failure with the same trace and span fields, then rethrow
        // so the caller's own error handling still runs.
        const log: AgentLog = {
          traceId,
          spanId,
          stepIndex: stepIndex++,
          timestamp: new Date().toISOString(),
          tool,
          input,
          outputPreview: "",
          outputBytes: 0,
          durationMs: Math.round(performance.now() - start),
          tokens: metadata?.tokens ?? { input: 0, output: 0 },
          error: err instanceof Error ? err.message : String(err),
        };

        console.log(JSON.stringify(log));
        throw err;
      }
    },
  };
}

Usage looks like this:

const logger = createAgentLogger("tr_" + crypto.randomUUID().slice(0, 8));

const results = await logger.wrapToolCall(
  "doc_search",
  { query: "authentication setup", limit: 5 },
  () => docSearch({ query: "authentication setup", limit: 5 }),
  { reasoning: "User asked about auth, searching docs first" },
);

Every tool call gets logged with full context. When something breaks, you have the data to figure out why. When nothing breaks, you have the data to understand performance trends and optimize.

Where to go from here

Start by adding structured logging to your agent’s tool calls. That alone will save you hours the first time something goes wrong. Then add tracing so you can see multi-step runs as a connected story. Replay comes third, once you have enough logged data to make it work.

For the orchestration patterns that produce these multi-step runs in the first place, see agent orchestration patterns, which covers fan-out/fan-in, pipelines, and supervisor agents; each generates a distinct tracing shape worth designing for in advance.

If you’re still building your agent and haven’t thought about error handling yet, read error handling patterns before you add observability. Logging errors is only useful if your agent actually handles them in a recoverable way. For general debugging strategies during development, see testing and debugging agents. And if you’re running agents in a DevOps context where observability matters even more, check out agents for DevOps and the incident triage skill that puts these patterns to work in an on-call workflow.