
Error handling patterns for AI agents

How to build agent skills that handle failures gracefully: retry strategies, fallbacks, partial completion, and informative error responses.

The agent tries to read a config file that was deleted ten minutes ago. It gets back a raw “ENOENT” error and has no idea what to do next. So it tries again. Same error. It tries a third time because why not. Then it tells the user “something went wrong” and gives up. Meanwhile, the file it needed was just renamed, and a one-line suggestion in the error response (“did you mean config.yaml?”) would have fixed everything on the first attempt.

This is the difference between an agent that recovers from problems and one that gets stuck on them. This article covers the essential error handling patterns for agent skills: retry strategies, fallback alternatives, user-facing error messages, and partial completion recovery. These patterns are the foundation for building multi-step workflows that can survive real-world conditions.

Error categories

Before building error handling, it helps to categorize the kinds of failures your skills will hit. Different failure modes need different responses.

| Category | Examples | Typical response |
| --- | --- | --- |
| Transient | Network timeout, rate limit, temporary service outage | Retry with backoff |
| Input error | Invalid file path, malformed query, missing parameter | Return clear error, suggest correction |
| Permission | Access denied, authentication expired | Escalate to user |
| Resource | File not found, database unavailable, disk full | Fallback or escalate |
| Logic | Unexpected data format, conflicting state | Log and escalate |
| Catastrophic | Out of memory, unhandled exception | Fail safely, preserve state |

The first step in any error handling strategy is classifying the error so you can route it to the right handler.

enum ErrorCategory {
  TRANSIENT = "transient",
  INPUT = "input",
  PERMISSION = "permission",
  RESOURCE = "resource",
  LOGIC = "logic",
  CATASTROPHIC = "catastrophic",
}

function categorizeError(error: Error): ErrorCategory {
  if (
    error.message.includes("ECONNRESET") ||
    error.message.includes("timeout")
  ) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("rate limit") || error.message.includes("429")) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("ENOENT") || error.message.includes("not found")) {
    return ErrorCategory.RESOURCE;
  }
  if (error.message.includes("EACCES") || error.message.includes("403")) {
    return ErrorCategory.PERMISSION;
  }
  if (
    error.message.includes("invalid") ||
    error.message.includes("malformed")
  ) {
    return ErrorCategory.INPUT;
  }
  return ErrorCategory.LOGIC;
}

Retry strategies with backoff

Transient errors are the most common failure mode, and retries are the first line of defense. But naive retries (immediately hitting the same request again) often make things worse. They can overwhelm a struggling service or burn through rate limits faster.

Exponential backoff with jitter

The standard approach is exponential backoff: wait 1 second, then 2, then 4, then 8. Adding jitter (randomness) prevents multiple clients from retrying at exactly the same time, which would create a thundering herd.

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = {
    maxAttempts: 3,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    jitter: true,
  },
): Promise<T> {
  let lastError: Error;

  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      // Only retry transient errors
      if (categorizeError(lastError) !== ErrorCategory.TRANSIENT) {
        throw lastError;
      }

      if (attempt === config.maxAttempts) {
        throw lastError;
      }

      // Calculate delay with exponential backoff
      let delay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt - 1),
        config.maxDelayMs,
      );

      // Add jitter: random value between 0 and the calculated delay
      if (config.jitter) {
        delay = Math.random() * delay;
      }

      await sleep(delay);
    }
  }

  throw lastError!;
}

The same pattern in Python:

import asyncio
import random


def categorize_error(err: BaseException) -> str:
    """Simplified classifier mirroring the TypeScript example.

    Only the "transient" label affects retry behavior.
    """
    msg = str(err).lower()
    if any(s in msg for s in ("timeout", "econnreset", "rate limit", "429", "503")):
        return "transient"
    if any(s in msg for s in ("not found", "no such file", "404")):
        return "resource"
    if any(s in msg for s in ("permission", "forbidden", "401", "403")):
        return "permission"
    return "logic"


async def with_retry(
    fn,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
):
    last_error = None

    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception as e:
            last_error = e

            if categorize_error(e) != "transient":
                raise

            if attempt == max_attempts:
                raise

            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            if jitter:
                delay = random.uniform(0, delay)

            await asyncio.sleep(delay)

    raise last_error

When to stop retrying

Retries shouldn’t go on forever. Define clear stopping conditions:

  • Maximum attempts reached. Typically 3 to 5 for API calls.
  • Total elapsed time exceeded. Don’t retry for 10 minutes if the user expects a response in seconds.
  • Error category changed. If a transient error becomes a permission error, stop retrying.
  • Context budget running low. Each retry consumes agent context; see Context Management.

After exhausting retries, escalate to the user with a clear explanation rather than failing silently. This connects to the escalation patterns in Human-in-the-Loop.
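The elapsed-time condition can be folded directly into the retry loop. A minimal Python sketch; the `with_retry_deadline` name and `max_total_seconds` parameter are illustrative, and error categorization is omitted to keep the deadline logic visible:

```python
import asyncio
import random
import time


async def with_retry_deadline(fn, max_attempts=5, base_delay=1.0, max_total_seconds=15.0):
    """Retry with exponential backoff and jitter, but never past an overall deadline."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted
            # Full jitter: random delay between 0 and the capped exponential value
            delay = random.uniform(0, min(base_delay * 2 ** (attempt - 1), 30.0))
            # Stop early if sleeping would blow the total time budget
            if time.monotonic() - start + delay > max_total_seconds:
                raise
            await asyncio.sleep(delay)
```

Checking the budget before sleeping (rather than after) means the caller never waits out a delay that cannot possibly lead to a successful attempt within the deadline.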

Fallback skill alternatives

When the primary approach fails, a well-designed agent has alternative strategies. This is the skill equivalent of graceful degradation.

Pattern: ordered fallback chain

async function searchCode(query: string): Promise<SearchResult> {
  const strategies = [
    {
      name: "ripgrep_search",
      fn: () => invoke("ripgrep", { pattern: query }),
      when: "Ripgrep is available and fastest",
    },
    {
      name: "grep_search",
      fn: () => invoke("grep", { pattern: query, recursive: true }),
      when: "Fallback when ripgrep is not installed",
    },
    {
      name: "manual_search",
      fn: () => invoke("read_and_search", { pattern: query }),
      when: "Last resort: read files one by one and search in memory",
    },
  ];

  for (const strategy of strategies) {
    try {
      const result = await strategy.fn();
      return {
        ...result,
        strategy: strategy.name,
      };
    } catch (error) {
      // Log which strategy failed and why, then try the next one
      console.log(`Strategy ${strategy.name} failed: ${error}`);
      continue;
    }
  }

  return {
    success: false,
    error: "All search strategies failed",
    strategiesAttempted: strategies.map((s) => s.name),
  };
}

Pattern: degraded results

Sometimes you can return a partial or lower-quality result instead of failing entirely.

async def get_file_info(path: str) -> dict:
    """Get file information with graceful degradation."""
    result = {"path": path}

    # Try to get full metadata
    try:
        stat = await invoke("file_stat", path=path)
        result["size"] = stat.size
        result["modified"] = stat.modified
        result["permissions"] = stat.permissions
    except Exception:
        # Fall back to just checking existence
        result["exists"] = await invoke("file_exists", path=path)
        result["metadata_available"] = False

    # Try to detect language
    try:
        result["language"] = detect_language(path)
    except Exception:
        result["language"] = "unknown"

    # Try to get line count (requires reading the file)
    try:
        content = await invoke("read_file", path=path)
        result["line_count"] = len(content.splitlines())
    except Exception:
        result["line_count"] = None

    return result

This approach returns whatever information it can gather, even if some parts fail. The caller gets a useful (if incomplete) result instead of an error.

User-facing error messages

When an error does reach the user, the message should be helpful, not cryptic. The agent sits between your skill and the human, so your error messages need to give it enough context to explain the problem and suggest next steps.

Anatomy of a good error response

interface SkillError {
  /** What went wrong, in plain language */
  message: string;
  /** The error category for routing */
  category: ErrorCategory;
  /** What the agent (or user) can do about it */
  suggestion: string;
  /** Whether retrying might help */
  retryable: boolean;
  /** Technical details for debugging (optional) */
  details?: Record<string, unknown>;
}

Bad error response:

{
  "error": "ENOENT: no such file or directory, open '/src/confg.ts'"
}

Good error response:

{
  "message": "File not found: /src/confg.ts",
  "category": "resource",
  "suggestion": "Check if the filename is spelled correctly (did you mean 'config.ts'?). Use the search_files skill to find files matching 'config'.",
  "retryable": false,
  "details": {
    "path": "/src/confg.ts",
    "similar_files": ["/src/config.ts", "/src/config.json"]
  }
}

The good response tells the agent exactly what went wrong, offers concrete next steps, and even suggests a likely correction. That’s the difference between an agent that gets stuck on errors and one that recovers on its own.
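The mapping from a raw exception to this structured shape can be centralized in one helper. A minimal Python sketch, assuming the same substring-based classification as the earlier examples; the `build_skill_error` name and suggestion strings are illustrative:

```python
def build_skill_error(error: Exception) -> dict:
    """Convert a raw exception into the structured error response shape (illustrative)."""
    msg = str(error).lower()
    if any(s in msg for s in ("timeout", "econnreset", "429")):
        category, retryable = "transient", True
        suggestion = "Retry with backoff; the service may recover."
    elif any(s in msg for s in ("enoent", "not found")):
        category, retryable = "resource", False
        suggestion = "Verify the path exists, or search for a similar file."
    elif any(s in msg for s in ("eacces", "403", "permission")):
        category, retryable = "permission", False
        suggestion = "Ask the user to grant access or re-authenticate."
    else:
        category, retryable = "logic", False
        suggestion = "Escalate to the user with the details below."
    return {
        "message": str(error),
        "category": category,
        "suggestion": suggestion,
        "retryable": retryable,
        "details": {"type": type(error).__name__},
    }
```

Routing every caught exception through one helper like this keeps the category, suggestion, and retryable flag consistent across all of a skill's error paths.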

Including similar matches

One of the most useful things an error response can include is a guess at what the user probably meant. This is especially helpful for file paths, command names, and other inputs where typos are common.

import os
from difflib import get_close_matches

def file_not_found_error(requested_path: str, available_files: list[str]) -> dict:
    """Build a helpful error when a file is not found."""
    filename = os.path.basename(requested_path)
    available_names = [os.path.basename(f) for f in available_files]
    similar = get_close_matches(filename, available_names, n=3, cutoff=0.6)

    parts = []
    if similar:
        matching_paths = [f for f in available_files if os.path.basename(f) in similar]
        parts.append(f"Did you mean one of these? {', '.join(matching_paths)}.")
    parts.append("Use search_files to find the file by pattern.")

    return {
        "message": f"File not found: {requested_path}",
        "category": "resource",
        "suggestion": " ".join(parts),
        "retryable": False,
        "details": {"similar_files": similar},
    }

Partial completion and recovery

In multi-step workflows, a failure in step 3 shouldn’t throw away the work done in steps 1 and 2. Partial completion preserves progress and lets the agent (or user) decide how to proceed.

Pattern: step-level error isolation

interface StepOutcome {
  step: string;
  status: "success" | "failed" | "skipped";
  result?: unknown;
  error?: SkillError;
}

async function executeWorkflowSteps(
  steps: WorkflowStep[],
  ctx: WorkflowContext,
): Promise<StepOutcome[]> {
  const outcomes: StepOutcome[] = [];

  for (const step of steps) {
    // Check if this step's dependencies succeeded
    const dependenciesMet = step.dependencies.every((dep) =>
      outcomes.find((o) => o.step === dep && o.status === "success"),
    );

    if (!dependenciesMet) {
      outcomes.push({
        step: step.name,
        status: "skipped",
        error: {
          message: `Skipped: dependency ${step.dependencies.find((d) => !outcomes.find((o) => o.step === d && o.status === "success"))} did not succeed`,
          category: ErrorCategory.LOGIC,
          suggestion: "Fix the failed dependency and retry the workflow.",
          retryable: true,
        },
      });
      continue;
    }

    try {
      const result = await withRetry(() => step.execute(ctx));
      outcomes.push({ step: step.name, status: "success", result });
    } catch (error) {
      outcomes.push({
        step: step.name,
        status: "failed",
        error: buildSkillError(error as Error),
      });

      // If this step is critical, stop the workflow
      if (step.critical) {
        break;
      }
      // Otherwise, continue with remaining steps
    }
  }

  return outcomes;
}

This pattern lets non-critical steps fail without stopping everything. A code review workflow might continue even if the complexity analysis step fails. The lint results and test coverage are still worth having on their own.

Reporting partial results

When a workflow completes partially, the skill should clearly communicate what succeeded, what failed, and what got skipped.

def format_partial_result(outcomes: list[StepOutcome]) -> dict:
    succeeded = [o for o in outcomes if o.status == "success"]
    failed = [o for o in outcomes if o.status == "failed"]
    skipped = [o for o in outcomes if o.status == "skipped"]

    return {
        "completed": len(succeeded) == len(outcomes),
        "summary": (
            f"{len(succeeded)} of {len(outcomes)} steps completed. "
            f"{len(failed)} failed, {len(skipped)} skipped."
        ),
        "succeeded": [o.step for o in succeeded],
        "failed": [
            {"step": o.step, "error": o.error.message}
            for o in failed
        ],
        "skipped": [o.step for o in skipped],
        "suggestion": _generate_recovery_suggestion(failed),
    }

The pattern that runs through all of this is the same: errors are data, not dead ends. When your skill tells the agent what went wrong, whether it should retry, and what else it could try instead, the agent can reason its way through failures the way a good developer would. When your skill just throws an exception and hopes for the best, the agent flails. Build the recovery path into the error itself, and you’ll be surprised how rarely failures actually require human intervention.

For tracking how these error patterns behave in production, Observability for Agents covers the metrics, tracing, and alerting you need to catch failures before your users do. For how error handling composes at the orchestration level, agent orchestration patterns shows how fan-out/fan-in handles partial failures and how pipelines implement per-stage retry with backoff.