
Error Handling Patterns

How to build agent skills that handle failures gracefully: retry strategies, fallbacks, partial completion, and informative error responses.

Every skill will fail eventually. APIs go down. Files get deleted. Network connections drop. Rate limits kick in. The difference between a fragile agent and a reliable one isn’t whether failures happen. It’s how the agent responds when they do.

This article covers the essential error handling patterns for agent skills: retry strategies, fallback alternatives, user-facing error messages, and partial completion recovery. These patterns are the foundation for building multi-step workflows that can survive real-world conditions.

Error categories

Before building error handling, it helps to categorize the kinds of failures your skills will hit. Different failure modes need different responses.

Category     | Examples                                              | Typical response
Transient    | Network timeout, rate limit, temporary service outage | Retry with backoff
Input error  | Invalid file path, malformed query, missing parameter | Return clear error, suggest correction
Permission   | Access denied, authentication expired                 | Escalate to user
Resource     | File not found, database unavailable, disk full       | Fallback or escalate
Logic        | Unexpected data format, conflicting state             | Log and escalate
Catastrophic | Out of memory, unhandled exception                    | Fail safely, preserve state

The first step in any error handling strategy is classifying the error so you can route it to the right handler.

enum ErrorCategory {
  TRANSIENT = "transient",
  INPUT = "input",
  PERMISSION = "permission",
  RESOURCE = "resource",
  LOGIC = "logic",
  CATASTROPHIC = "catastrophic",
}

function categorizeError(error: Error): ErrorCategory {
  if (
    error.message.includes("ECONNRESET") ||
    error.message.includes("timeout")
  ) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("rate limit") || error.message.includes("429")) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("ENOENT") || error.message.includes("not found")) {
    return ErrorCategory.RESOURCE;
  }
  if (error.message.includes("EACCES") || error.message.includes("403")) {
    return ErrorCategory.PERMISSION;
  }
  if (
    error.message.includes("invalid") ||
    error.message.includes("malformed")
  ) {
    return ErrorCategory.INPUT;
  }
  return ErrorCategory.LOGIC;
}
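The Python retry example later in this article calls a `categorize_error` counterpart that is not shown. A minimal sketch, mirroring the TypeScript version above:

```python
def categorize_error(error: Exception) -> str:
    """Map an exception to a category string, mirroring categorizeError above."""
    message = str(error).lower()
    if "econnreset" in message or "timeout" in message:
        return "transient"
    if "rate limit" in message or "429" in message:
        return "transient"
    if "enoent" in message or "not found" in message:
        return "resource"
    if "eacces" in message or "403" in message:
        return "permission"
    if "invalid" in message or "malformed" in message:
        return "input"
    return "logic"
```

Matching on message substrings is fragile; in a real skill you would prefer matching on exception types or structured error codes where the underlying library provides them.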

Retry strategies with backoff

Transient errors are the most common failure mode, and retries are the first line of defense. But naive retries (immediately hitting the same request again) often make things worse. They can overwhelm a struggling service or burn through rate limits faster.

Exponential backoff with jitter

The standard approach is exponential backoff: wait 1 second, then 2, then 4, then 8. Adding jitter (randomness) prevents multiple clients from retrying at exactly the same time, which would create a thundering herd.

interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
}

// Promise-based sleep helper used by the retry loop below
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = {
    maxAttempts: 3,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    jitter: true,
  },
): Promise<T> {
  let lastError: Error;

  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      // Only retry transient errors
      if (categorizeError(lastError) !== ErrorCategory.TRANSIENT) {
        throw lastError;
      }

      if (attempt === config.maxAttempts) {
        throw lastError;
      }

      // Calculate delay with exponential backoff
      let delay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt - 1),
        config.maxDelayMs,
      );

      // Add jitter: random value between 0 and the calculated delay
      if (config.jitter) {
        delay = Math.random() * delay;
      }

      await sleep(delay);
    }
  }

  throw lastError!;
}

The same pattern in Python:

import asyncio
import random

async def with_retry(
    fn,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
):
    last_error = None

    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception as e:
            last_error = e

            if categorize_error(e) != "transient":
                raise

            if attempt == max_attempts:
                raise

            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            if jitter:
                delay = random.uniform(0, delay)

            await asyncio.sleep(delay)

    raise last_error

When to stop retrying

Retries shouldn’t go on forever. Define clear stopping conditions:

  • Maximum attempts reached. Typically 3-5 attempts for API calls.
  • Total elapsed time exceeded. Don’t retry for 10 minutes if the user expects a response in seconds.
  • Error category changed. If a transient error becomes a permission error, stop retrying.
  • Context budget running low. Each retry consumes agent context; see Context Management.

After exhausting retries, escalate to the user with a clear explanation rather than failing silently. This connects to the escalation patterns in Human-in-the-Loop.
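The elapsed-time cap can be folded directly into the retry helper. A sketch of one way to do it (`total_budget_s` is an illustrative parameter, not part of the earlier config):

```python
import asyncio
import random
import time


async def with_retry_budget(
    fn,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    total_budget_s: float = 20.0,
):
    """Retry with full-jitter backoff, stopping when either the attempt
    count or the total time budget is exhausted."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception:
            delay = random.uniform(0, min(base_delay * 2 ** (attempt - 1), max_delay))
            # Give up if this was the last attempt or sleeping would blow the budget
            if attempt == max_attempts or time.monotonic() - start + delay > total_budget_s:
                raise
            await asyncio.sleep(delay)
```

Checking the budget before sleeping, rather than after, avoids a final wasted wait when the deadline is already unreachable.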

Fallback skill alternatives

When the primary approach fails, a well-designed agent has alternative strategies. This is the skill equivalent of graceful degradation.

Pattern: ordered fallback chain

async function searchCode(query: string): Promise<SearchResult> {
  const strategies = [
    {
      name: "ripgrep_search",
      fn: () => invoke("ripgrep", { pattern: query }),
      when: "Ripgrep is available and fastest",
    },
    {
      name: "grep_search",
      fn: () => invoke("grep", { pattern: query, recursive: true }),
      when: "Fallback when ripgrep is not installed",
    },
    {
      name: "manual_search",
      fn: () => invoke("read_and_search", { pattern: query }),
      when: "Last resort: read files one by one and search in memory",
    },
  ];

  for (const strategy of strategies) {
    try {
      const result = await strategy.fn();
      return {
        ...result,
        strategy: strategy.name,
      };
    } catch (error) {
      // Log which strategy failed and why, then try the next one
      console.log(`Strategy ${strategy.name} failed: ${error}`);
      continue;
    }
  }

  return {
    success: false,
    error: "All search strategies failed",
    strategiesAttempted: strategies.map((s) => s.name),
  };
}

Pattern: degraded results

Sometimes you can return a partial or lower-quality result instead of failing entirely.

async def get_file_info(path: str) -> dict:
    """Get file information with graceful degradation."""
    result = {"path": path}

    # Try to get full metadata
    try:
        stat = await invoke("file_stat", path=path)
        result["size"] = stat.size
        result["modified"] = stat.modified
        result["permissions"] = stat.permissions
    except Exception:
        # Fall back to just checking existence
        result["exists"] = await invoke("file_exists", path=path)
        result["metadata_available"] = False

    # Try to detect language
    try:
        result["language"] = detect_language(path)
    except Exception:
        result["language"] = "unknown"

    # Try to get line count (requires reading the file)
    try:
        content = await invoke("read_file", path=path)
        result["line_count"] = len(content.splitlines())
    except Exception:
        result["line_count"] = None

    return result

This approach returns whatever information it can gather, even if some parts fail. The caller gets a useful (if incomplete) result instead of an error.

User-facing error messages

When an error does reach the user, the message should be helpful, not cryptic. The agent is the go-between for your skill and the human, so your error messages need to give the agent enough to explain the problem and suggest next steps.

Anatomy of a good error response

interface SkillError {
  /** What went wrong, in plain language */
  message: string;
  /** The error category for routing */
  category: ErrorCategory;
  /** What the agent (or user) can do about it */
  suggestions: string[];
  /** Whether retrying might help */
  retryable: boolean;
  /** Technical details for debugging (optional) */
  details?: Record<string, unknown>;
}

Bad error response:

{
  "error": "ENOENT: no such file or directory, open '/src/confg.ts'"
}

Good error response:

{
  "message": "File not found: /src/confg.ts",
  "category": "resource",
  "suggestions": [
    "Check if the filename is spelled correctly (did you mean 'config.ts'?)",
    "Use the search_files skill to find files matching 'config'"
  ],
  "retryable": false,
  "details": {
    "path": "/src/confg.ts",
    "similar_files": ["/src/config.ts", "/src/config.json"]
  }
}

The good response tells the agent exactly what went wrong, offers concrete next steps, and even suggests a likely correction. That’s the difference between an agent that gets stuck on errors and one that recovers on its own.
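The workflow code later in this article passes raw exceptions through a `buildSkillError` helper that is not shown. A minimal Python sketch of such a builder, with a deliberately crude transient check standing in for full categorization:

```python
def build_skill_error(error: Exception) -> dict:
    """Convert a raw exception into the structured error shape described above.

    Sketch only: a real implementation would route through proper error
    categorization instead of this substring check.
    """
    message = str(error)
    transient = "timeout" in message.lower() or "429" in message
    return {
        "message": message,
        "category": "transient" if transient else "logic",
        "suggestions": (
            ["Retry the operation; the failure looks transient."]
            if transient
            else ["Escalate to the user with the details below."]
        ),
        "retryable": transient,
        "details": {"exception_type": type(error).__name__},
    }
```

The point of the builder is that every skill error reaches the agent in the same shape, so the agent never has to parse raw exception strings itself.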

Including similar matches

One of the most useful things an error response can include is a guess at what the user probably meant. This is especially helpful for file paths, command names, and other inputs where typos are common.

import os
from difflib import get_close_matches

def file_not_found_error(requested_path: str, available_files: list[str]) -> dict:
    """Build a helpful error when a file is not found."""
    filename = os.path.basename(requested_path)
    available_names = [os.path.basename(f) for f in available_files]
    similar = get_close_matches(filename, available_names, n=3, cutoff=0.6)

    suggestions = []
    if similar:
        matching_paths = [f for f in available_files if os.path.basename(f) in similar]
        suggestions.append(f"Did you mean one of these? {', '.join(matching_paths)}")
    suggestions.append("Use search_files to find the file by pattern.")

    return {
        "message": f"File not found: {requested_path}",
        "category": "resource",
        "suggestions": suggestions,
        "retryable": False,
        "details": {"similar_files": similar},
    }
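`difflib.get_close_matches` does the fuzzy matching here. A quick standalone look at its behavior (the 0.6 cutoff filters out weak matches like unrelated filenames):

```python
from difflib import get_close_matches

names = ["config.ts", "config.json", "main.ts", "utils.ts"]

# 'confg.ts' is one edit away from 'config.ts', which scores well above
# the cutoff; 'main.ts' only shares the extension and falls below it.
matches = get_close_matches("confg.ts", names, n=3, cutoff=0.6)
```

Raising the cutoff makes suggestions stricter; lowering it surfaces more candidates at the cost of occasional irrelevant ones.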

Partial completion and recovery

In multi-step workflows, a failure in step 3 shouldn’t throw away the work done in steps 1 and 2. Partial completion preserves progress and lets the agent (or user) decide how to proceed.

Pattern: step-level error isolation

interface StepOutcome {
  step: string;
  status: "success" | "failed" | "skipped";
  result?: unknown;
  error?: SkillError;
}

async function executeWorkflowSteps(
  steps: WorkflowStep[],
  ctx: WorkflowContext,
): Promise<StepOutcome[]> {
  const outcomes: StepOutcome[] = [];

  for (const step of steps) {
    // Find the first dependency that did not succeed, if any
    const failedDep = step.dependencies.find(
      (dep) => !outcomes.some((o) => o.step === dep && o.status === "success"),
    );

    if (failedDep) {
      outcomes.push({
        step: step.name,
        status: "skipped",
        error: {
          message: `Skipped: dependency ${failedDep} did not succeed`,
          category: ErrorCategory.LOGIC,
          suggestions: ["Fix the failed dependency and retry the workflow"],
          retryable: true,
        },
      });
      continue;
    }

    try {
      const result = await withRetry(() => step.execute(ctx));
      outcomes.push({ step: step.name, status: "success", result });
    } catch (error) {
      outcomes.push({
        step: step.name,
        status: "failed",
        error: buildSkillError(error as Error),
      });

      // If this step is critical, stop the workflow
      if (step.critical) {
        break;
      }
      // Otherwise, continue with remaining steps
    }
  }

  return outcomes;
}

This pattern lets non-critical steps fail without stopping everything. A code review workflow might continue even if the complexity analysis step fails. The lint results and test coverage are still worth having on their own.

Reporting partial results

When a workflow completes partially, the skill should clearly communicate what succeeded, what failed, and what got skipped.

def format_partial_result(outcomes: list[StepOutcome]) -> dict:
    succeeded = [o for o in outcomes if o.status == "success"]
    failed = [o for o in outcomes if o.status == "failed"]
    skipped = [o for o in outcomes if o.status == "skipped"]

    return {
        "completed": len(succeeded) == len(outcomes),
        "summary": (
            f"{len(succeeded)} of {len(outcomes)} steps completed. "
            f"{len(failed)} failed, {len(skipped)} skipped."
        ),
        "succeeded": [o.step for o in succeeded],
        "failed": [
            {"step": o.step, "error": o.error.message}
            for o in failed
        ],
        "skipped": [o.step for o in skipped],
        "suggestions": _generate_recovery_suggestions(failed),
    }

Key takeaways

  1. Categorize errors first, then route them. Transient errors get retries. Input errors get correction suggestions. Permission errors get escalated.

  2. Use exponential backoff with jitter for retries. Naive immediate retries make things worse. Always cap the number of attempts and total elapsed time.

  3. Build fallback chains. When the primary approach fails, have alternative strategies ready. Return degraded results rather than nothing.

  4. Make error messages actionable. Tell the agent what went wrong, whether it should retry, and what alternatives exist. Include similar matches for likely typos.

  5. Preserve partial progress. In multi-step workflows, isolate failures to individual steps. Report what succeeded alongside what failed so the agent can make an informed decision about how to proceed.

  6. Know when to stop. After exhausting retries and fallbacks, escalate to the user with a clear explanation. Endless automatic recovery wastes context and time.