Error handling patterns for AI agents
How to build agent skills that handle failures gracefully: retry strategies, fallbacks, partial completion, and informative error responses.
On this page
- Error categories
- Retry strategies with backoff
- Exponential backoff with jitter
- When to stop retrying
- Fallback skill alternatives
- Pattern: ordered fallback chain
- Pattern: degraded results
- User-facing error messages
- Anatomy of a good error response
- Including similar matches
- Partial completion and recovery
- Pattern: step-level error isolation
- Reporting partial results
The agent tries to read a config file that was deleted ten minutes ago. It gets back a raw “ENOENT” error and has no idea what to do next. So it tries again. Same error. It tries a third time because why not. Then it tells the user “something went wrong” and gives up. Meanwhile, the file it needed was just renamed, and a one-line suggestion in the error response (“did you mean config.yaml?”) would have fixed everything on the first attempt.
This is the difference between an agent that recovers from problems and one that gets stuck on them. This article covers the essential error handling patterns for agent skills: retry strategies, fallback alternatives, user-facing error messages, and partial completion recovery. These patterns are the foundation for building multi-step workflows that can survive real-world conditions.
Error categories
Before building error handling, it helps to categorize the kinds of failures your skills will hit. Different failure modes need different responses.
| Category | Examples | Typical response |
|---|---|---|
| Transient | Network timeout, rate limit, temporary service outage | Retry with backoff |
| Input error | Invalid file path, malformed query, missing parameter | Return clear error, suggest correction |
| Permission | Access denied, authentication expired | Escalate to user |
| Resource | File not found, database unavailable, disk full | Fallback or escalate |
| Logic | Unexpected data format, conflicting state | Log and escalate |
| Catastrophic | Out of memory, unhandled exception | Fail safely, preserve state |
The first step in any error handling strategy is classifying the error so you can route it to the right handler.
```typescript
enum ErrorCategory {
  TRANSIENT = "transient",
  INPUT = "input",
  PERMISSION = "permission",
  RESOURCE = "resource",
  LOGIC = "logic",
  CATASTROPHIC = "catastrophic",
}

function categorizeError(error: Error): ErrorCategory {
  if (
    error.message.includes("ECONNRESET") ||
    error.message.includes("timeout")
  ) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("rate limit") || error.message.includes("429")) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("ENOENT") || error.message.includes("not found")) {
    return ErrorCategory.RESOURCE;
  }
  if (error.message.includes("EACCES") || error.message.includes("403")) {
    return ErrorCategory.PERMISSION;
  }
  if (
    error.message.includes("invalid") ||
    error.message.includes("malformed")
  ) {
    return ErrorCategory.INPUT;
  }
  return ErrorCategory.LOGIC;
}
```
Retry strategies with backoff
Transient errors are the most common failure mode, and retries are the first line of defense. But naive retries (immediately hitting the same request again) often make things worse. They can overwhelm a struggling service or burn through rate limits faster.
Exponential backoff with jitter
The standard approach is exponential backoff: wait 1 second, then 2, then 4, then 8. Adding jitter (randomness) prevents multiple clients from retrying at exactly the same time, which would create a thundering herd.
```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = {
    maxAttempts: 3,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    jitter: true,
  },
): Promise<T> {
  let lastError: Error;
  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      // Only retry transient errors
      if (categorizeError(lastError) !== ErrorCategory.TRANSIENT) {
        throw lastError;
      }
      if (attempt === config.maxAttempts) {
        throw lastError;
      }
      // Calculate delay with exponential backoff
      let delay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt - 1),
        config.maxDelayMs,
      );
      // Add jitter: random value between 0 and the calculated delay
      if (config.jitter) {
        delay = Math.random() * delay;
      }
      await sleep(delay);
    }
  }
  throw lastError!;
}
```
The same pattern in Python:
```python
import asyncio
import random


def categorize_error(err: BaseException) -> str:
    """Simplified categorizer: the retry loop below only needs to know
    whether an error is transient."""
    msg = str(err).lower()
    if any(s in msg for s in ("timeout", "econnreset", "rate limit", "429", "503")):
        return "transient"
    if any(s in msg for s in ("not found", "no such file", "404")):
        return "not_found"
    if any(s in msg for s in ("permission", "forbidden", "401", "403")):
        return "permission"
    return "permanent"


async def with_retry(
    fn,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception as e:
            last_error = e
            if categorize_error(e) != "transient":
                raise
            if attempt == max_attempts:
                raise
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            if jitter:
                delay = random.uniform(0, delay)
            await asyncio.sleep(delay)
    raise last_error
```
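To sanity-check the schedule that the delay formula above produces, here is its deterministic part in isolation (a small sketch; the `backoff_delay` name is just for illustration):

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """The deterministic part of the backoff: base * 2^(attempt - 1), capped."""
    return min(base * (2 ** (attempt - 1)), cap)

# Delays for the default config, before jitter is applied:
print([backoff_delay(a) for a in range(1, 7)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With jitter enabled, each actual wait is then drawn uniformly from [0, delay], so concurrent clients spread out instead of retrying in lockstep.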
When to stop retrying
Retries shouldn’t go on forever. Define clear stopping conditions:
- Maximum attempts reached. Typically 3-5 attempts for API calls.
- Total elapsed time exceeded. Don’t retry for 10 minutes if the user expects a response in seconds.
- Error category changed. If a transient error becomes a permission error, stop retrying.
- Context budget running low. Each retry consumes agent context; see Context Management.
After exhausting retries, escalate to the user with a clear explanation rather than failing silently. This connects to the escalation patterns in Human-in-the-Loop.
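One way to combine the attempt cap with an elapsed-time budget is to check the deadline before each sleep. A minimal synchronous sketch, with illustrative parameter names and a simulated flaky call (not part of any particular framework):

```python
import time


def retry_with_deadline(fn, max_attempts: int = 5, deadline_s: float = 2.0,
                        base_delay: float = 0.05):
    """Retry fn with exponential backoff, but give up once the next sleep
    would push total elapsed time past the deadline."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            # Stop early rather than blow the caller's time budget
            if time.monotonic() - start + delay > deadline_s:
                raise
            time.sleep(delay)


# Example: a call that fails twice with a transient error, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

print(retry_with_deadline(flaky))  # "ok" on the third attempt
```

Using a monotonic clock matters here: wall-clock time can jump, which would make the budget check unreliable.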
Fallback skill alternatives
When the primary approach fails, a well-designed agent has alternative strategies. This is the skill equivalent of graceful degradation.
Pattern: ordered fallback chain
```typescript
async function searchCode(query: string): Promise<SearchResult> {
  const strategies = [
    {
      name: "ripgrep_search",
      fn: () => invoke("ripgrep", { pattern: query }),
      when: "Ripgrep is available and fastest",
    },
    {
      name: "grep_search",
      fn: () => invoke("grep", { pattern: query, recursive: true }),
      when: "Fallback when ripgrep is not installed",
    },
    {
      name: "manual_search",
      fn: () => invoke("read_and_search", { pattern: query }),
      when: "Last resort: read files one by one and search in memory",
    },
  ];

  for (const strategy of strategies) {
    try {
      const result = await strategy.fn();
      return {
        ...result,
        strategy: strategy.name,
      };
    } catch (error) {
      // Log which strategy failed and why, then try the next one
      console.log(`Strategy ${strategy.name} failed: ${error}`);
      continue;
    }
  }

  return {
    success: false,
    error: "All search strategies failed",
    strategiesAttempted: strategies.map((s) => s.name),
  };
}
```
Pattern: degraded results
Sometimes you can return a partial or lower-quality result instead of failing entirely.
```python
async def get_file_info(path: str) -> dict:
    """Get file information with graceful degradation."""
    result = {"path": path}

    # Try to get full metadata
    try:
        stat = await invoke("file_stat", path=path)
        result["size"] = stat.size
        result["modified"] = stat.modified
        result["permissions"] = stat.permissions
    except Exception:
        # Fall back to just checking existence
        result["exists"] = await invoke("file_exists", path=path)
        result["metadata_available"] = False

    # Try to detect language
    try:
        result["language"] = detect_language(path)
    except Exception:
        result["language"] = "unknown"

    # Try to get line count (requires reading the file)
    try:
        content = await invoke("read_file", path=path)
        result["line_count"] = len(content.splitlines())
    except Exception:
        result["line_count"] = None

    return result
```
This approach returns whatever information it can gather, even if some parts fail. The caller gets a useful (if incomplete) result instead of an error.
User-facing error messages
When an error does reach the user, the message should be helpful, not cryptic. The agent sits between your skill and the human, so your error messages need to give it enough context to explain the problem and suggest next steps.
Anatomy of a good error response
```typescript
interface SkillError {
  /** What went wrong, in plain language */
  message: string;
  /** The error category for routing */
  category: ErrorCategory;
  /** What the agent (or user) can do about it */
  suggestion: string;
  /** Whether retrying might help */
  retryable: boolean;
  /** Technical details for debugging (optional) */
  details?: Record<string, unknown>;
}
```
Bad error response:
```json
{
  "error": "ENOENT: no such file or directory, open '/src/confg.ts'"
}
```
Good error response:
```json
{
  "message": "File not found: /src/confg.ts",
  "category": "resource",
  "suggestion": "Check if the filename is spelled correctly (did you mean 'config.ts'?). Use the search_files skill to find files matching 'config'.",
  "retryable": false,
  "details": {
    "path": "/src/confg.ts",
    "similar_files": ["/src/config.ts", "/src/config.json"]
  }
}
```
The good response tells the agent exactly what went wrong, offers concrete next steps, and even suggests a likely correction. That’s the difference between an agent that gets stuck on errors and one that recovers on its own.
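Producing that shape from a raw exception can be centralized in one helper. A sketch of the idea (the keyword matching mirrors the categorizer from earlier and is purely illustrative; production code would inspect error types or codes rather than message text):

```python
TRANSIENT_MARKERS = ("timeout", "econnreset", "rate limit", "429")


def build_skill_error(error: Exception) -> dict:
    """Wrap a raw exception in the structured error shape shown above."""
    msg = str(error)
    lower = msg.lower()
    if any(m in lower for m in TRANSIENT_MARKERS):
        category, retryable = "transient", True
        suggestion = "Retry with backoff; the service may recover."
    elif "enoent" in lower or "not found" in lower:
        category, retryable = "resource", False
        suggestion = "Check the path; use search_files to locate the file."
    elif "eacces" in lower or "403" in lower:
        category, retryable = "permission", False
        suggestion = "Ask the user to grant access or refresh credentials."
    else:
        category, retryable = "logic", False
        suggestion = "Escalate to the user with the details below."
    return {
        "message": msg,
        "category": category,
        "suggestion": suggestion,
        "retryable": retryable,
        "details": {"type": type(error).__name__},
    }


print(build_skill_error(TimeoutError("request timeout after 30s")))
# -> category "transient", retryable True
```

Every error the skill emits then carries the same fields, so the agent can route on `category` and `retryable` without parsing message strings itself.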
Including similar matches
One of the most useful things an error response can include is a guess at what the user probably meant. This is especially helpful for file paths, command names, and other inputs where typos are common.
```python
import os
from difflib import get_close_matches


def file_not_found_error(requested_path: str, available_files: list[str]) -> dict:
    """Build a helpful error when a file is not found."""
    filename = os.path.basename(requested_path)
    available_names = [os.path.basename(f) for f in available_files]
    similar = get_close_matches(filename, available_names, n=3, cutoff=0.6)

    parts = []
    if similar:
        matching_paths = [f for f in available_files if os.path.basename(f) in similar]
        parts.append(f"Did you mean one of these? {', '.join(matching_paths)}.")
    parts.append("Use search_files to find the file by pattern.")

    return {
        "message": f"File not found: {requested_path}",
        "category": "resource",
        "suggestion": " ".join(parts),
        "retryable": False,
        "details": {"similar_files": similar},
    }
```
Partial completion and recovery
In multi-step workflows, a failure in step 3 shouldn’t throw away the work done in steps 1 and 2. Partial completion preserves progress and lets the agent (or user) decide how to proceed.
Pattern: step-level error isolation
```typescript
interface StepOutcome {
  step: string;
  status: "success" | "failed" | "skipped";
  result?: unknown;
  error?: SkillError;
}

async function executeWorkflowSteps(
  steps: WorkflowStep[],
  ctx: WorkflowContext,
): Promise<StepOutcome[]> {
  const outcomes: StepOutcome[] = [];

  for (const step of steps) {
    // Check if this step's dependencies succeeded
    const dependenciesMet = step.dependencies.every((dep) =>
      outcomes.find((o) => o.step === dep && o.status === "success"),
    );
    if (!dependenciesMet) {
      const failedDep = step.dependencies.find(
        (d) => !outcomes.find((o) => o.step === d && o.status === "success"),
      );
      outcomes.push({
        step: step.name,
        status: "skipped",
        error: {
          message: `Skipped: dependency ${failedDep} did not succeed`,
          category: ErrorCategory.LOGIC,
          suggestion: "Fix the failed dependency and retry the workflow.",
          retryable: true,
        },
      });
      continue;
    }

    try {
      const result = await withRetry(() => step.execute(ctx));
      outcomes.push({ step: step.name, status: "success", result });
    } catch (error) {
      outcomes.push({
        step: step.name,
        status: "failed",
        error: buildSkillError(error as Error),
      });
      // If this step is critical, stop the workflow
      if (step.critical) {
        break;
      }
      // Otherwise, continue with remaining steps
    }
  }

  return outcomes;
}
```
This pattern lets non-critical steps fail without stopping everything. A code review workflow might continue even if the complexity analysis step fails. The lint results and test coverage are still worth having on their own.
Reporting partial results
When a workflow completes partially, the skill should clearly communicate what succeeded, what failed, and what got skipped.
```python
def format_partial_result(outcomes: list[StepOutcome]) -> dict:
    succeeded = [o for o in outcomes if o.status == "success"]
    failed = [o for o in outcomes if o.status == "failed"]
    skipped = [o for o in outcomes if o.status == "skipped"]

    return {
        "completed": len(succeeded) == len(outcomes),
        "summary": (
            f"{len(succeeded)} of {len(outcomes)} steps completed. "
            f"{len(failed)} failed, {len(skipped)} skipped."
        ),
        "succeeded": [o.step for o in succeeded],
        "failed": [
            {"step": o.step, "error": o.error.message}
            for o in failed
        ],
        "skipped": [o.step for o in skipped],
        "suggestion": _generate_recovery_suggestion(failed),
    }
```
The pattern that runs through all of this is the same: errors are data, not dead ends. When your skill tells the agent what went wrong, whether it should retry, and what else it could try instead, the agent can reason its way through failures the way a good developer would. When your skill just throws an exception and hopes for the best, the agent flails. Build the recovery path into the error itself, and you’ll be surprised how rarely failures actually require human intervention.
For tracking how these error patterns behave in production, Observability for Agents covers the metrics, tracing, and alerting you need to catch failures before your users do. For how error handling composes at the orchestration level, agent orchestration patterns shows how fan-out/fan-in handles partial failures and how pipelines implement per-stage retry with backoff.
Related articles
AI agent skill anti-patterns to avoid
Common mistakes in agent skill design: the god tool, leaky abstractions, over-parameterization, and patterns that lead to unreliable agents.
Context management for AI agents
Strategies for working within context window limits: summarization, selective loading, and memory patterns for agent skills.
Human-in-the-loop patterns for AI agents
Patterns for involving humans in agent workflows: approval gates, progressive autonomy, and knowing when to escalate.