# Error Handling Patterns
How to build agent skills that handle failures gracefully: retry strategies, fallbacks, partial completion, and informative error responses.
Every skill will fail eventually. APIs go down. Files get deleted. Network connections drop. Rate limits kick in. The difference between a fragile agent and a reliable one isn’t whether failures happen. It’s how the agent responds when they do.
This article covers the essential error handling patterns for agent skills: retry strategies, fallback alternatives, user-facing error messages, and partial completion recovery. These patterns are the foundation for building multi-step workflows that can survive real-world conditions.
## Error categories
Before building error handling, it helps to categorize the kinds of failures your skills will hit. Different failure modes need different responses.
| Category | Examples | Typical response |
|---|---|---|
| Transient | Network timeout, rate limit, temporary service outage | Retry with backoff |
| Input error | Invalid file path, malformed query, missing parameter | Return clear error, suggest correction |
| Permission | Access denied, authentication expired | Escalate to user |
| Resource | File not found, database unavailable, disk full | Fallback or escalate |
| Logic | Unexpected data format, conflicting state | Log and escalate |
| Catastrophic | Out of memory, unhandled exception | Fail safely, preserve state |
The first step in any error handling strategy is classifying the error so you can route it to the right handler.
```typescript
enum ErrorCategory {
  TRANSIENT = "transient",
  INPUT = "input",
  PERMISSION = "permission",
  RESOURCE = "resource",
  LOGIC = "logic",
  CATASTROPHIC = "catastrophic",
}

function categorizeError(error: Error): ErrorCategory {
  if (
    error.message.includes("ECONNRESET") ||
    error.message.includes("timeout")
  ) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("rate limit") || error.message.includes("429")) {
    return ErrorCategory.TRANSIENT;
  }
  if (error.message.includes("ENOENT") || error.message.includes("not found")) {
    return ErrorCategory.RESOURCE;
  }
  if (error.message.includes("EACCES") || error.message.includes("403")) {
    return ErrorCategory.PERMISSION;
  }
  if (
    error.message.includes("invalid") ||
    error.message.includes("malformed")
  ) {
    return ErrorCategory.INPUT;
  }
  return ErrorCategory.LOGIC;
}
```
## Retry strategies with backoff
Transient errors are the most common failure mode, and retries are the first line of defense. But naive retries (immediately hitting the same request again) often make things worse. They can overwhelm a struggling service or burn through rate limits faster.
### Exponential backoff with jitter
The standard approach is exponential backoff: wait 1 second, then 2, then 4, then 8. Adding jitter (randomness) prevents multiple clients from retrying at exactly the same time, which would create a thundering herd.
```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  config: RetryConfig = {
    maxAttempts: 3,
    baseDelayMs: 1000,
    maxDelayMs: 30000,
    jitter: true,
  },
): Promise<T> {
  let lastError: Error;
  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      // Only retry transient errors
      if (categorizeError(lastError) !== ErrorCategory.TRANSIENT) {
        throw lastError;
      }
      if (attempt === config.maxAttempts) {
        throw lastError;
      }
      // Calculate delay with exponential backoff
      let delay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt - 1),
        config.maxDelayMs,
      );
      // Add jitter: random value between 0 and the calculated delay
      if (config.jitter) {
        delay = Math.random() * delay;
      }
      await sleep(delay);
    }
  }
  throw lastError!;
}
```
The same pattern in Python:
```python
import asyncio
import random

async def with_retry(
    fn,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    jitter: bool = True,
):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception as e:
            last_error = e
            if categorize_error(e) != "transient":
                raise
            if attempt == max_attempts:
                raise
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            if jitter:
                delay = random.uniform(0, delay)
            await asyncio.sleep(delay)
    raise last_error
```
### When to stop retrying
Retries shouldn’t go on forever. Define clear stopping conditions:
- Maximum attempts reached, typically 3-5 for API calls
- Total elapsed time exceeded. Don’t retry for 10 minutes if the user expects a response in seconds.
- Error category changed. If a transient error becomes a permission error, stop retrying.
- Context budget running low. Each retry consumes agent context; see Context Management.
After exhausting retries, escalate to the user with a clear explanation rather than failing silently. This connects to the escalation patterns in Human-in-the-Loop.
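These stopping conditions can be combined in one wrapper. A minimal sketch (the function name, defaults, and time budget here are illustrative, not from any particular library) that caps both attempt count and total elapsed time:

```python
import asyncio
import random
import time

async def retry_with_deadline(
    fn,
    max_attempts: int = 5,
    max_elapsed: float = 15.0,  # total time budget in seconds (illustrative)
    base_delay: float = 1.0,
):
    """Retry with jittered exponential backoff, stopping at whichever limit hits first."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            # Stop early if the next wait would blow the time budget
            if time.monotonic() - start + delay > max_elapsed:
                raise
            await asyncio.sleep(delay)
```

A caller that fails twice and then succeeds gets its result on the third attempt; a call that keeps failing surfaces the last exception once either limit is reached.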
## Fallback skill alternatives
When the primary approach fails, a well-designed agent has alternative strategies. This is the skill equivalent of graceful degradation.
### Pattern: ordered fallback chain
```typescript
async function searchCode(query: string): Promise<SearchResult> {
  const strategies = [
    {
      name: "ripgrep_search",
      fn: () => invoke("ripgrep", { pattern: query }),
      when: "Ripgrep is available and fastest",
    },
    {
      name: "grep_search",
      fn: () => invoke("grep", { pattern: query, recursive: true }),
      when: "Fallback when ripgrep is not installed",
    },
    {
      name: "manual_search",
      fn: () => invoke("read_and_search", { pattern: query }),
      when: "Last resort: read files one by one and search in memory",
    },
  ];

  for (const strategy of strategies) {
    try {
      const result = await strategy.fn();
      return {
        ...result,
        strategy: strategy.name,
      };
    } catch (error) {
      // Log which strategy failed and why, then try the next one
      console.log(`Strategy ${strategy.name} failed: ${error}`);
      continue;
    }
  }

  return {
    success: false,
    error: "All search strategies failed",
    strategiesAttempted: strategies.map((s) => s.name),
  };
}
```
### Pattern: degraded results
Sometimes you can return a partial or lower-quality result instead of failing entirely.
```python
async def get_file_info(path: str) -> dict:
    """Get file information with graceful degradation."""
    result = {"path": path}

    # Try to get full metadata
    try:
        stat = await invoke("file_stat", path=path)
        result["size"] = stat.size
        result["modified"] = stat.modified
        result["permissions"] = stat.permissions
    except Exception:
        # Fall back to just checking existence
        result["exists"] = await invoke("file_exists", path=path)
        result["metadata_available"] = False

    # Try to detect language
    try:
        result["language"] = detect_language(path)
    except Exception:
        result["language"] = "unknown"

    # Try to get line count (requires reading the file)
    try:
        content = await invoke("read_file", path=path)
        result["line_count"] = len(content.splitlines())
    except Exception:
        result["line_count"] = None

    return result
```
This approach returns whatever information it can gather, even if some parts fail. The caller gets a useful (if incomplete) result instead of an error.
## User-facing error messages
When an error does reach the user, the message should be helpful, not cryptic. The agent is the intermediary between your skill and the human, so your error messages need to give the agent enough information to explain the problem and suggest next steps.
### Anatomy of a good error response
```typescript
interface SkillError {
  /** What went wrong, in plain language */
  message: string;
  /** The error category for routing */
  category: ErrorCategory;
  /** What the agent (or user) can do about it */
  suggestions: string[];
  /** Whether retrying might help */
  retryable: boolean;
  /** Technical details for debugging (optional) */
  details?: Record<string, unknown>;
}
```
Bad error response:
```json
{
  "error": "ENOENT: no such file or directory, open '/src/confg.ts'"
}
```
Good error response:
```json
{
  "message": "File not found: /src/confg.ts",
  "category": "resource",
  "suggestions": [
    "Check if the filename is spelled correctly (did you mean 'config.ts'?)",
    "Use the search_files skill to find files matching 'config'"
  ],
  "retryable": false,
  "details": {
    "path": "/src/confg.ts",
    "similar_files": ["/src/config.ts", "/src/config.json"]
  }
}
```
The good response tells the agent exactly what went wrong, offers concrete next steps, and even suggests a likely correction. That’s the difference between an agent that gets stuck on errors and one that recovers on its own.
### Including similar matches
One of the most useful things an error response can include is a guess at what the user probably meant. This is especially helpful for file paths, command names, and other inputs where typos are common.
```python
import os
from difflib import get_close_matches

def file_not_found_error(requested_path: str, available_files: list[str]) -> dict:
    """Build a helpful error when a file is not found."""
    filename = os.path.basename(requested_path)
    available_names = [os.path.basename(f) for f in available_files]
    similar = get_close_matches(filename, available_names, n=3, cutoff=0.6)

    suggestions = []
    if similar:
        matching_paths = [f for f in available_files if os.path.basename(f) in similar]
        suggestions.append(f"Did you mean one of these? {', '.join(matching_paths)}")
    suggestions.append("Use search_files to find the file by pattern.")

    return {
        "message": f"File not found: {requested_path}",
        "category": "resource",
        "suggestions": suggestions,
        "retryable": False,
        "details": {"similar_files": similar},
    }
```
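The fuzzy matching does the heavy lifting here. With difflib's standard library behavior, a one-character typo scores well above the 0.6 cutoff, and candidates come back best match first (file names below are just examples):

```python
from difflib import get_close_matches

available = ["config.ts", "config.json", "main.ts"]
# 'confg.ts' is one character off from 'config.ts'
matches = get_close_matches("confg.ts", available, n=3, cutoff=0.6)
# 'config.ts' ranks above 'config.json'; 'main.ts' falls below the cutoff
```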
## Partial completion and recovery
In multi-step workflows, a failure in step 3 shouldn’t throw away the work done in steps 1 and 2. Partial completion preserves progress and lets the agent (or user) decide how to proceed.
### Pattern: step-level error isolation
```typescript
interface StepOutcome {
  step: string;
  status: "success" | "failed" | "skipped";
  result?: unknown;
  error?: SkillError;
}

async function executeWorkflowSteps(
  steps: WorkflowStep[],
  ctx: WorkflowContext,
): Promise<StepOutcome[]> {
  const outcomes: StepOutcome[] = [];

  for (const step of steps) {
    // Check if this step's dependencies succeeded
    const succeeded = (dep: string) =>
      outcomes.some((o) => o.step === dep && o.status === "success");
    const dependenciesMet = step.dependencies.every(succeeded);

    if (!dependenciesMet) {
      const failedDep = step.dependencies.find((d) => !succeeded(d));
      outcomes.push({
        step: step.name,
        status: "skipped",
        error: {
          message: `Skipped: dependency ${failedDep} did not succeed`,
          category: ErrorCategory.LOGIC,
          suggestions: ["Fix the failed dependency and retry the workflow"],
          retryable: true,
        },
      });
      continue;
    }

    try {
      const result = await withRetry(() => step.execute(ctx));
      outcomes.push({ step: step.name, status: "success", result });
    } catch (error) {
      outcomes.push({
        step: step.name,
        status: "failed",
        error: buildSkillError(error as Error),
      });
      // If this step is critical, stop the workflow
      if (step.critical) {
        break;
      }
      // Otherwise, continue with remaining steps
    }
  }

  return outcomes;
}
```
This pattern lets non-critical steps fail without stopping everything. A code review workflow might continue even if the complexity analysis step fails. The lint results and test coverage are still worth having on their own.
### Reporting partial results
When a workflow completes partially, the skill should clearly communicate what succeeded, what failed, and what got skipped.
```python
def format_partial_result(outcomes: list[StepOutcome]) -> dict:
    succeeded = [o for o in outcomes if o.status == "success"]
    failed = [o for o in outcomes if o.status == "failed"]
    skipped = [o for o in outcomes if o.status == "skipped"]

    return {
        "completed": len(succeeded) == len(outcomes),
        "summary": (
            f"{len(succeeded)} of {len(outcomes)} steps completed. "
            f"{len(failed)} failed, {len(skipped)} skipped."
        ),
        "succeeded": [o.step for o in succeeded],
        "failed": [
            {"step": o.step, "error": o.error.message}
            for o in failed
        ],
        "skipped": [o.step for o in skipped],
        "suggestions": _generate_recovery_suggestions(failed),
    }
```
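To see the shape of the summary this produces, here is a self-contained sketch using simple stand-ins for the outcome records (the step names and `SimpleNamespace` objects are hypothetical; the real type carries richer error objects):

```python
from types import SimpleNamespace

# Hypothetical outcomes from a code review workflow
outcomes = [
    SimpleNamespace(step="lint", status="success"),
    SimpleNamespace(step="complexity", status="failed",
                    error=SimpleNamespace(message="parser crashed")),
    SimpleNamespace(step="report", status="skipped"),
]

succeeded = [o for o in outcomes if o.status == "success"]
failed = [o for o in outcomes if o.status == "failed"]
skipped = [o for o in outcomes if o.status == "skipped"]

summary = (
    f"{len(succeeded)} of {len(outcomes)} steps completed. "
    f"{len(failed)} failed, {len(skipped)} skipped."
)
# summary: "1 of 3 steps completed. 1 failed, 1 skipped."
```

A single sentence like this gives the agent the whole picture at a glance, with the per-step lists available when it needs detail.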
## Key takeaways

- **Categorize errors first, then route them.** Transient errors get retries. Input errors get correction suggestions. Permission errors get escalated.
- **Use exponential backoff with jitter for retries.** Naive immediate retries make things worse. Always cap the number of attempts and total elapsed time.
- **Build fallback chains.** When the primary approach fails, have alternative strategies ready. Return degraded results rather than nothing.
- **Make error messages actionable.** Tell the agent what went wrong, whether it should retry, and what alternatives exist. Include similar matches for likely typos.
- **Preserve partial progress.** In multi-step workflows, isolate failures to individual steps. Report what succeeded alongside what failed so the agent can make an informed decision about how to proceed.
- **Know when to stop.** After exhausting retries and fallbacks, escalate to the user with a clear explanation. Endless automatic recovery wastes context and time.