
Skill: incident triage

A ready-to-use skill that triages production alerts by checking logs, recent deploys, and health endpoints. Read-only. Copy and install in 30 seconds.


Your pager goes off at 2 AM. You open your laptop, squint at the alert, and start the same routine: check the logs, check recent deploys, check the health endpoints, piece together a timeline. This skill turns that routine into an automated first-response runbook. Hand it an alert or error message, and it gathers context from logs, git history, and service health checks, then produces a triage summary with a severity assessment and suggested next steps. It is strictly read-only: it investigates but never remediates.

The skill file

Copy the following into .claude/skills/incident-triage.md. This is the complete skill definition.

# Incident triage

## Purpose

Triage a production incident by gathering context from logs, recent deployments, and health endpoints. Produce a structured summary with severity assessment and suggested next steps. This skill is read-only: it investigates but never takes action to fix anything.

### Returns

A markdown block with these fields, in this order:

- `Service`: the affected service or component name
- `Severity`: one of SEV-1/2/3/4 with a one-line reason
- `Status`: ongoing, intermittent, or resolved
- `Started`: timestamp or estimate of first error
- `Duration`: time elapsed since first error
- `Timeline`: bullet list of timestamped events
- `What we know`: 2-4 sentences summarizing evidence
- `Likely cause`: best assessment grounded in the evidence
- `Suggested next steps`: 1-3 numbered actions for a human to take
- `Raw evidence`: collapsible blocks of relevant log excerpts and health output

Downstream skills (escalation, postmortem, status-page updaters) can rely on these fields.

## When to use

- The user reports a production alert or error
- The user pastes an error message or alert payload
- The user asks "what's going on with [service]?"
- The user asks for help triaging an incident

## When not to use

- The user wants you to fix the issue, not investigate it. This skill triages and recommends. Hand off to a separate skill, or to a human, for any remediation step.
- The affected service is in a non-production environment where read-only constraints don't matter and faster iteration would help more.
- The alert payload includes credentials, secrets, or sensitive customer data. Running log commands could surface them in tool output. Stop and ask the user how to proceed.
- You don't have read access to the affected service's logs or git repo. Without context, the triage will be guesswork. Ask the user for access first, or hand off.

## Steps

### 1. Parse the alert

Read the alert or error message the user provided. Extract:

- Service or component name (from the alert source, hostname, or error context)
- Error type (HTTP 5xx, timeout, OOM, crash, connection refused, etc.)
- Timestamp (when the alert fired, if included)
- Affected endpoint or function (if mentioned)

If the user gave a vague description like "the API is down," ask one clarifying question: "Which service or URL is affected?" Do not ask more than one question. Work with what you have.
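As a rough shell illustration of the extraction (the alert text here is hypothetical; in practice you parse whatever the user pasted):

```bash
# Illustration only: pull the obvious fields out of a pasted alert.
ALERT='payments-api health check failed, HTTP 502, threshold 3 consecutive failures'
SERVICE=$(echo "$ALERT" | grep -oE '^[a-z0-9-]+')                    # service name
CODE=$(echo "$ALERT" | grep -oE 'HTTP [0-9]{3}' | awk '{print $2}')  # error type
echo "service=$SERVICE error_code=$CODE"
```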

### 2. Check recent deployments

Before running any git commands, confirm with the user that the working directory is the affected service's repository. Ask: "Should I run git commands in this directory, or is the repo somewhere else?" If git is not installed or the directory isn't a git repo, skip this step and note the gap in the timeline.

Then run `git log --oneline --since="24 hours ago"` to see what was deployed in the last day.

If the project uses tags for releases, also run `git tag --sort=-creatordate | head -5` to see recent release tags.

For each recent commit or deploy, note:

- When it was deployed
- What changed (from the commit message)
- Whether the change touched the affected service or component

If a deployment happened shortly before the alert, flag it as a likely contributor.
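When correlating with the logs, ISO timestamps are easier to line up against error times than `--oneline` output. A sketch, demonstrated in a throwaway repo (in real use, run the last command in the service's repository):

```bash
# Demo setup: a throwaway repo with one commit standing in for a deploy.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.email=oncall@example.com -c user.name=oncall \
  commit -q --allow-empty -m 'deploy: demo commit'

# The useful part: short hash, ISO commit time, subject line.
DEPLOYS=$(git log --since='24 hours ago' --format='%h %cI %s')
echo "$DEPLOYS"
```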

### 3. Search logs for related errors

Run a log search for the error. The exact command depends on the environment. Try these in order, using whichever one works.

For Docker containers:

```bash
docker logs <container_name> --since 1h 2>&1 | tail -200
```

For systemd services:

```bash
journalctl -u <service_name> --since "1 hour ago" --no-pager | tail -200
```

For log files:

```bash
tail -500 /var/log/<service>/error.log
```

If none of those apply, ask the user: "Where are your logs? Give me a path, container name, or command to pull them."

For high-volume services, do not pipe full log streams into your context. Filter for the error string first, then pull surrounding lines:

```bash
docker logs <container_name> --since 1h 2>&1 \
  | grep -i "<error_string>" -A 5 -B 2 | head -100
```

From the log output, extract:

- The first occurrence of the error (to establish when it started)
- The frequency (every request, intermittent, or one-off?)
- Any stack traces or error details
- Any upstream errors (database connections, external API failures, DNS issues)
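A sketch of that extraction over a saved excerpt (the log lines below are fabricated for illustration):

```bash
# Fabricated log file standing in for the real excerpt.
LOG=$(mktemp)
printf '%s\n' \
  '2024-05-01T14:32:07Z ERROR connection pool exhausted' \
  '2024-05-01T14:32:09Z INFO retrying request' \
  '2024-05-01T14:33:41Z ERROR connection pool exhausted' > "$LOG"

FIRST=$(grep -m1 'ERROR' "$LOG" | awk '{print $1}')   # first occurrence
COUNT=$(grep -c 'ERROR' "$LOG")                       # frequency
echo "first=$FIRST count=$COUNT"
```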

Before including any log lines in the triage summary, scan them for credentials, API keys, tokens, connection strings, or PII. Redact anything sensitive. The summary will likely be pasted into Slack, a ticket, or an incident-management tool, and you do not want secrets riding along.
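A minimal redaction pass might look like this; the patterns are illustrative, not exhaustive, so still eyeball the output:

```bash
# Mask common secret shapes before quoting log lines anywhere.
redact() {
  sed -E \
    -e 's/(api[_-]?key|token|password|secret)=[^[:space:]&]+/\1=[REDACTED]/g' \
    -e 's/Bearer [A-Za-z0-9._-]+/Bearer [REDACTED]/g'
}

LINE='POST /v1/charge?api_key=sk_live_abc123 auth="Bearer eyJhbGciOi"'
CLEAN=$(echo "$LINE" | redact)
echo "$CLEAN"
```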

### 4. Check health endpoints

If the service has health check URLs, check them. Derive the base URL from the alert payload or from the user's input rather than assuming `localhost`. If the agent is running on a development machine or CI host, `localhost` will resolve to the wrong process.

```bash
curl -s -o /dev/null -w "%{http_code} %{time_total}s" \
  https://<service-host>/health
```

Also check common health paths: `/health`, `/healthz`, `/api/health`, `/status`, `/ready`.

Record:

- HTTP status code
- Response time
- Whether the response body indicates degraded components

If you do not know the service URL, ask the user or skip this step.
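The path checks can be folded into one loop. The base URL below is a placeholder; an unreachable host shows up as code 000:

```bash
BASE='https://payments-api.internal'   # placeholder: derive from the alert
OUT=$(mktemp)
for path in /health /healthz /api/health /status /ready; do
  code=$(curl -s -o /dev/null --max-time 3 -w '%{http_code}' "$BASE$path")
  echo "$path -> $code" >> "$OUT"
done
cat "$OUT"
```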

### 5. Check resource usage

If you have shell access to the affected host, sample current resource usage. Use a two-sample `top` so the second reading reflects current activity rather than since-boot averages:

```bash
# CPU and memory (use the second sample, not the first)
top -bn2 -d 0.5 | tail -20

# Disk usage
df -h

# Open connections
ss -s
```

Note anything unusual: high CPU, memory near capacity, disk full, connection count spikes.

### 6. Build a timeline

From the information gathered in steps 1 through 5, construct a timeline:

- When was the last successful state (if determinable)?
- When did the first error appear in logs?
- Were there any deployments between the last success and the first error?
- Is the error ongoing or did it resolve?

### 7. Assess severity

Based on the evidence, assign a severity:

- SEV-1 (Critical): Service is fully down. Users cannot complete core actions. Data loss is occurring or likely.
- SEV-2 (High): Service is degraded. Some users are affected, or a non-core function is completely broken.
- SEV-3 (Medium): Errors are occurring but the service is functional. Performance is degraded but usable.
- SEV-4 (Low): Cosmetic or minor issue. No user impact yet, but the error rate is elevated.

### 8. Produce the triage summary

Output the summary in this format:

```
## Incident triage

**Service:** [name]
**Severity:** [SEV-1/2/3/4]: [one-line reason]
**Status:** [ongoing / intermittent / resolved]
**Started:** [timestamp or estimate]
**Duration:** [how long so far]

### Timeline

- [timestamp] [event]
- [timestamp] [event]
- ...

### What we know

[2-4 sentences summarizing the evidence. What is broken, how it
manifests, and who is affected.]

### Likely cause

[Your best assessment based on the evidence. Be specific.
"The deployment at 14:32 changed the database connection pool
configuration, and the first connection timeout appeared at 14:35"
is better than "something might be wrong with the database."]

### Suggested next steps (human actions)

1. [Most important action for a human to take now]
2. [Second priority]
3. [Third priority]

### Raw evidence

<details>
<summary>Log excerpts</summary>

[Relevant log lines, redacted of any credentials or PII]

</details>

<details>
<summary>Health check results</summary>

[Output from health checks]

</details>
```

## Important rules

- NEVER take remediation actions. Do not restart services, roll back deployments, modify configuration, or change anything. This skill is investigation only.
- NEVER guess when you can check. If you can run a command to verify something, run it.
- NEVER bury the severity. Put it at the top of the summary so the on-call engineer sees it immediately.
- NEVER include credentials, API keys, tokens, or PII in the triage summary. Scan log excerpts before pasting them into Raw evidence and redact anything sensitive.
- Suggested next steps describe what a human should do, not what you have done or will do. The "(human actions)" label exists so the on-call engineer is never unsure who is acting.
- If you cannot access logs or health endpoints, say so explicitly. Do not fabricate evidence.
- If the evidence is ambiguous, say "likely cause" not "root cause." You are triaging, not concluding.
- Keep the triage summary under 50 lines. The on-call engineer needs to read it in 30 seconds.

How to install it

Create the skill file in your project:

```bash
mkdir -p .claude/skills
touch .claude/skills/incident-triage.md
# then paste the skill content into that file
```

For a personal skill that works across all your projects, place it in ~/.claude/skills/incident-triage.md instead.

Example usage

A PagerDuty alert lands and you paste it into your agent:

> The payments API is returning 502s. Started about 10 minutes ago.
> Here's the alert: "payments-api health check failed, HTTP 502,
> threshold 3 consecutive failures"

Parsing alert... service: payments-api, error: HTTP 502, onset: ~10 min ago.

Confirm working directory: I'm in ~/code/payments-api. Checking git history.

Checking recent deployments...
Found 2 commits in the last 24 hours:
  af3e21 feat(payments): switch to connection pooling for Stripe client
  b8d4a1 chore: update nginx config for /api/v2 routes

Searching logs (filtered for 5xx)...
docker logs payments-api --since 1h | grep "5[0-9][0-9]" -A 2:
  First error at 14:32:07: "connection pool exhausted, max 5 reached"
  78 occurrences in the last 10 minutes
  All errors reference the Stripe client connection pool

Checking health endpoints at https://payments-api.internal...
  /health: 502, 12.4s response time
  /ready: 502

## Incident triage

**Service:** payments-api
**Severity:** SEV-1: payment processing is failing for all users
**Status:** ongoing
**Started:** ~14:32 UTC (10 minutes ago)
**Duration:** 10 minutes

### Timeline

- 14:22 Deployment af3e21 (connection pooling for Stripe client)
- 14:32 First "connection pool exhausted" error
- 14:32 Health check starts failing
- 14:42 Alert fires after 3 consecutive failures

### What we know

All requests to the payments API are failing with 502. The logs
show "connection pool exhausted" errors starting 10 minutes after
a deployment that changed the Stripe client to use connection
pooling. The pool max is set to 5, which appears too low for
current traffic.

### Likely cause

Deployment af3e21 introduced a connection pool for the Stripe
client with a max size of 5. Under normal traffic, this pool
is being exhausted, causing all subsequent requests to fail.

### Suggested next steps (human actions)

1. Roll back deployment af3e21 to restore the previous Stripe
   client configuration
2. After rollback, increase the pool size (try 25) and redeploy
3. Add a connection pool metric to monitoring to catch this earlier

How it works

The most important constraint in the skill sits at the bottom of the rules section, but it’s worth surfacing first: this skill is read-only. An agent that restarts your production database at 2 AM because it thought that would help is worse than no agent at all. Triage gathers evidence and makes recommendations. The human decides what to do next. For more on this approval-gate pattern, see human-in-the-loop patterns for AI agents.

The rest of the design follows the mental model an experienced on-call engineer uses. Step 1 extracts structured information from the alert before doing anything else. I’ve watched agents jump straight to log searches and waste two minutes grepping the wrong service because the structured-extraction step got skipped. It costs nothing and saves time on every incident.

Step 2 checks git history before searching logs, and the ordering is deliberate. The most common cause of production incidents is “somebody deployed something.” Checking deployments first lets the agent correlate timestamps when it reads the logs in step 3. Step 3 then tries Docker logs, journalctl, and log files in sequence, since not every project runs the same way. The skill tries the common sources and falls back to asking only when none of them work.

Step 7 defines specific severity levels with clear criteria. Without them, the agent would either panic about everything or understate problems. The four-level scale gives the on-call engineer an immediate sense of urgency. For more on why explicit constraints like this matter for skills, see skill design principles.

What this skill cannot do, and what to plan around: it depends on the agent having read access to logs, the git repo, and (optionally) health endpoints. If your environment hides logs behind Kubernetes RBAC or your repos live on a separate machine, the skill will need to ask you to paste relevant data in. That’s by design, but it’s worth knowing before you install.

For the longer take on agent-led incident triage and where it fits next to alerting and remediation tooling, see AI agents for DevOps and SRE teams.

Customizing it

If you use Datadog, Grafana, Prometheus, or any monitoring platform with an HTTP API, add commands to step 5 so the agent can pull metrics directly. For Datadog the right shape is a curl against the metrics-query endpoint with your API and application keys; for Prometheus it is a curl against the /api/v1/query endpoint. Adapt those shapes to whatever your platform actually supports. The more data sources the agent can check, the better the triage.
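For the Prometheus side, a minimal sketch; the host and the PromQL query are both placeholders, so adjust each to your setup:

```bash
PROM_URL='http://prometheus.internal:9090'   # placeholder host
QUERY='rate(http_requests_total{job="payments-api",code=~"5.."}[5m])'
REQUEST="${PROM_URL}/api/v1/query"
echo "$REQUEST"

# Uncomment to run against a live Prometheus (jq trims the JSON):
# curl -s -G "$REQUEST" --data-urlencode "query=$QUERY" | jq '.data.result'
```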

If your team maintains runbooks, add a step between 7 and 8 that asks the agent to check for a runbook at docs/runbooks/[service-name].md. When one exists, include the link in the suggested next steps. This wires the triage into your existing processes instead of replacing them.
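A sketch of that lookup, using the path convention above (the service name is whatever step 1 extracted):

```bash
SVC='payments-api'                  # from step 1
RUNBOOK="docs/runbooks/${SVC}.md"
if [ -f "$RUNBOOK" ]; then
  NOTE="Runbook: $RUNBOOK (link it in the suggested next steps)"
else
  NOTE="No runbook found at $RUNBOOK"
fi
echo "$NOTE"
```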

The default severity levels are deliberately generic. Replace them with your team’s definitions. If your org defines SEV-1 as “revenue-impacting” and SEV-2 as “user-facing but not revenue,” update the criteria to match. See skill design principles for why your criteria should be specific rather than borrowed.

To have the agent suggest (but not execute) notification commands, extend step 8 with a “Notify” section listing the Slack or PagerDuty command the engineer should run to escalate. The agent still does not execute it, just tells you what to run.
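One shape that works: the agent prints the command instead of running it. `SLACK_WEBHOOK_URL` here stands in for your team's Slack incoming webhook, which expects a JSON body with a `text` field:

```bash
SUMMARY='payments-api triage complete, summary attached'
PAYLOAD=$(printf '{"text":"%s"}' "$SUMMARY")

echo 'To escalate, run:'
echo "curl -X POST -H 'Content-type: application/json' -d '$PAYLOAD' \$SLACK_WEBHOOK_URL"
```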

If you only want this skill to run for certain services, add a list at the top of the file: “This skill covers payments-api, user-service, and auth-gateway. For other services, tell the user you don’t have context.” This prevents the agent from guessing about systems it knows nothing about.

The point of all this

On-call is exhausting because the work is mechanical: you do the same checks in the same order every time. That’s exactly the work to delegate. Hand the alert to the agent, let it do the legwork, and read the summary while you’re still putting on your glasses. The judgment about what to actually do still belongs to you.