# Incident triage

## Purpose

Triage a production incident by gathering context from logs, recent deployments, and health endpoints. Produce a structured summary with severity assessment and suggested next steps. This skill is read-only: it investigates but never takes action to fix anything.

### Returns

A markdown block with these fields, in this order:

- `Service`: the affected service or component name
- `Severity`: one of SEV-1/2/3/4 with a one-line reason
- `Status`: ongoing, intermittent, or resolved
- `Started`: timestamp or estimate of first error
- `Duration`: time elapsed since first error
- `Timeline`: bullet list of timestamped events
- `What we know`: 2-4 sentences summarizing evidence
- `Likely cause`: best assessment grounded in the evidence
- `Suggested next steps`: 1-3 numbered actions for a human to take
- `Raw evidence`: collapsible blocks of relevant log excerpts and health output

Downstream skills (escalation, postmortem, status-page updaters) can rely on these fields.

## When to use

- The user reports a production alert or error
- The user pastes an error message or alert payload
- The user asks "what's going on with [service]?"
- The user asks for help triaging an incident

## When not to use

- The user wants you to fix the issue, not investigate it. This skill triages and recommends. Hand off to a separate skill, or to a human, for any remediation step.
- The affected service is in a non-production environment where read-only constraints don't matter and faster iteration would help more.
- The alert payload includes credentials, secrets, or sensitive customer data. Running log commands could surface them in tool output. Stop and ask the user how to proceed.
- You don't have read access to the affected service's logs or git repo. Without context, the triage will be guesswork. Ask the user for access first, or hand off.

## Steps

### 1. Parse the alert

Read the alert or error message the user provided. Extract:

- Service or component name (from the alert source, hostname, or error context)
- Error type (HTTP 5xx, timeout, OOM, crash, connection refused, etc.)
- Timestamp (when the alert fired, if included)
- Affected endpoint or function (if mentioned)
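
For example, a hypothetical alert might reduce to something like this (the wording and names are illustrative, not a required format):

```
Alert: "HighErrorRate on checkout-api: 5xx ratio above 5% for 10m (fired 14:41 UTC)"

Service:   checkout-api
Error:     HTTP 5xx spike
Timestamp: 14:41 UTC (when the alert fired; errors may predate it)
Endpoint:  not stated in the alert
```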

If the user gave a vague description like "the API is down," ask one clarifying question: "Which service or URL is affected?" Do not ask more than one question. Work with what you have.

### 2. Check recent deployments

Before running any git commands, confirm with the user that the working directory is the affected service's repository. Ask: "Should I run git commands in this directory, or is the repo somewhere else?" If git is not installed or the directory isn't a git repo, skip this step and note the gap in the timeline.
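
Once the user confirms, a quick sanity check before anything else (a sketch):

```bash
# Prints "true" inside a git work tree; otherwise falls through to the message
git rev-parse --is-inside-work-tree 2>/dev/null || echo "not a git repo here"
```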

Then run `git log --oneline --since="24 hours ago"` to see what changed in the last day; recent commits are a proxy for what was deployed.

If the project uses tags for releases, also run `git tag --sort=-creatordate | head -5` to see recent release tags.

For each recent commit or deploy, note:

- When it was deployed
- What changed (from the commit message)
- Whether the change touched the affected service or component

If a deployment happened shortly before the alert, flag it as a likely contributor.
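
Because `--oneline` omits dates, a timestamped variant makes it easier to line commits up against the alert time (a sketch; `%ci` is the committer date):

```bash
# Short hash, committer date, and subject for the last day
git log --since="24 hours ago" --format="%h %ci %s"
```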

### 3. Search logs for related errors

Run a log search for the error. The exact command depends on the environment. Try these in order and use the first one that works.

For Docker containers:

```bash
docker logs <container_name> --since 1h 2>&1 | tail -200
```

For systemd services:

```bash
journalctl -u <service_name> --since "1 hour ago" --no-pager | tail -200
```

For log files:

```bash
tail -500 /var/log/<service>/error.log
```

If none of those apply, ask the user: "Where are your logs? Give me a path, container name, or command to pull them."

For high-volume services, do not pipe full log streams into your context. Filter for the error string first, then pull surrounding lines:

```bash
docker logs <container_name> --since 1h 2>&1 \
  | grep -i "<error_string>" -A 5 -B 2 | head -100
```

From the log output, extract:

- The first occurrence of the error (to establish when it started)
- The frequency (every request, intermittent, or one-off?)
- Any stack traces or error details
- Any upstream errors (database connections, external API failures, DNS issues)
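
To pin down that first occurrence when the logs live in a file, something like this works (a sketch; for Docker or systemd, pipe the earlier commands through the same `grep` instead):

```bash
# Earliest matching line, with its line number, to anchor the timeline
grep -in "<error_string>" /var/log/<service>/error.log | head -1
```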

Before including any log lines in the triage summary, scan them for credentials, API keys, tokens, connection strings, or PII. Redact anything sensitive. The summary will likely be pasted into Slack, a ticket, or an incident-management tool, and you do not want secrets riding along.
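
A rough first-pass scrub, assuming GNU sed and a hypothetical `excerpt.log`, can mask the obvious key=value patterns; it is a helper, not a substitute for reading the excerpt yourself:

```bash
# Masks secret-looking key=value pairs; the pattern list is illustrative,
# extend it for your environment
sed -E 's/(password|passwd|secret|token|api[_-]?key)=[^[:space:]]*/\1=[REDACTED]/gI' excerpt.log
```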

### 4. Check health endpoints

If the service has health check URLs, check them. Derive the base URL from the alert payload or from the user's input rather than assuming `localhost`. If the agent is running on a development machine or CI host, `localhost` points at that machine, not the affected service.

```bash
curl -s -o /dev/null -w "%{http_code} %{time_total}s" \
  https://<service-host>/health
```

Also check common health paths: `/health`, `/healthz`, `/api/health`, `/status`, `/ready`.
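
If you are not sure which path the service exposes, a quick loop over the common ones works (a sketch; substitute the real host):

```bash
for path in /health /healthz /api/health /status /ready; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "https://<service-host>${path}")
  echo "${path}: ${code}"
done
```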

Record:

- HTTP status code
- Response time
- Whether the response body indicates degraded components

If you do not know the service URL, ask the user or skip this step.

### 5. Check resource usage

If you have shell access to the affected host, sample current resource usage. Use a two-sample `top` so the second reading reflects current activity rather than since-boot averages:

```bash
# CPU and memory: awk keeps only the second sample, which reflects
# current activity rather than since-boot averages
top -bn2 -d 0.5 | awk '/^top -/{s++} s==2' | head -25

# Disk usage
df -h

# Open connections
ss -s
```

Note anything unusual: high CPU, memory near capacity, disk full, connection count spikes.
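
If memory or CPU looks high, the heaviest processes help narrow it down (a sketch, assuming a Linux host with procps `ps`):

```bash
# Top memory consumers; sort by -%cpu instead to rank by CPU
ps aux --sort=-%mem | head -10
```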

### 6. Build a timeline

From the information gathered in steps 1 through 5, construct a timeline:

- When was the last successful state (if determinable)?
- When did the first error appear in logs?
- Were there any deployments between the last success and the first error?
- Is the error ongoing or did it resolve?

### 7. Assess severity

Based on the evidence, assign a severity:

- SEV-1 (Critical): Service is fully down. Users cannot complete core actions. Data loss is occurring or likely.
- SEV-2 (High): Service is degraded. Some users are affected, or a non-core function is completely broken.
- SEV-3 (Medium): Errors are occurring but the service is functional. Performance is degraded but usable.
- SEV-4 (Low): Cosmetic or minor issue. No user impact yet, but the error rate is elevated.

### 8. Produce the triage summary

Output the summary in this format:

```
## Incident triage

**Service:** [name]
**Severity:** [SEV-1/2/3/4]: [one-line reason]
**Status:** [ongoing / intermittent / resolved]
**Started:** [timestamp or estimate]
**Duration:** [how long so far]

### Timeline

- [timestamp] [event]
- [timestamp] [event]
- ...

### What we know

[2-4 sentences summarizing the evidence. What is broken, how it
manifests, and who is affected.]

### Likely cause

[Your best assessment based on the evidence. Be specific.
"The deployment at 14:32 changed the database connection pool
configuration, and the first connection timeout appeared at 14:35"
is better than "something might be wrong with the database."]

### Suggested next steps (human actions)

1. [Most important action for a human to take now]
2. [Second priority]
3. [Third priority]

### Raw evidence

<details>
<summary>Log excerpts</summary>

[Relevant log lines, redacted of any credentials or PII]

</details>

<details>
<summary>Health check results</summary>

[Output from health checks]

</details>
```

## Important rules

- NEVER take remediation actions. Do not restart services, roll back deployments, modify configuration, or change anything. This skill is investigation only.
- NEVER guess when you can check. If you can run a command to verify something, run it.
- NEVER bury the severity. Put it at the top of the summary so the on-call engineer sees it immediately.
- NEVER include credentials, API keys, tokens, or PII in the triage summary. Scan log excerpts before pasting them into Raw evidence and redact anything sensitive.
- Suggested next steps describe what a human should do, not what you have done or will do. The "(human actions)" label exists so the on-call engineer is never unsure who is acting.
- If you cannot access logs or health endpoints, say so explicitly. Do not fabricate evidence.
- If the evidence is ambiguous, say "likely cause" not "root cause." You are triaging, not concluding.
- Keep the triage summary under 50 lines. The on-call engineer needs to read it in 30 seconds.
