
AI agents for DevOps and SRE teams

Incident response, runbook automation, log analysis, and alert triage. How agent skills fit into the on-call workflow without making things worse.


It’s 3am. PagerDuty wakes you up. The dashboard is red. You’re scrolling through logs half-asleep trying to figure out what broke. You haven’t even found the right Grafana panel yet, and you’re already ten minutes into the incident. This is where agent skills could actually help, if you set them up right.

I want to be direct about something: agents in DevOps are high-risk, high-reward. A bad agent can escalate a P3 into a P1. A good one saves you 20 minutes of context-gathering at 3am when your brain is running at maybe 40% capacity. The difference comes down to how you scope the skills and what you let the agent do on its own.

Alert triage: context before you open your laptop

The most immediately useful skill for on-call engineers is alert triage. When an alert fires, the agent reads it, pulls recent logs from the affected service, checks what was deployed in the last few hours, and gives you a summary. By the time you open your laptop, you already know what you’re dealing with.

Here’s what a triage skill definition might look like (this is pseudocode showing the structure, not a real config format):

name: triage_alert
description: >
  Given an alert, gather context from logs, recent deployments,
  and service health checks. Return a summary with likely cause.
inputs:
  alert_id: string
  service: string
  severity: enum[critical, warning, info]
steps:
  - fetch_alert_details:
      source: pagerduty
      id: "{{alert_id}}"
  - fetch_recent_logs:
      source: loki
      service: "{{service}}"
      window: 30m
      filter: "level=error OR level=warn"
  - fetch_recent_deploys:
      source: argocd
      service: "{{service}}"
      window: 4h
  - check_health_endpoints:
      service: "{{service}}"
  - summarize:
      prompt: >
        Given the alert, recent error logs, deployment history,
        and health check results, provide a 3-5 sentence summary
        of the likely issue and recommended next steps.

The output goes to Slack, tagged with the on-call engineer. You wake up, check your phone, and see: “Payment service returning 503s. Last deploy was 2 hours ago (commit abc123, changed database connection pooling config). Error logs show connection timeout to postgres-primary. Health check on /ready is failing. Likely cause: connection pool exhaustion from the config change.”

That’s worth its weight in gold at 3am. You already know where to look. For a complete, copy-and-install version of this pattern, see the incident triage skill in the skills library.
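If you want to wire the delivery step yourself, here’s a minimal sketch of posting that summary with the official @slack/web-api client. The channel name, the summary shape, and the on-call user ID are assumptions for illustration, not part of the skill above:

import { WebClient } from "@slack/web-api";

// Shape of the triage skill's output -- an assumption for this sketch.
interface TriageSummary {
  service: string;
  severity: "critical" | "warning" | "info";
  likelyCause: string;
  recommendedNextSteps: string[];
}

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

// Post the summary to the incident channel and tag the on-call engineer.
async function postTriageSummary(summary: TriageSummary, onCallUserId: string) {
  const steps = summary.recommendedNextSteps.map((s) => `- ${s}`).join("\n");
  await slack.chat.postMessage({
    channel: "#incidents", // hypothetical channel name
    text: `<@${onCallUserId}> [${summary.severity}] ${summary.service}: ${summary.likelyCause}\nNext steps:\n${steps}`,
  });
}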

Log analysis: faster than grep when you’re half-asleep

Running grep and jq commands against log aggregators is fine when you’re alert and caffeinated. At 3am, you’re neither. A log analysis tool takes a natural language query and translates it into the right LogQL, Splunk SPL, or CloudWatch Insights query.

interface LogSearchTool {
  name: "search_logs";
  input: {
    service: string;
    timeWindow: string; // "last 30m", "last 2h", etc.
    query: string; // natural language: "errors related to database connections"
  };
  output: {
    matchingLines: LogEntry[];
    summary: string;
    suggestedQuery: string; // the actual LogQL/SPL it generated
  };
}

The key detail here: the tool should return the generated query alongside the results. You want to verify the agent searched for the right thing. If it translated “database connection errors” into {service="payment"} |= "ECONNREFUSED" but missed ETIMEDOUT, you can adjust. Transparency matters when you’re debugging production.
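As a usage sketch of that feedback loop, assuming a searchLogs function that implements the interface above (the label names and query strings are illustrative):

// Hypothetical function implementing the search_logs interface above.
declare function searchLogs(input: {
  service: string;
  timeWindow: string;
  query: string;
}): Promise<{ matchingLines: unknown[]; summary: string; suggestedQuery: string }>;

const result = await searchLogs({
  service: "payment",
  timeWindow: "last 30m",
  query: "errors related to database connections",
});

console.log(result.suggestedQuery);
// e.g. {service="payment"} |= "ECONNREFUSED"

// Too narrow: broaden to a regex filter that also catches timeouts,
// then re-run the search with the adjusted query.
const broadened = '{service="payment"} |~ "ECONNREFUSED|ETIMEDOUT"';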

I’ve found this pattern works well as a second step after the triage skill runs. The triage gives you a hypothesis, the log search lets you verify it quickly.

Runbook execution: automated investigation, not remediation

This is where things get interesting and where you need to be careful. A runbook execution skill follows a predefined checklist and reports back. It checks the health endpoint, verifies database connectivity, looks at disk space, checks certificate expiration, and reports the results.

Again, pseudocode to illustrate the structure:

name: run_investigation_playbook
description: >
  Execute a read-only investigation playbook for a service.
  Report findings without taking any remediation action.
inputs:
  service: string
  playbook: enum[api_health, database, storage, networking]
playbooks:
  api_health:
    - check: HTTP GET /health
      expect: 200
    - check: HTTP GET /ready
      expect: 200
    - check: response_time /health
      expect: < 500ms
    - check: active_connections
      source: prometheus
      expect: < max_pool_size * 0.8
  database:
    - check: pg_isready
    - check: active_connections vs max_connections
    - check: replication_lag
      expect: < 30s
    - check: slow_query_log
      window: 15m
    - check: disk_usage on data volume
      expect: < 85%

Notice every step is read-only. The skill checks things. It doesn’t restart services, scale pods, or modify configuration. That’s intentional. I’ll come back to this.
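One way to make read-only a property the runner enforces, rather than a convention playbook authors have to remember, is an allow-list at the execution layer. A minimal sketch, with the operation names as assumptions:

// Operations the investigation runner is allowed to perform.
// Anything not on this list is rejected, even if a playbook asks for it.
const READ_ONLY_OPS = new Set([
  "http_get",
  "prometheus_query",
  "pg_isready",
  "read_slow_query_log",
  "check_disk_usage",
]);

interface PlaybookStep {
  op: string;
  args: Record<string, string>;
}

function assertReadOnly(step: PlaybookStep): void {
  if (!READ_ONLY_OPS.has(step.op)) {
    throw new Error(
      `Playbook step "${step.op}" is not on the read-only allow-list; refusing to run it.`
    );
  }
}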

Incident summarization and post-mortem drafting

After the incident is resolved, nobody wants to spend an hour writing up what happened. An incident summarization skill can compile a timeline from your Slack channel, PagerDuty alerts, deployment logs, and monitoring data.

Feed it the incident Slack channel and it produces:

  • A timeline of events (alert fired at 3:02am, on-call acknowledged at 3:07am, root cause identified at 3:23am, fix deployed at 3:41am, monitoring confirmed recovery at 3:48am)
  • The root cause, pulled from the conversation
  • What was done to resolve it
  • A list of follow-up action items mentioned in the thread

This becomes the first draft of your post-mortem. A human still reviews and edits it. The agent handles the tedious part of collecting timestamps and organizing the narrative. For more on how to write the instructions that make this output useful, see writing skill instructions. If your team also needs to generate release notes at deploy time, the release notes skill handles the full git-to-markdown pipeline, including PR links and contributor lists.
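A minimal sketch of the timeline-gathering step, assuming you know the incident channel ID and using Slack’s conversations.history API; the mapping to timeline entries is deliberately naive:

import { WebClient } from "@slack/web-api";

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

interface TimelineEntry {
  at: Date;
  who: string;
  text: string;
}

// Pull the raw material for the post-mortem timeline out of the incident channel.
async function fetchIncidentTimeline(channelId: string): Promise<TimelineEntry[]> {
  const history = await slack.conversations.history({ channel: channelId, limit: 500 });
  return (history.messages ?? [])
    .map((m) => ({
      at: new Date(parseFloat(m.ts ?? "0") * 1000), // Slack timestamps are epoch seconds
      who: m.user ?? "unknown",
      text: m.text ?? "",
    }))
    .sort((a, b) => a.at.getTime() - b.at.getTime());
}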

What NOT to automate

Here’s my strong opinion on this: do not give your agent write access to production. Not at first. Maybe not ever, depending on your risk tolerance.

Agents should not:

  • Restart services or pods
  • Roll back deployments
  • Modify database records
  • Change DNS or load balancer configuration
  • Scale infrastructure up or down
  • Modify firewall rules

“But automated rollback would save time!” Sure. It would also roll back perfectly fine deployments when the agent misdiagnoses the problem. It would roll back during a database migration, leaving you with schema mismatches. It would roll back three services in a chain because the agent didn’t understand the dependency order.

The failure modes of automated remediation are worse than the problem it solves. A human can investigate with agent-gathered context in 5 minutes. An agent that rolls back the wrong service can turn a 30-minute incident into a 3-hour one.

The trust ladder

If you still want to move toward more autonomous agents (and I understand the appeal), do it gradually. Think of it as a ladder with four rungs.

Rung 1 is read-only investigation. The agent gathers context, searches logs, checks health endpoints, and reports. This is where every team should start. There’s almost no downside risk. The worst case is the agent gives you a bad summary and you ignore it.

Rung 2 is notifications and routing. The agent creates Jira tickets, updates status pages, posts to Slack, and pages the right person. These are low-risk writes. A bad ticket or a wrong page is annoying but not destructive.

Rung 3 is sandboxed remediation. The agent can take predefined actions in non-production environments. It can restart a staging service or scale a dev cluster. You’re building confidence in the remediation logic without production risk.

Rung 4 is production remediation with approval gates. The agent proposes an action, a human approves it, and the agent executes it. See human-in-the-loop patterns for how to implement approval gates that work under pressure. The key here is that the agent does the typing, but a human makes the decision.
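In code, rung 4 can be as simple as the sketch below, assuming a requestApproval helper that surfaces the proposal to a human and resolves when they respond. How that response is collected (Slack button, CLI prompt, ticket comment) is an implementation detail left out here:

interface ProposedAction {
  description: string;          // e.g. "Roll back payment-service to the previous release"
  command: () => Promise<void>; // the action the agent wants to take
}

// Hypothetical: resolves true only when a human explicitly approves.
declare function requestApproval(proposal: string): Promise<boolean>;

async function executeWithApproval(action: ProposedAction): Promise<void> {
  const approved = await requestApproval(action.description);
  if (!approved) {
    console.log(`Approval denied or timed out; not executing: ${action.description}`);
    return;
  }
  // The agent does the typing, but only after a human made the call.
  await action.command();
}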

Most teams should stay on rungs 1 and 2 for a long time. Rungs 3 and 4 require serious investment in testing, guardrails, and error handling before they’re safe.

Practical setup tips

A few things I’ve learned from building these kinds of skills:

Keep your skill outputs structured. When a triage skill returns freeform text, it’s hard to pipe into other tools. When it returns JSON with fields for severity, likely_cause, affected_services, and recommended_actions, you can build dashboards and routing logic on top of it.
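For example, a triage output type along these lines (the field names follow the ones above; the confidence field is an extra assumption):

interface TriageResult {
  severity: "critical" | "warning" | "info";
  likely_cause: string;
  affected_services: string[];
  recommended_actions: string[];
  confidence: "high" | "medium" | "low"; // hypothetical extra field: lets routing logic decide whether to page
}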

Version your runbook playbooks the same way you version your application code. When someone updates a health check URL or changes a threshold, that change should go through code review. Your agent’s investigation logic is code. Treat it that way.

Test skills against past incidents. Take your last ten P1s, feed the alert data into your triage skill, and see if the output would have been helpful. This is the fastest way to iterate on prompt quality and skill design.
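The replay harness doesn’t need to be fancy. Something like this sketch, where pastIncidents is a hand-built list of old alert payloads paired with the root cause you eventually found, is enough to start (all names here are assumptions):

interface PastIncident {
  alertPayload: unknown;   // the original PagerDuty alert, exported after the fact
  actualRootCause: string; // what the post-mortem concluded
}

// Hypothetical: runs the triage skill and returns its summary text.
declare function runTriageSkill(alertPayload: unknown): Promise<string>;

async function replayIncidents(pastIncidents: PastIncident[]): Promise<void> {
  for (const incident of pastIncidents) {
    const summary = await runTriageSkill(incident.alertPayload);
    // No automated scoring here: print both and judge by hand whether
    // the summary would have pointed you at the real cause.
    console.log("--- actual root cause:", incident.actualRootCause);
    console.log("--- triage summary:", summary);
  }
}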

Set timeouts aggressively. A skill that hangs for five minutes while querying a log aggregator is worse than no skill at all. If the log search takes longer than 30 seconds, return what you have and tell the engineer the full search is still running.
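One way to implement that deadline, assuming the log backend client exposes both a full search and a cheaper quick search (both are assumptions):

// Race the full search against a 30-second deadline. If the deadline wins,
// return whatever the quick search finds and flag the result as partial.
async function searchWithDeadline(
  fullSearch: Promise<string[]>,
  quickSearch: () => Promise<string[]>,
  deadlineMs = 30_000
): Promise<{ lines: string[]; partial: boolean }> {
  const timeout = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), deadlineMs)
  );

  const winner = await Promise.race([fullSearch, timeout]);
  if (winner === "timeout") {
    return { lines: await quickSearch(), partial: true };
  }
  return { lines: winner, partial: false };
}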

For the kind of maintenance automation that’s safe to run regularly, the dependency updater skill applies the same risk-tiering logic to package updates with test verification before committing. And for documenting your own DevOps surface, the onboarding checklist skill generates a complete developer setup guide by reading your project’s config files.

The on-call engineer’s new workflow

With these skills in place, your 3am incident goes differently. PagerDuty fires. Before you’re fully awake, the triage skill has posted a summary to Slack. You read it on your phone. You already know it’s a database connection issue related to a recent config change. You open your laptop, ask the log search skill for the specific error patterns, confirm the hypothesis in 2 minutes, and deploy the fix. Total time from alert to resolution: 15 minutes instead of 45.

That’s the promise of agents in DevOps. Not replacing the on-call engineer. Making their worst nights less terrible. And if you’re worried about the agent making things worse, keep it read-only. A skill that only gathers information can’t break anything. Start there and see how it goes.

For guidance on deciding when agent automation is the wrong call entirely, see when not to use agents. And for the security implications of giving agents access to production systems, see security considerations.