Developer · TypeScript · code review · composition · intermediate

Building a Code Review Skill

A multi-step code review skill that diffs changes, analyzes code quality, and generates structured feedback, demonstrating skill composition and error handling.

Most useful agent skills are not single-shot operations. They chain steps together, where one step’s output feeds the next. This example builds a code review skill that retrieves diffs, analyzes them for issues, and produces a structured report. The goal is to show how to compose multi-step workflows while keeping each stage testable on its own.

If you are new to skill design, start with the file search example, which covers skill definitions, parameter design, and basic error handling. This article builds on those ideas with composition, structured output, and more involved error recovery.

What we’re building

A skill that accepts a set of changed files (or a git ref range), retrieves the diffs, analyzes each one for code quality issues, and returns a structured review with severity levels and line references. An agent could use this to give automated code review feedback during a pull request workflow.

The skill has three stages:

  1. Diff retrieval, where we get the actual changes from git or a provided patch
  2. Analysis, where we scan each diff hunk for issues (complexity, style, bugs, security)
  3. Report generation, where we assemble findings into a structured format the agent can act on

The skill definition

const codeReviewSkill = {
  name: "review_code_changes",
  description: `Review code changes for quality issues, bugs, and style problems.
Use this when you need to analyze a diff or set of changed files and
produce structured feedback. Accepts either a git ref range (e.g.,
"main..feature-branch") or a list of file paths to review against
the working tree.

Returns a structured report with issues categorized by severity
(critical, warning, suggestion) and linked to specific line numbers.

Do NOT use this for formatting or auto-fixing code — use a formatter
or linter tool instead. This skill is read-only and produces feedback
only.`,
  parameters: {
    type: "object",
    properties: {
      refRange: {
        type: "string",
        description:
          'Git ref range to diff (e.g., "main..HEAD", "abc123..def456"). ' +
          "If omitted, reviews unstaged changes in the working tree.",
      },
      files: {
        type: "array",
        items: { type: "string" },
        description:
          "Specific file paths to review. If omitted, reviews all changed files.",
      },
      maxFiles: {
        type: "number",
        description:
          "Maximum number of files to review. Defaults to 20. " +
          "Use this to limit scope on large changesets.",
      },
    },
    required: [],
  },
};

Every parameter is optional, so the most common case (reviewing unstaged changes) works with zero configuration. The maxFiles parameter exists because large changesets produce enormous outputs that can blow out an agent’s context window, a concern covered in more detail in the tool use patterns guide. The description explicitly says what this skill does not do (formatting, auto-fixing) to steer the agent away from misusing it.

Core types

Before the implementation, we define the data structures that flow through the pipeline. Good types pay off immediately: they document the contract, catch bugs at compile time, and make the structured output predictable for the agent.

/** A single issue found during review */
interface ReviewIssue {
  file: string;
  line: number;
  endLine?: number;
  severity: "critical" | "warning" | "suggestion";
  category: "bug" | "security" | "complexity" | "style" | "performance";
  message: string;
  suggestion?: string;
}

/** A parsed diff hunk for a single file */
interface FileDiff {
  path: string;
  hunks: Array<{
    oldStart: number;
    newStart: number;
    lines: string[];
  }>;
  isBinary: boolean;
  isNew: boolean;
  isDeleted: boolean;
}

/** The final structured report */
interface ReviewReport {
  summary: {
    filesReviewed: number;
    totalIssues: number;
    critical: number;
    warnings: number;
    suggestions: number;
  };
  issues: ReviewIssue[];
  skippedFiles: Array<{
    path: string;
    reason: string;
  }>;
  metadata: {
    refRange: string | null;
    reviewedAt: string;
    truncated: boolean;
  };
}

The skippedFiles field matters more than it might appear. Binary files, generated files, and huge diffs can’t be meaningfully reviewed, but the agent should still know they existed in the changeset. If you silently drop files, the agent has no way to know the review is incomplete.
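To make the contract concrete, here is an illustrative, hand-written report for a changeset with one reviewable file and one binary file. All values (paths, timestamps, messages) are made up for the example; a real report comes from generateReport later in this article.

```typescript
// Illustrative ReviewReport-shaped value; every field is invented for the example.
const exampleReport = {
  summary: {
    filesReviewed: 1,
    totalIssues: 1,
    critical: 1,
    warnings: 0,
    suggestions: 0,
  },
  issues: [
    {
      file: "src/config.ts",
      line: 12,
      severity: "critical",
      category: "security",
      message: "Possible hardcoded secret or credential",
      suggestion: "Move to environment variables or a secrets manager.",
    },
  ],
  // The binary file is not silently dropped: the agent can see it was skipped.
  skippedFiles: [
    {
      path: "assets/logo.png",
      reason: "Binary file -- cannot perform text-based review",
    },
  ],
  metadata: {
    refRange: "main..HEAD",
    reviewedAt: "2024-01-01T00:00:00.000Z",
    truncated: false,
  },
};
```

Note that filesReviewed counts only the text file; the binary file appears exclusively under skippedFiles, so the two lists together account for the whole changeset.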

Stage 1: diff retrieval

import { execFile } from "child_process";
import { promisify } from "util";

const execFileAsync = promisify(execFile);

async function getDiffs(
  refRange?: string,
  files?: string[],
  maxFiles: number = 20,
): Promise<{ diffs: FileDiff[]; truncated: boolean }> {
  // Build the git diff command
  const args = ["diff", "--unified=3", "--no-color"];

  if (refRange) {
    // Validate the ref range to prevent command injection.
    // Git refs can contain alphanumerics, dashes, dots, slashes,
    // tildes, carets, @ and the range operators ".." / "..."
    if (!/^[\w.\/\-~^@]+\.\.\.?[\w.\/\-~^@]+$/.test(refRange)) {
      throw new DiffError(
        `Invalid ref range format: "${refRange}". ` +
          'Expected format like "main..HEAD" or "abc123..def456".',
      );
    }
    args.push(refRange);
  }

  if (files && files.length > 0) {
    args.push("--", ...files);
  }

  let stdout: string;
  try {
    const result = await execFileAsync("git", args, {
      maxBuffer: 10 * 1024 * 1024, // 10MB buffer for large diffs
    });
    stdout = result.stdout;
  } catch (err: unknown) {
    if (err && typeof err === "object" && "code" in err && err.code === 128) {
      throw new DiffError(
        "Not a git repository, or the specified refs do not exist.",
      );
    }
    throw err;
  }

  if (!stdout.trim()) {
    return { diffs: [], truncated: false };
  }

  const allDiffs = parseDiffOutput(stdout);
  const truncated = allDiffs.length > maxFiles;

  return {
    diffs: allDiffs.slice(0, maxFiles),
    truncated,
  };
}

The ref range validation is worth looking at closely. We use execFile rather than exec to avoid shell injection, but we still validate the ref range format as a second line of defense. Agent-provided inputs should always be treated as untrusted. The agent is constructing these strings from user requests, and prompt injection or a simple misunderstanding can produce unexpected values.

The 10MB buffer limit is a practical ceiling. Diffs larger than that are almost certainly too big for useful review feedback, and parsing them would eat a lot of memory.
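The ref-range pattern can be exercised in isolation to see what it accepts and rejects. This is a sketch using the same regex as getDiffs; the sample inputs are illustrative.

```typescript
// The same ref-range validation pattern used in getDiffs, shown in isolation.
const REF_RANGE = /^[\w.\/\-~^@]+\.\.\.?[\w.\/\-~^@]+$/;

const looksLikeRefRange = (input: string): boolean => REF_RANGE.test(input);

// Legitimate two-dot and three-dot ranges pass:
//   "main..HEAD", "abc123..def456", "origin/main...feature/x"
// Shell metacharacters, option-like strings, and bare refs are rejected:
//   "main..HEAD; rm -rf /", "--output=evil", "main"
```

The character class deliberately excludes spaces, semicolons, and `=`, so anything resembling a shell command or a git option fails validation before it ever reaches execFile.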

class DiffError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "DiffError";
  }
}

function parseDiffOutput(raw: string): FileDiff[] {
  const fileSections = raw.split(/^diff --git /m).filter(Boolean);
  return fileSections.map((section) => {
    const lines = section.split("\n");

    // Extract file path from the diff header
    const pathMatch = lines[0].match(/b\/(.+)$/);
    const path = pathMatch ? pathMatch[1] : "unknown";

    // Detect binary files
    if (section.includes("Binary files")) {
      return {
        path,
        hunks: [],
        isBinary: true,
        isNew: false,
        isDeleted: false,
      };
    }

    // Detect new and deleted files
    const isNew = section.includes("new file mode");
    const isDeleted = section.includes("deleted file mode");

    // Parse hunks
    const hunks: FileDiff["hunks"] = [];
    const hunkRegex = /^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@/;

    let currentHunk: FileDiff["hunks"][0] | null = null;
    for (const line of lines) {
      const hunkMatch = line.match(hunkRegex);
      if (hunkMatch) {
        currentHunk = {
          oldStart: parseInt(hunkMatch[1], 10),
          newStart: parseInt(hunkMatch[2], 10),
          lines: [],
        };
        hunks.push(currentHunk);
      } else if (
        currentHunk &&
        (line.startsWith("+") || line.startsWith("-") || line.startsWith(" "))
      ) {
        currentHunk.lines.push(line);
      }
    }

    return { path, hunks, isBinary: false, isNew, isDeleted };
  });
}

Stage 2: analysis

The analysis stage scans each diff for patterns that indicate problems. In production, you might delegate this to an external linter or even a second LLM call. Here we use a rule-based analyzer to keep the example self-contained.

/** Rule definition: a pattern to match and the issue it produces */
interface AnalysisRule {
  id: string;
  pattern: RegExp;
  severity: ReviewIssue["severity"];
  category: ReviewIssue["category"];
  message: string;
  suggestion?: string;
  /** Only apply to files matching this glob */
  filePattern?: RegExp;
}

const DEFAULT_RULES: AnalysisRule[] = [
  {
    id: "no-console-log",
    pattern: /console\.log\(/,
    severity: "warning",
    category: "style",
    message: "console.log left in code",
    suggestion: "Remove or replace with a proper logger.",
  },
  {
    id: "todo-fixme",
    pattern: /\/\/\s*(TODO|FIXME|HACK|XXX)\b/,
    severity: "suggestion",
    category: "style",
    message: "Unresolved TODO/FIXME comment",
    suggestion: "Address the TODO or create a tracking issue.",
  },
  {
    id: "hardcoded-secret",
    pattern: /(password|secret|api_key|apikey|token)\s*[:=]\s*["'][^"']+["']/i,
    severity: "critical",
    category: "security",
    message: "Possible hardcoded secret or credential",
    suggestion:
      "Move to environment variables or a secrets manager. Never commit credentials.",
  },
  {
    id: "large-function",
    pattern: /\{\s*$/,
    severity: "suggestion",
    category: "complexity",
    message: "Function may be getting too long",
    suggestion: "Consider extracting into smaller, focused functions.",
    // This rule is simplistic -- the analyzer strips the leading "+"
    // before matching, and a real implementation would count the lines
    // between braces. Included to illustrate the pattern.
  },
  {
    id: "any-type",
    pattern: /:\s*any\b/,
    severity: "warning",
    category: "style",
    message: "Usage of 'any' type defeats TypeScript's type safety",
    suggestion: "Use a specific type, 'unknown', or a generic parameter.",
    filePattern: /\.(ts|tsx)$/,
  },
  {
    id: "eval-usage",
    pattern: /\beval\s*\(/,
    severity: "critical",
    category: "security",
    message: "Use of eval() is a security risk",
    suggestion:
      "Replace with a safer alternative such as JSON.parse for data. " +
      "Note that the Function constructor carries similar risks.",
  },
];

function analyzeFileDiff(
  diff: FileDiff,
  rules: AnalysisRule[] = DEFAULT_RULES,
): ReviewIssue[] {
  const issues: ReviewIssue[] = [];

  // Skip binary files entirely
  if (diff.isBinary) return issues;

  // Filter rules to those applicable to this file type
  const applicableRules = rules.filter(
    (rule) => !rule.filePattern || rule.filePattern.test(diff.path),
  );

  for (const hunk of diff.hunks) {
    let currentLine = hunk.newStart;

    for (const line of hunk.lines) {
      // Only analyze added lines (starting with +)
      if (line.startsWith("+")) {
        const content = line.slice(1); // Remove the leading +

        for (const rule of applicableRules) {
          if (rule.pattern.test(content)) {
            issues.push({
              file: diff.path,
              line: currentLine,
              severity: rule.severity,
              category: rule.category,
              message: rule.message,
              suggestion: rule.suggestion,
            });
          }
        }
      }

      // Track the current line number in the new file
      if (!line.startsWith("-")) {
        currentLine++;
      }
    }
  }

  return issues;
}

We only analyze added lines (those starting with + in the diff). Flagging issues in removed code would be confusing since those lines are going away. When building skills that process diffs, always think about which side of the change matters.

The rule set is extensible by design. Each rule is a plain object, so consumers can add custom rules, disable defaults, or load rules from config. This follows the same idea from the tool use patterns guide: make skills configurable without requiring code changes.
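As a sketch of that extensibility, here is a hypothetical consumer-defined rule. The id, pattern, message, and filePattern are all invented for illustration; the object just follows the AnalysisRule shape defined above.

```typescript
// A hypothetical custom rule following the AnalysisRule shape.
// "as const" keeps the severity/category strings assignable to the
// literal union types in the interface.
const noSkippedTests = {
  id: "no-skipped-tests",
  pattern: /\b(it|describe|test)\.skip\(/,
  severity: "warning" as const,
  category: "style" as const,
  message: "Skipped test left in code",
  suggestion: "Re-enable the test or delete it with a note explaining why.",
  filePattern: /\.(test|spec)\.tsx?$/,
};

// A consumer would pass it alongside the defaults:
//   analyzeFileDiff(diff, [...DEFAULT_RULES, noSkippedTests]);
```

Because rules are data rather than code, a project could also load a set like this from a config file and merge it with DEFAULT_RULES at startup.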

Stage 3: report generation

function generateReport(
  diffs: FileDiff[],
  issues: ReviewIssue[],
  truncated: boolean,
  refRange: string | null,
): ReviewReport {
  const skippedFiles: ReviewReport["skippedFiles"] = [];

  // Record skipped binary files
  for (const diff of diffs) {
    if (diff.isBinary) {
      skippedFiles.push({
        path: diff.path,
        reason: "Binary file — cannot perform text-based review",
      });
    }
  }

  // Count severities
  const critical = issues.filter((i) => i.severity === "critical").length;
  const warnings = issues.filter((i) => i.severity === "warning").length;
  const suggestions = issues.filter((i) => i.severity === "suggestion").length;

  // Sort issues: critical first, then warnings, then suggestions.
  // Within each severity, sort by file path then line number.
  const sortedIssues = [...issues].sort((a, b) => {
    const severityOrder = { critical: 0, warning: 1, suggestion: 2 };
    const severityDiff = severityOrder[a.severity] - severityOrder[b.severity];
    if (severityDiff !== 0) return severityDiff;
    const fileDiff = a.file.localeCompare(b.file);
    if (fileDiff !== 0) return fileDiff;
    return a.line - b.line;
  });

  return {
    summary: {
      filesReviewed: diffs.filter((d) => !d.isBinary).length,
      totalIssues: issues.length,
      critical,
      warnings,
      suggestions,
    },
    issues: sortedIssues,
    skippedFiles,
    metadata: {
      refRange,
      reviewedAt: new Date().toISOString(),
      truncated,
    },
  };
}

Sorting issues by severity is not just cosmetic. When the agent reads this report, the most important items show up first. If the agent’s context is limited and it can’t process the whole report, it still sees the critical issues. This applies to any skill that returns variable-length output: put the most important stuff first.
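One way a consumer could exploit that ordering is to render only the first few issues when context is tight. This formatTopIssues helper is a hypothetical sketch, not part of the skill; it uses a minimal structural type covering just the fields it reads.

```typescript
// Minimal structural type mirroring the issue fields the formatter needs.
interface IssueLike {
  file: string;
  line: number;
  severity: string;
  message: string;
}

// Renders at most `limit` issues. Because generateReport sorts critical
// issues first, truncating here drops the least important items.
function formatTopIssues(issues: IssueLike[], limit = 5): string {
  const shown = issues.slice(0, limit);
  const lines = shown.map(
    (i) => `[${i.severity}] ${i.file}:${i.line} ${i.message}`,
  );
  const hidden = issues.length - shown.length;
  if (hidden > 0) lines.push(`(+${hidden} more issues omitted)`);
  return lines.join("\n");
}
```

The omitted-count trailer matters for the same reason skippedFiles does: the reader should always know when a view is incomplete.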

Composing the pipeline

Now we wire the three stages together into a single entry point with error handling at each boundary.

interface ReviewOptions {
  refRange?: string;
  files?: string[];
  maxFiles?: number;
}

type ReviewResult =
  | ReviewReport
  | { success: false; error: string; suggestion: string };

async function reviewCodeChanges(
  options: ReviewOptions = {},
): Promise<ReviewResult> {
  const { refRange, files, maxFiles = 20 } = options;

  try {
    // Stage 1: Retrieve diffs
    const { diffs, truncated } = await getDiffs(refRange, files, maxFiles);

    if (diffs.length === 0) {
      return {
        success: false,
        error: "No changes found to review.",
        suggestion: refRange
          ? `Check that the ref range "${refRange}" contains changes. ` +
            "Try 'git log --oneline' to verify the refs exist."
          : "There are no unstaged changes. Stage some changes or provide a ref range.",
      };
    }

    // Stage 2: Analyze each diff
    const allIssues: ReviewIssue[] = [];
    for (const diff of diffs) {
      const fileIssues = analyzeFileDiff(diff);
      allIssues.push(...fileIssues);
    }

    // Stage 3: Generate the report
    return generateReport(diffs, allIssues, truncated, refRange ?? null);
  } catch (err) {
    if (err instanceof DiffError) {
      return {
        success: false,
        error: err.message,
        suggestion:
          "Verify you are in a git repository and the refs are valid. " +
          "Run 'git status' to check the repository state.",
      };
    }

    return {
      success: false,
      error: `Unexpected error during review: ${
        err instanceof Error ? err.message : String(err)
      }`,
      suggestion:
        "This may be a transient error. Try again, or review the files manually.",
    };
  }
}

Every failure path returns a suggestion field telling the agent what to do next. Without that, the agent is left guessing, and agents are bad at recovering from opaque errors. This pattern comes up again in the "what are agent skills" guide.

The “no changes found” case is an error, not an empty success. An empty report would be ambiguous: did the skill work and find nothing, or did it fail to locate the changes? An explicit failure with a helpful message is always clearer.
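On the calling side, the two ReviewResult shapes need to be told apart before any fields are touched. Here is a sketch of how a consumer might do that with a type guard; isReport and describeResult are hypothetical helpers, and the types are local mirrors of the shapes above, kept minimal for self-containment.

```typescript
// Local mirrors of the two result shapes, reduced to what the guard needs.
type ReportLike = { summary: { totalIssues: number } };
type FailureLike = { success: false; error: string; suggestion: string };
type ResultLike = ReportLike | FailureLike;

// Narrow once; TypeScript then knows which fields exist in each branch.
function isReport(result: ResultLike): result is ReportLike {
  return "summary" in result;
}

function describeResult(result: ResultLike): string {
  return isReport(result)
    ? `Review complete: ${result.summary.totalIssues} issue(s) found.`
    : `Review failed: ${result.error} Next step: ${result.suggestion}`;
}
```

A discriminated union with an explicit success flag on both branches would make the guard even simpler; the skill keeps the report shape flag-free so the happy path serializes cleanly.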

Testing

Tests for a composed skill need to cover each stage independently and the pipeline as a whole. That gives you precise failure localization when something breaks.

import { describe, it, expect, vi, beforeEach } from "vitest";

// --- parseDiffOutput tests ---

describe("parseDiffOutput", () => {
  it("parses a simple single-file diff", () => {
    const raw = `diff --git a/src/index.ts b/src/index.ts
index abc123..def456 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -10,3 +10,4 @@ function main() {
   const x = 1;
   const y = 2;
+  console.log(x + y);
   return x + y;
`;
    const diffs = parseDiffOutput(raw);
    expect(diffs).toHaveLength(1);
    expect(diffs[0].path).toBe("src/index.ts");
    expect(diffs[0].isBinary).toBe(false);
    expect(diffs[0].hunks).toHaveLength(1);
    expect(diffs[0].hunks[0].newStart).toBe(10);
  });

  it("detects binary files", () => {
    const raw = `diff --git a/logo.png b/logo.png
Binary files /dev/null and b/logo.png differ
`;
    const diffs = parseDiffOutput(raw);
    expect(diffs).toHaveLength(1);
    expect(diffs[0].isBinary).toBe(true);
    expect(diffs[0].hunks).toHaveLength(0);
  });

  it("detects new files", () => {
    const raw = `diff --git a/new-file.ts b/new-file.ts
new file mode 100644
index 0000000..abc1234
--- /dev/null
+++ b/new-file.ts
@@ -0,0 +1,3 @@
+export function hello() {
+  return "world";
+}
`;
    const diffs = parseDiffOutput(raw);
    expect(diffs[0].isNew).toBe(true);
    expect(diffs[0].isDeleted).toBe(false);
  });

  it("handles multiple files in one diff", () => {
    const raw = `diff --git a/a.ts b/a.ts
index 111..222 100644
--- a/a.ts
+++ b/a.ts
@@ -1,1 +1,2 @@
 line1
+line2
diff --git a/b.ts b/b.ts
index 333..444 100644
--- a/b.ts
+++ b/b.ts
@@ -1,1 +1,2 @@
 lineA
+lineB
`;
    const diffs = parseDiffOutput(raw);
    expect(diffs).toHaveLength(2);
    expect(diffs[0].path).toBe("a.ts");
    expect(diffs[1].path).toBe("b.ts");
  });
});

// --- analyzeFileDiff tests ---

describe("analyzeFileDiff", () => {
  function makeDiff(path: string, addedLines: string[]): FileDiff {
    return {
      path,
      isBinary: false,
      isNew: false,
      isDeleted: false,
      hunks: [
        {
          oldStart: 1,
          newStart: 1,
          lines: addedLines.map((l) => `+${l}`),
        },
      ],
    };
  }

  it("flags console.log statements", () => {
    const diff = makeDiff("src/app.ts", ['console.log("debug");']);
    const issues = analyzeFileDiff(diff);
    expect(issues).toHaveLength(1);
    expect(issues[0].severity).toBe("warning");
    expect(issues[0].category).toBe("style");
    expect(issues[0].line).toBe(1);
  });

  it("flags hardcoded secrets as critical", () => {
    const diff = makeDiff("config.ts", ['const apiKey = "sk-12345abc";']);
    const issues = analyzeFileDiff(diff);
    const secret = issues.find((i) => i.category === "security");
    expect(secret).toBeDefined();
    expect(secret!.severity).toBe("critical");
  });

  it("flags eval usage as critical", () => {
    const diff = makeDiff("utils.ts", ["const result = eval(userInput);"]);
    const issues = analyzeFileDiff(diff);
    const evalIssue = issues.find((i) => i.message.includes("eval"));
    expect(evalIssue).toBeDefined();
    expect(evalIssue!.severity).toBe("critical");
  });

  it("only applies file-specific rules to matching files", () => {
    const tsDiff = makeDiff("src/app.ts", ["const x: any = 5;"]);
    const jsDiff = makeDiff("src/app.js", ["const x: any = 5;"]);
    expect(analyzeFileDiff(tsDiff).some((i) => i.message.includes("any"))).toBe(
      true,
    );
    expect(analyzeFileDiff(jsDiff).some((i) => i.message.includes("any"))).toBe(
      false,
    );
  });

  it("skips binary files without errors", () => {
    const binary: FileDiff = {
      path: "image.png",
      isBinary: true,
      isNew: false,
      isDeleted: false,
      hunks: [],
    };
    expect(analyzeFileDiff(binary)).toEqual([]);
  });

  it("only analyzes added lines, not removed lines", () => {
    const diff: FileDiff = {
      path: "src/app.ts",
      isBinary: false,
      isNew: false,
      isDeleted: false,
      hunks: [
        {
          oldStart: 1,
          newStart: 1,
          lines: ['-console.log("old debug");', "+// cleaned up"],
        },
      ],
    };
    const issues = analyzeFileDiff(diff);
    expect(issues.some((i) => i.message.includes("console.log"))).toBe(false);
  });

  it("reports correct line numbers across hunks", () => {
    const diff: FileDiff = {
      path: "src/app.ts",
      isBinary: false,
      isNew: false,
      isDeleted: false,
      hunks: [
        {
          oldStart: 1,
          newStart: 1,
          lines: [" clean line", " another clean line"],
        },
        {
          oldStart: 50,
          newStart: 50,
          lines: [" ok", '+console.log("here");'],
        },
      ],
    };
    const issues = analyzeFileDiff(diff);
    expect(issues).toHaveLength(1);
    expect(issues[0].line).toBe(51);
  });
});

// --- generateReport tests ---

describe("generateReport", () => {
  it("counts severities correctly", () => {
    const issues: ReviewIssue[] = [
      {
        file: "a.ts",
        line: 1,
        severity: "critical",
        category: "security",
        message: "bad",
      },
      {
        file: "a.ts",
        line: 2,
        severity: "warning",
        category: "style",
        message: "meh",
      },
      {
        file: "b.ts",
        line: 1,
        severity: "suggestion",
        category: "style",
        message: "idea",
      },
      {
        file: "b.ts",
        line: 5,
        severity: "critical",
        category: "bug",
        message: "oops",
      },
    ];
    const report = generateReport([], issues, false, "main..HEAD");
    expect(report.summary.critical).toBe(2);
    expect(report.summary.warnings).toBe(1);
    expect(report.summary.suggestions).toBe(1);
    expect(report.summary.totalIssues).toBe(4);
  });

  it("sorts issues by severity then file then line", () => {
    const issues: ReviewIssue[] = [
      {
        file: "b.ts",
        line: 10,
        severity: "suggestion",
        category: "style",
        message: "s1",
      },
      {
        file: "a.ts",
        line: 5,
        severity: "critical",
        category: "bug",
        message: "c1",
      },
      {
        file: "a.ts",
        line: 1,
        severity: "critical",
        category: "security",
        message: "c2",
      },
    ];
    const report = generateReport([], issues, false, null);
    expect(report.issues[0].message).toBe("c2"); // critical, a.ts, line 1
    expect(report.issues[1].message).toBe("c1"); // critical, a.ts, line 5
    expect(report.issues[2].message).toBe("s1"); // suggestion, b.ts, line 10
  });

  it("records skipped binary files", () => {
    const diffs: FileDiff[] = [
      {
        path: "logo.png",
        isBinary: true,
        isNew: false,
        isDeleted: false,
        hunks: [],
      },
      {
        path: "src/app.ts",
        isBinary: false,
        isNew: false,
        isDeleted: false,
        hunks: [],
      },
    ];
    const report = generateReport(diffs, [], false, null);
    expect(report.skippedFiles).toHaveLength(1);
    expect(report.skippedFiles[0].path).toBe("logo.png");
    expect(report.summary.filesReviewed).toBe(1);
  });
});

// --- Integration test for the full pipeline ---

// Mock the module that exports getDiffs. (This assumes getDiffs lives in
// its own module, "./diff-utils", and reviewCodeChanges imports it from
// there.) vi.mock hoists to the top of the file, so the mock is in place
// before any imports run.
vi.mock("./diff-utils", () => ({
  getDiffs: vi.fn(),
}));

import { getDiffs } from "./diff-utils";
const mockGetDiffs = vi.mocked(getDiffs);

describe("reviewCodeChanges (integration)", () => {
  // These tests mock getDiffs to avoid requiring a real git repo

  beforeEach(() => {
    mockGetDiffs.mockReset();
  });

  it("returns a complete report for valid changes", async () => {
    mockGetDiffs.mockResolvedValue({
      diffs: [
        {
          path: "src/app.ts",
          isBinary: false,
          isNew: false,
          isDeleted: false,
          hunks: [
            {
              oldStart: 1,
              newStart: 1,
              lines: ['+console.log("debug");', "+const x: any = 5;"],
            },
          ],
        },
      ],
      truncated: false,
    });

    const result = await reviewCodeChanges({ refRange: "main..HEAD" });
    // Type guard to verify we got a report, not an error
    expect("summary" in result).toBe(true);
    if ("summary" in result) {
      expect(result.summary.filesReviewed).toBe(1);
      expect(result.summary.totalIssues).toBeGreaterThan(0);
    }
  });

  it("returns an error when no changes are found", async () => {
    mockGetDiffs.mockResolvedValue({
      diffs: [],
      truncated: false,
    });

    const result = await reviewCodeChanges({ refRange: "main..main" });
    expect("success" in result && result.success === false).toBe(true);
  });
});

Testing strategy

The tests are layered to match the pipeline. Unit tests for parseDiffOutput cover real-world diff formats: single files, binary files, new files, multi-file diffs. Unit tests for analyzeFileDiff check that each rule fires correctly, line numbers are tracked, and edge cases (binary files, removed lines) work. Unit tests for generateReport verify aggregation, sorting, and skipped file tracking. Integration tests mock the git layer and check end-to-end behavior.

This layering means that when a test fails, you know which stage broke. If only the parser tests fail, you know the analysis and reporting logic are fine.

Design tradeoffs

Rule-based vs. LLM-based analysis. We used pattern-matching rules here. You could also send each diff to an LLM for review. Rules are fast, deterministic, and free. LLM analysis is slower, non-deterministic, and costs tokens, but it catches semantic issues that no regex can find (logic errors, missing error handling). A production skill might do both: run rules first for fast feedback, then optionally call an LLM for deeper analysis on the most complex changes.

Structured output vs. prose. We return a typed ReviewReport object rather than a markdown string. This is almost always the right call for skill output. The agent can parse structured data reliably, filter issues by severity, and decide how to present results to the user. Returning prose would force the agent to parse natural language to extract actionable items, which is fragile.

Fail loudly on ambiguous cases. When there are no changes to review, we return an explicit error rather than an empty report. The agent should never have to guess whether a result means “correct but empty” or “something went wrong.”

Line number tracking. Keeping accurate line numbers through diff hunks takes careful accounting. We track currentLine based on the hunk’s newStart and only increment for lines that appear in the new file (added lines and context lines, not removed lines). If this is wrong, the agent’s feedback points to the wrong lines, which is useless or actively misleading.
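That accounting rule can be sketched as a standalone function. addedLineNumbers is illustrative only, not part of the skill; it applies the same counting logic analyzeFileDiff uses.

```typescript
// Returns the new-file line number of every added ("+") line in a hunk.
// Context (" ") and added ("+") lines exist in the new file and advance
// the counter; removed ("-") lines do not.
function addedLineNumbers(newStart: number, lines: string[]): number[] {
  const result: number[] = [];
  let current = newStart;
  for (const line of lines) {
    if (line.startsWith("+")) result.push(current);
    if (!line.startsWith("-")) current++;
  }
  return result;
}

// For a hunk starting at new-file line 50 with lines
//   [" ok", "-old", "+new", " tail"]
// the added line "+new" lands at line 51 in the new file.
```

Walking it through: " ok" occupies line 50 and advances the counter, "-old" exists only in the old file and is skipped, so "+new" is recorded at 51.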

Extending this skill

Some natural extensions that follow the same composition pattern:

  • Accept rules as a parameter so different projects can enforce different standards
  • Return code patches alongside issues, which another skill could apply
  • Store past reviews and flag recurring issues
  • Fail a CI check if critical issues exceed a count, integrating with the pipeline patterns in the data pipeline example

Building a multi-stage skill like this teaches you something that single-shot skills don’t: the boundaries between stages matter more than the logic inside them. When each stage has a clean contract and its own tests, you can swap out the rule-based analyzer for an LLM-based one tomorrow without touching the diff parser or the report generator. That modularity is what makes a composed skill worth the extra structure compared to one big function that does everything.