Development
How developers work with AI day-to-day. From sidebar chat to fleet agents.
Coding Agent Usage
How your team uses AI coding assistants - from autocomplete to autonomous agent fleets.
- Copilot autocomplete
  How to use IDE autocomplete as your first step into AI-assisted development.
- Chat in sidebar, ad-hoc questions
  How to use the AI chat panel for one-off code questions and explanations before any systematic workflow exists.
- Agent runs without codebase context
  Understanding why AI tools at L1 see only what you show them - and why that's the core limitation the entire maturity journey addresses.
- 01 At least one AI coding assistant (Copilot, Cursor, Claude Code) is installed and active for at least one developer
- 02 AI autocomplete or chat is used at least once per week by the team
- 03 Developers have access to AI chat in their IDE sidebar
- 04 Team has experimented with AI-assisted code generation on non-critical tasks
- Agent in IDE with YOLO mode (Cursor 3.2, Windsurf, Claude Code with Opus 4.7)
  How to use autonomous AI agents with auto-approval enabled to execute multi-file changes without constant confirmation interrupts.
- CLAUDE.md / .cursorrules / AGENTS.md in repo
  How to write an agent instruction file that teaches AI tools about your project's conventions, patterns, and constraints - the single highest-leverage action at L2.
- Copilot + Claude Code in parallel
  How to use inline autocomplete and an agentic CLI tool simultaneously, each at the granularity it handles best.
- 01 At least one agentic IDE (Cursor, Windsurf, or Claude Code) is used by 50%+ of the team
- 02 CLAUDE.md, .cursorrules, or equivalent agent instruction file exists in 100% of active repositories
- 03 Agents operate in agentic/YOLO mode (multi-step edits without per-step approval)
- 04 Developers use two or more AI tools in parallel (e.g., Copilot + Claude Code)
- 05 Agent instruction files are reviewed and updated at least quarterly
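The "100% of active repositories" criterion above can be checked mechanically. The following is a minimal sketch, assuming all active repositories are checked out as immediate subdirectories of one folder; the function name and directory layout are illustrative, not part of any tool mentioned here.

```python
from pathlib import Path

# Filenames taken from the criteria above; extend to match your own tooling.
INSTRUCTION_FILES = ("CLAUDE.md", ".cursorrules", "AGENTS.md")


def instruction_file_coverage(repos_root: Path) -> tuple[float, list[str]]:
    """Return (coverage percentage, names of repos missing an instruction file).

    Assumption: every immediate subdirectory of repos_root is one active repo.
    """
    repos = sorted(p for p in repos_root.iterdir() if p.is_dir())
    missing = [
        repo.name
        for repo in repos
        if not any((repo / name).is_file() for name in INSTRUCTION_FILES)
    ]
    covered = len(repos) - len(missing)
    pct = 100.0 * covered / len(repos) if repos else 0.0
    return pct, missing
```

Running this in a scheduled job and reporting the `missing` list is one way to keep the criterion visible between quarterly reviews.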
- CLI agents (Claude Code with Opus 4.7 + xhigh effort) as primary
  How shifting from IDE plugins to CLI-based agents makes AI a programmable, scriptable part of your development workflow rather than a typing assistant.
- Rules files per-team/per-repo
  How to evolve from a single project-level CLAUDE.md to a layered system of context files tailored to each team's tech stack and conventions.
- Agent-aware coding conventions (explicit > implicit)
  How to rewrite your team's coding conventions to be machine-readable - precise, example-driven, and structured for AI consumption rather than human reading.
- 01 CLI agents (Claude Code, Codex) are the primary coding interface for 50%+ of feature work
- 02 Per-team or per-repo rules files exist and are maintained with code review
- 03 Coding conventions are written as explicit, agent-parseable rules (not implicit tribal knowledge)
- 04 Agent usage is tracked per developer and per repository
- 05 Agent instruction files follow a standardized template across the organization
- One-shot unattended agents (Stripe Minions, Cursor Automations)
  How to launch AI agents that run to completion autonomously - writing code, running tests, fixing errors, and producing a PR without human supervision.
- Slack/CLI/Web/PagerDuty invocation → PR
  How to trigger AI agent tasks from natural language interfaces - a Slack message, a CLI command, a web form - and have the agent autonomously produce a pull request.
- 3-5 parallel agents per developer (Cursor 3.2 /multitask async subagents, Windsurf)
  How to shift from sequential AI assistance to managing multiple concurrent agent instances - transforming the developer's role from implementer to orchestrator.
- 01 Unattended agents (Stripe Minions model, Cursor Automations) execute tasks without developer presence
- 02 Agents are invocable from at least two channels (Slack, CLI, Web, PagerDuty)
- 03 Each developer runs 3-5 parallel agent sessions concurrently
- 04 Agent task completion rate without human intervention exceeds 60%
- 05 Agent invocation produces a PR within a defined SLA (e.g., under 30 minutes for standard tasks)
- Multi-agent orchestration (Gas Town / custom)
  How to build systems where specialized agents collaborate - a planner decomposes tasks, workers execute them, and reviewers validate results - to handle complex engineering tasks end-to-end.
- Planner → Worker hierarchy
  How to structure a two-tier agent architecture where a planner decomposes engineering tasks and workers execute them in parallel - the canonical L5 pattern for complex autonomous development.
- Hundreds of agents on codebase, 1000+ commits/h
  The frontier of AI-assisted development: massive agent parallelization where hundreds of concurrent agents produce thousands of commits per hour on a single codebase.
- 01 Multi-agent orchestration system (planner-worker hierarchy) is in production
- 02 Agent fleet sustains 100+ concurrent agents on the codebase
- 03 Agent fleet produces 1,000+ commits per week without manual dispatch
- 04 Planner agents decompose epics into tasks and assign to worker agents autonomously
- 05 Agent fleet self-recovers from failures without human escalation for 90%+ of error cases
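The planner-worker hierarchy described above can be sketched in a few lines. This is only an illustration of the control flow: `plan` and `work` are hypothetical stand-ins for model-backed agents, and a thread pool stands in for whatever fleet scheduler a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(epic: str) -> list[str]:
    """Hypothetical planner: a real one would call a model to decompose the epic."""
    return [f"{epic}: step {i}" for i in range(1, 4)]


def work(task: str) -> str:
    """Hypothetical worker: a real one would edit code, run tests, and open a PR."""
    return f"done: {task}"


def run_epic(epic: str, max_workers: int = 3) -> list[str]:
    """Decompose an epic once, then execute the resulting tasks in parallel."""
    tasks = plan(epic)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in task order even though workers run concurrently.
        return list(pool.map(work, tasks))
```

Keeping the planner as the only component that sees the whole epic is the design point: workers receive one narrow task each, which keeps their context small and their failures isolated.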
Context Engineering
What information agents receive about your codebase, architecture, and conventions.
- Agent works from the currently open file
  Why AI agents at L1 are flying blind - they see only the open file, with no awareness of your project's architecture, conventions, or dependencies.
- Project knowledge lives with people, not docs
  At L1, the rules governing your codebase live in developers' memories - invisible to AI agents, fragile under turnover, and impossible to scale.
- Onboarding leans on the existing README and people
  Stale documentation is more than a developer experience problem - it actively prevents AI agents from using docs as context and signals that the team hasn't invested in machine-readable knowledge.
- 01 The agent can read the file(s) the developer is working on
- 02 Developers can supply the agent with project context when needed
- 03 README.md exists (may be incomplete)
- 04 Developers manually paste context into AI chat when needed
- CLAUDE.md with basic project info
  Adding a CLAUDE.md to your repository root is the single highest-leverage context engineering investment - it gives every AI agent the minimum viable information about what your project is and how it works.
- Written coding conventions
  Documenting your team's semantic coding decisions - not just style rules - gives AI agents the judgment framework they need to suggest code that fits your architecture, not just code that compiles.
- Agent instruction files in repo (CLAUDE.md, AGENTS.md, .agent.md, DESIGN.md replacing Figma exports)
  Agent instruction files - CLAUDE.md, .cursorrules, copilot-instructions.md - have become a standard software project artifact, with 60,000+ repositories on GitHub already containing them.
- 01 CLAUDE.md or equivalent exists with project description, tech stack, and top conventions
- 02 Written coding conventions document exists and is referenced from agent instruction files
- 03 Agent instruction files are committed to the repository (not local-only)
- 04 CLAUDE.md includes explicit prohibitions (banned libraries, anti-patterns)
- 05 Agent instruction files are reviewed as part of the standard PR process
- MCP servers: architecture, ownership, SLA (universal standard)
  Model Context Protocol servers are the infrastructure layer of context engineering at L3 - dedicated services that give AI agents structured, real-time access to your organization's knowledge.
- 5-level context: System, Code, Org, Historical, Operational
  A structured framework for the five types of context an AI agent needs to make good decisions - and a diagnostic tool for understanding why your agents keep getting things wrong.
- Context budgeting (handoff.md, .claudeignore, KV cache TTL aware)
  AI models have finite context windows - context budgeting is the practice of deliberately allocating that budget across context types to maximize agent effectiveness per token spent.
- 01 MCP servers provide structured context (architecture, ownership, SLAs) to agents
- 02 Context is organized across at least 3 of the 5 levels: System, Code, Org, Historical, Operational
- 03 Token budget management is implemented (agents receive context within defined token limits)
- 04 Context sources are versioned and tested for correctness
- 05 Context budgeting policy defines priority order when token limits are reached
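A context budgeting policy with a priority order can be expressed as a small packing routine. The sketch below is an assumption-laden illustration: the four-characters-per-token estimate is a rough heuristic, not a real tokenizer, and the `(priority, name, text)` shape is hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: about four characters per token. Swap in a real tokenizer."""
    return max(1, len(text) // 4)


def pack_context(sources: list[tuple[int, str, str]], budget: int) -> list[str]:
    """Select context sources in priority order until the token budget is spent.

    sources: (priority, name, text) tuples; a lower priority number wins.
    Returns the names of the sources that fit, in priority order.
    """
    included: list[str] = []
    used = 0
    for _, name, text in sorted(sources, key=lambda s: s[0]):
        cost = estimate_tokens(text)
        # A source that does not fit is skipped, and smaller
        # lower-priority sources may still be included after it.
        if used + cost <= budget:
            included.append(name)
            used += cost
    return included
```

For example, with a budget of 80 tokens and sources costing 50 (priority 1), 100 (priority 2), and 25 (priority 3), the routine keeps the first and third and drops the second.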
- BYOC: org PUSHES context to the agent
  At L4, organizations flip the context model - instead of agents pulling context on demand, the org pre-assembles and pushes a rich context package to agents at task start, eliminating discovery latency and ensuring consistency.
- Knowledge graph (Graph Buddy / CodeTale / pitlane-mcp)
  Semantic knowledge graphs of codebases - built by tools like Graph Buddy and CodeTale - give AI agents structural understanding of your codebase without requiring them to read every file.
- Spec-Driven Development: AGENTS.md as shared spec interface (Andrew Ng + JetBrains, Thoughtworks)
  At L4, AI agents transform Jira or Linear tickets into structured specifications and failing acceptance tests before a developer touches the implementation - creating a context artifact that guides the entire downstream workflow.
- 01 Organization pushes context to agents automatically (BYOC - Bring Your Own Context)
- 02 Knowledge graph (Graph Buddy, CodeTale, or equivalent) is integrated with agent context pipeline
- 03 Ticket-to-spec automation generates acceptance tests from requirements without manual writing
- 04 Context push triggers on repository events (commit, PR, deploy) without manual refresh
- 05 Knowledge graph covers 80%+ of active repositories
- Persistent agent identity + memory (Beads/Git, Kairos/Dream Mode 4-stage consolidation)
  At L5, agents maintain persistent identity and memory across sessions using structured memory files committed to git - accumulating codebase-specific institutional knowledge that makes them progressively more effective over time.
- Production telemetry → context auto-update
  At L5, agent context updates automatically based on production signals - when a service degrades, agents working on related code receive updated operational context without manual intervention.
- Self-healing context: agent detects stale docs, updates
  At L5, agents monitor context quality continuously - detecting when CLAUDE.md entries, README sections, or architecture diagrams no longer match the codebase, and automatically proposing or applying updates.
- 01 Agents maintain persistent identity and memory across sessions (Beads/Git-backed)
- 02 Production telemetry feeds back into agent context automatically (deploy, error, performance data)
- 03 Agents detect stale documentation and update it without human initiation
- 04 Agent memory persists architectural decisions and their rationale across sessions
- 05 Self-healing context updates are validated by automated tests before commit
Code Review & Quality
How AI-generated code is reviewed, validated, and approved before merging. AI code has 1.7x more issues and 2.74x more security vulnerabilities - review is now critical infrastructure.
- Every PR gets human review
  Why relying entirely on human code review creates a quality bottleneck that scales poorly and establishes the baseline every higher maturity level is designed to escape.
- Review turnaround measured in hours
  The 2-hour average wait for code review feedback is a measurable symptom of L1's structural review problem - and a concrete baseline to improve against.
- AI and human code share one review path
  At L1, teams can't see how much of their codebase is AI-generated - making it impossible to measure adoption, calibrate review depth, or understand quality patterns.
- 01 All code is reviewed by a human before merge
- 02 Basic CI checks run on changes
- 03 Code review turnaround is tracked (even if slow)
- 04 Team is aware that AI-generated code has higher defect rates (1.7x issues, 2.74x security vulnerabilities)
- AI-assisted review suggestions (CodeRabbit, Qodo 2.0)
  Using AI tools to generate a first-pass review frees human reviewers from routine checks and focuses their attention on architecture and business logic.
- Basic linter rules
  A shared linter configuration enforced in CI is the fastest, cheapest quality investment a team can make - and the essential foundation for every higher-level quality automation.
- Diff awareness - reviewer knows it's AI code; track thinking-length and files-read-before-edit per session
  When reviewers know which parts of a PR are AI-generated, they can calibrate review depth to match the actual risk - spending more time on business logic and less on syntax.
- 01 AI-assisted review tool (CodeRabbit, Qodo, or equivalent) is active on all repositories
- 02 Linter rules are configured and run in CI on every PR
- 03 PRs clearly indicate whether code is AI-generated or AI-assisted (labels, tags, or commit metadata)
- 04 AI review suggestions are triaged (accepted/rejected) rather than ignored
- 05 Linter configuration is committed to the repository and versioned
- Lint-as-architecture (standards = enforced rules)
  Custom lint rules that enforce architectural decisions turn code review comments into machine-checked constraints - so architectural violations are caught at CI time, not weeks later in production.
- AI review agent as first pass (with self-verification on Opus 4.7)
  A dedicated AI review agent that automatically reviews every PR before human reviewers are notified transforms review from a bottleneck into a parallel, always-available quality gate.
- Architecture guardrails: Bug → Codify → Lint Rule
  A systematic process for converting recurring bugs into permanent lint rules turns each incident into an improvement to the quality gate, progressively hardening the codebase against known failure modes.
- 01 AI review agent runs as a first-pass reviewer on every PR before human review
- 02 Lint rules enforce architectural standards (not just style) - the "Bug to Codify to Lint Rule" pipeline is active
- 03 At least 3 architectural guardrail rules have been created from past bugs or incidents
- 04 AI review agent findings are categorized by severity (info, warning, blocking)
- 05 New lint rules are proposed automatically when recurring review comments are detected
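To make the "Bug → Codify → Lint Rule" step concrete, here is a minimal sketch of an architectural lint rule built on Python's standard `ast` module. The rule itself (banning direct imports of `requests`) is a hypothetical example of a guardrail codified after an incident, not a recommendation from the source.

```python
import ast

# Hypothetical rule codified from a past incident: no direct use of `requests`.
BANNED_MODULES = {"requests"}


def find_banned_imports(source: str, filename: str = "<string>") -> list[str]:
    """Return one violation message per banned import found in the source."""
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare only the top-level package, so `requests.adapters` also matches.
            if name.split(".")[0] in BANNED_MODULES:
                violations.append(f"{filename}:{node.lineno}: banned import '{name}'")
    return violations
```

Wired into CI as a blocking check, a rule like this turns a recurring review comment into a constraint that no longer depends on a reviewer remembering it.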
- Green/Yellow/Red auto-evaluation
  A traffic-light quality evaluation system that replaces binary pass/fail CI with a nuanced, policy-driven assessment - enabling selective automation and focusing human review where it genuinely adds value.
- Green = auto-merge (fully algorithmic)
  Automatically merging PRs that meet all quality criteria removes human review from the critical path for routine changes - and is the step that most teams find psychologically difficult but transformatively efficient.
- Policy-based auto-approval: 60%+ Green target (do NOT use raw benchmark scores - Berkeley Apr 12 showed 8/8 reward-hackable)
  Setting a 60%+ Green rate as an organizational policy turns code quality into a measurable team KPI - and makes the auto-merge system self-reinforcing as teams work to qualify more PRs.
- 01 Automated Green/Yellow/Red classification runs on every PR
- 02 Green-classified PRs auto-merge without human review
- 03 Auto-approve rate target of 60%+ Green PRs is tracked and reported
- 04 Yellow PRs receive expedited human review (within 1 hour)
- 05 Classification model accuracy is validated monthly against human review outcomes
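A Green/Yellow/Red policy can start as a few explicit rules before any model is involved. The sketch below is illustrative only: the `PullRequest` fields, the 200-line threshold, and the notion of "sensitive paths" are assumptions, and a real policy would be tuned against human review outcomes as the criteria above require.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    checks_passed: bool            # tests and linters all green in CI
    lines_changed: int
    touches_sensitive_paths: bool  # hypothetical flag, e.g. auth or schema migrations


def classify(pr: PullRequest, max_green_lines: int = 200) -> str:
    """Return 'red', 'yellow', or 'green' under an illustrative policy."""
    if pr.touches_sensitive_paths:
        return "red"      # architectural or security-sensitive: full human review
    if not pr.checks_passed or pr.lines_changed > max_green_lines:
        return "yellow"   # expedited human review
    return "green"        # eligible for auto-merge


def green_rate(prs: list[PullRequest]) -> float:
    """Percentage of PRs classified Green, for tracking against the 60% target."""
    if not prs:
        return 0.0
    return 100.0 * sum(classify(pr) == "green" for pr in prs) / len(prs)
```

Note that every input here is a behavioural or structural signal from the PR itself; none is a benchmark score.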
- Agent fleet self-reviews (Cursor model: error → fix → converge)
  At fleet scale, agents review and correct their own work in tight feedback loops - running tests, observing failures, fixing root causes, and iterating until the code converges to a passing state without human intervention.
- Human review only for Red (architectural)
  At L5, human engineering attention is reserved exclusively for Red PRs - architectural changes, security-sensitive modifications, and business logic decisions that automated systems can't confidently evaluate.
- Continuous auto-refactoring in background
  Background agents that continuously identify and execute code quality improvements - extracting duplication, simplifying complexity, updating deprecated APIs - eliminate technical debt accumulation without dedicated refactoring sprints.
- 01 Agent fleet self-reviews code (error-fix-converge loop) before submitting for merge
- 02 Human review is limited to Red-classified PRs (architectural decisions only)
- 03 Continuous auto-refactoring runs in background without human initiation
- 04 Agent self-review catches 90%+ of issues that would be found by human review
- 05 Auto-refactoring PRs are tracked separately and have their own quality metrics
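The error → fix → converge loop above reduces to a bounded retry around two operations. This sketch assumes the caller supplies both as callables (`run_tests` returning a list of failure names, `apply_fix` attempting a repair); those names and the iteration cap are hypothetical.

```python
from typing import Callable


def converge(
    run_tests: Callable[[], list[str]],
    apply_fix: Callable[[list[str]], None],
    max_iterations: int = 5,
) -> tuple[bool, int]:
    """Run tests, hand failures to the fixer, repeat until green or the cap is hit.

    Returns (converged, number of fix attempts made).
    """
    for attempt in range(max_iterations):
        failures = run_tests()
        if not failures:
            return True, attempt
        apply_fix(failures)
    # One final run decides whether the last fix attempt succeeded.
    return not run_tests(), max_iterations
```

The cap matters as much as the loop: an agent that cannot converge within the budget should escalate rather than keep editing.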
Testing Strategy
How tests are written, maintained, and validated in an AI-assisted workflow. A test "oracle" is the source of truth for what a test's correct result should be.
- Tests written by hand
  The baseline testing state at L1 - manual test writing, chronic under-coverage, and the compounding debt that makes AI-generated code increasingly risky to ship.
- Flaky tests are a recurring cost
  Flaky tests silently consume a sixth of your engineering capacity - Google's internal research quantified the cost, and at L1 they're accepted as normal instead of eliminated.
- AI tests treat current output as "correct" - no independent oracle for the intended result
  When AI generates tests by reading your implementation, it encodes existing behavior as correct - including bugs - giving you coverage numbers that feel like safety but provide none.
- 01 An automated test suite exists and runs
- 02 The team writes and maintains its own tests
- 03 Team is aware of flaky test impact (16% of dev time per Google data)
- 04 AI-generated tests are reviewed for circular testing (testing what code does, not what it should do)
- Agent-generated unit tests + human acceptance tests
  A hybrid testing strategy at L2 that uses AI to generate unit test scaffolding at scale while keeping business-behavior verification in human hands.
- Flaky test quarantine
  A systematic L2 process for removing flaky tests from the main CI signal without deleting them - creating accountability to fix them while keeping builds reliable.
- Humans define expected results for key paths (acceptance tests are the oracle)
  Fixing flaky tests at the root by replacing fragile implementation-coupled assertions with stable, behavior-level oracles that reliably distinguish real failures from noise.
- 01 Agents generate unit tests; humans write acceptance tests
- 02 Flaky test quarantine process is active (flaky tests are isolated, not deleted)
- 03 Humans define the expected results for important paths (not just snapshotting current output)
- 04 Flaky test count is tracked and reported weekly
- 05 Quarantined tests have a resolution SLA (e.g., fix or delete within 30 days)
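A quarantine process needs a definition of "flaky" that a script can apply. The sketch below uses one common working definition, stated here as an assumption: a test is flaky if it both passed and failed on the same commit. The `(test_name, commit_sha, passed)` record shape is hypothetical.

```python
from collections import defaultdict


def find_flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """Return tests that both passed and failed on at least one single commit.

    runs: (test_name, commit_sha, passed) records from CI history.
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    # Two distinct outcomes for the same code means the result is not deterministic.
    return {test for (test, _), seen in outcomes.items() if len(seen) == 2}
```

The returned set is the candidate quarantine list; its size is also the weekly flaky test count the criteria above ask for.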
- Expected results come from requirements (tickets/specs are the oracle, not the code)
  The Test Oracle Reliability Score (TORS) quantifies what percentage of test failures are real bugs - at L3, achieving 90%+ is the prerequisite for trusting automated quality gates and enabling AI-driven testing workflows.
- Acceptance tests from tickets (Autonomous Requirements)
  At L3, AI agents read requirements tickets and generate failing acceptance tests before implementation begins - making requirements machine-executable and eliminating circular testing at scale.
- Incremental test selection (only changed paths)
  Running only the tests affected by a given code change - using dependency graph analysis - so CI feedback stays fast as the codebase grows to millions of lines.
- 01 Expected results are derived from requirements/specs (the requirement is the oracle, not the code)
- 02 Acceptance tests are auto-generated from ticket requirements (Autonomous Requirements pipeline)
- 03 Incremental test selection runs only tests affected by changed code paths
- 04 Oracle reliability is reviewed per service, not just overall
- 05 Test generation from tickets includes edge cases, not just happy paths
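Incremental test selection via dependency graph analysis can be sketched as a reverse-reachability search. The example assumes the import graph is already available as a plain dictionary; how that graph is extracted from a real codebase is out of scope here, and all names are illustrative.

```python
from collections import deque


def select_tests(
    changed: set[str],
    imports: dict[str, set[str]],
    tests: set[str],
) -> set[str]:
    """Return the test modules that depend, directly or transitively, on a change.

    imports maps each module to the modules it imports directly.
    """
    # Invert the graph: module -> modules that import it.
    dependents: dict[str, set[str]] = {}
    for module, deps in imports.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)

    # Breadth-first walk from the changed modules up through their dependents.
    affected, queue = set(changed), deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected & tests
```

For instance, if `test_api` imports `api`, which imports `db`, then a change to `db` selects `test_api` but not an unrelated `test_ui`.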
- A failing test reliably means a real defect (oracles trusted to gate releases)
  At L4, raising the Test Oracle Reliability Score to 95%+ is the prerequisite for trusting automated merge decisions - where 1-in-20 false positives is the maximum the system can tolerate.
- Agent iterates tests to green in sandbox (doesn't block team CI)
  AI agents fix failing tests in isolated sandboxes - running their own private CI loop - so agent work-in-progress never pollutes the shared pipeline or slows the team.
- Mutation testing agent validation
  At L4, AI agents use mutation testing to verify that agent-generated test suites actually catch bugs - not just execute code - before those tests enter the shared codebase.
- 01 A failing test reliably indicates a real defect (oracle false-positives are rare)
- 02 Agents iterate tests to green in isolated sandbox CI without blocking team CI queue
- 03 Mutation testing validates that tests catch real defects (not just achieve coverage)
- 04 Sandbox CI iteration count per PR is tracked (ITS target: 1-3)
- 05 Mutation testing kill rate exceeds 80%
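Both L4 thresholds above are simple ratios, so the arithmetic is worth writing down. This sketch treats the oracle reliability score as the share of test failures that were real defects, and the kill rate as the share of mutants the suite detected; the function names and input counts are hypothetical, and how failures get labelled as "real" is left to your triage process.

```python
def ratio_pct(numerator: int, denominator: int) -> float:
    """Percentage helper that returns 0.0 for an empty denominator."""
    return 100.0 * numerator / denominator if denominator else 0.0


def oracle_reliability(real_defect_failures: int, total_failures: int) -> float:
    """Share of test failures that pointed at a real defect, in percent."""
    return ratio_pct(real_defect_failures, total_failures)


def mutation_kill_rate(killed_mutants: int, total_mutants: int) -> float:
    """Share of injected mutants that caused at least one test to fail, in percent."""
    return ratio_pct(killed_mutants, total_mutants)


def gates_met(tors: float, kill_rate: float) -> bool:
    """Apply the thresholds stated above: 95%+ reliability, kill rate above 80%."""
    return tors >= 95.0 and kill_rate > 80.0
```

With 19 real defects out of 20 failures the reliability score is exactly 95%, which is the 1-in-20 false positive ceiling the text describes.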
- Self-healing test suite
  At L5, AI agents continuously maintain the test suite - detecting and fixing flaky tests, updating tests broken by intentional refactors, and generating tests for uncovered paths - so humans set quality policy rather than doing quality work.
- Production logs → auto-generated regression tests
  At L5, agents mine production errors to automatically generate regression tests - capturing the exact inputs that caused real failures so they can never reach production again.
- Agent detects edge case → writes test → fixes bug → ships
  The fully autonomous quality loop at L5: an agent finds an edge case, writes a failing test, fixes the bug, verifies all tests pass, and submits the PR without any human involvement in the cycle.
- 01 Test suite is self-healing (agent detects broken tests, diagnoses root cause, fixes without human input)
- 02 Production logs automatically generate regression tests for observed failures
- 03 Agents detect edge cases, write tests, fix bugs, and ship - full autonomous loop
- 04 Self-healing test updates are validated by mutation testing before merge
- 05 Production-to-test pipeline latency is under 1 hour (failure observed to regression test committed)
The May 2026 zeitgeist is trust, but audit.
April delivered three uncomfortable proofs: Stella Laurenzo's audit of 6,852 Claude Code sessions measured a 73% collapse in median thinking length (2,200 -> 600 chars) and a drop from 6.6 to 2.0 files read before edit; Anthropic's April 23 postmortem confirmed that harness and system-prompt changes - not the model - caused weeks of regression; UC Berkeley showed all eight major agent benchmarks are reward-hackable to ~100%. The takeaway is not "AI got worse." The takeaway is that capability lives in the harness, the prompt and the permission layer - and you need to measure it like any other production system.
This shifts the bar at every level. L2 review now means tracking thinking-length and files-read-before-edit per session, not just diff awareness. L3 AI-review agents must self-verify their outputs (Opus 4.7 makes this mainstream). L4 auto-approval policies must NOT be derived from benchmark scores - they need behavioural signals. The model is better than ever (Opus 4.7: 87.6% SWE-bench Verified, Mythos Preview: 93.9%). The tooling is better than ever (Cursor 3.2 /multitask, Kairos/Dream Mode in Claude Code). The new gap is observability. Start there.
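Tracking thinking-length and files-read-before-edit per session comes down to aggregating two numbers from session logs. The sketch below assumes each session is already parsed into a dictionary with the hypothetical keys `thinking_chars` and `files_read_before_edit`; no real log format is implied, and using the median for both metrics is a choice made here for illustration.

```python
from statistics import median


def session_metrics(sessions: list[dict]) -> dict[str, float]:
    """Aggregate the two behavioural signals across a batch of agent sessions."""
    return {
        "median_thinking_chars": median(s["thinking_chars"] for s in sessions),
        "median_files_read_before_edit": median(
            s["files_read_before_edit"] for s in sessions
        ),
    }


def relative_drop_pct(before: float, after: float) -> float:
    """Percentage decline from a baseline value to a current value."""
    return 100.0 * (before - after) / before
```

Applied to the figures quoted above, `relative_drop_pct(2200, 600)` is about 72.7, which rounds to the reported 73% collapse; alerting on a drop of that size between two weekly batches is one way to treat the harness like any other production system.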