Maturity Matrix

Development

How developers work with AI day to day, from sidebar chat to agent fleets.

4 capabilities · 20 levels · 60 practices · 60 guides
The matrix · at a glance

Maturity levels:
  L1 · Stage 01 · Ad-hoc
  L2 · Stage 02 · Guided
  L3 · Stage 03 · Systematic
  L4 · Stage 04 · Optimized (sweet spot)
  L5 · Stage 05 · Autonomous

Capabilities:
  01 Coding Agent Usage
  02 Context Engineering
  03 Code Review & Quality
  04 Testing Strategy

Capability 01 · Development

Coding Agent Usage

How your team uses AI coding assistants - from autocomplete to autonomous agent fleets.

L1 · Stage 01 · Ad-hoc
Criteria - what to measure
  1. At least one AI coding assistant (Copilot, Cursor, Claude Code) is installed and active for at least one developer
  2. AI autocomplete or chat is used at least once per week by the team
  3. Developers have access to AI chat in their IDE sidebar
  4. Team has experimented with AI-assisted code generation on non-critical tasks
L2 · Stage 02 · Guided
Criteria - what to measure
  1. At least one agentic IDE (Cursor, Windsurf, or Claude Code) is used by 50%+ of the team
  2. CLAUDE.md, .cursorrules, or equivalent agent instruction file exists in 100% of active repositories
  3. Agents operate in agentic/YOLO mode (multi-step edits without per-step approval)
  4. Developers use two or more AI tools in parallel (e.g., Copilot + Claude Code)
  5. Agent instruction files are reviewed and updated at least quarterly
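
The 100% coverage criterion above can be checked mechanically. What follows is a minimal sketch, assuming every active repository is checked out under a local path and that the instruction file sits at the repository root; the function name `instruction_file_coverage` and the file list are illustrative, not part of the matrix.

```python
from pathlib import Path

# File names taken from the criterion above; extend this tuple with whatever
# "equivalent" instruction files your own agents read (assumption).
INSTRUCTION_FILES = ("CLAUDE.md", ".cursorrules")


def instruction_file_coverage(repo_roots):
    """Return (coverage fraction, list of repo paths missing an instruction file)."""
    roots = [Path(r) for r in repo_roots]
    missing = [
        root for root in roots
        if not any((root / name).is_file() for name in INSTRUCTION_FILES)
    ]
    coverage = (len(roots) - len(missing)) / len(roots) if roots else 0.0
    return coverage, missing
```

A team would meet the criterion when the returned coverage equals 1.0; the list of missing repositories is the to-do list.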
L3 · Stage 03 · Systematic
Criteria - what to measure
  1. CLI agents (Claude Code, Codex) are the primary coding interface for 50%+ of feature work
  2. Per-team or per-repo rules files exist and are maintained with code review
  3. Coding conventions are written as explicit, agent-parseable rules (not implicit tribal knowledge)
  4. Agent usage is tracked per developer and per repository
  5. Agent instruction files follow a standardized template across the organization
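
Tracking usage per developer and per repository, as the criteria above ask, can start as a simple aggregation over session events. This sketch assumes each event is a dict with hypothetical `developer` and `repo` keys; how those events are collected depends on your tooling and is not covered here.

```python
from collections import Counter


def usage_by(events, key):
    """Count agent sessions grouped by one field of the event (e.g. 'developer' or 'repo')."""
    return Counter(event[key] for event in events)


# Hypothetical event log: one dict per agent session.
events = [
    {"developer": "ana", "repo": "billing"},
    {"developer": "ana", "repo": "search"},
    {"developer": "raj", "repo": "billing"},
]

per_developer = usage_by(events, "developer")
per_repo = usage_by(events, "repo")
```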
L4 · Stage 04 · Optimized · Most teams aim here
Criteria - what to measure
  1. Unattended agents (Stripe Minions model, Cursor Automations) execute tasks without developer presence
  2. Agents are invocable from at least two channels (Slack, CLI, Web, PagerDuty)
  3. Each developer runs 3-5 parallel agent sessions concurrently
  4. Agent task completion rate without human intervention exceeds 60%
  5. Agent invocation produces a PR within a defined SLA (e.g., under 30 minutes for standard tasks)
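
The completion-rate and SLA criteria above reduce to two ratios over a task log. The sketch below assumes a hypothetical `AgentTask` record with two fields; the 30-minute default mirrors the example SLA in the criterion and should be replaced with your own.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class AgentTask:
    needed_human: bool               # True if a person had to step in
    time_to_pr: Optional[timedelta]  # None if the task never produced a PR


def l4_metrics(tasks, sla=timedelta(minutes=30)):
    """Return the unattended completion rate and the share of tasks with a PR inside the SLA."""
    total = len(tasks)
    if total == 0:
        return {"completion_rate": 0.0, "sla_hit_rate": 0.0}
    unattended = sum(
        1 for t in tasks if not t.needed_human and t.time_to_pr is not None
    )
    within_sla = sum(
        1 for t in tasks if t.time_to_pr is not None and t.time_to_pr <= sla
    )
    return {
        "completion_rate": unattended / total,
        "sla_hit_rate": within_sla / total,
    }
```

Under these assumptions, the L4 completion criterion holds when `completion_rate` is above 0.60.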
L5 · Stage 05 · Autonomous
Criteria - what to measure
  1. Multi-agent orchestration system (planner-worker hierarchy) is in production
  2. Agent fleet sustains 100+ concurrent agents on the codebase
  3. Agent fleet produces 1,000+ commits per week without manual dispatch
  4. Planner agents decompose epics into tasks and assign to worker agents autonomously
  5. Agent fleet self-recovers from failures without human escalation for 90%+ of error cases
Capability 02 · Development

Context Engineering

What information agents receive about your codebase, architecture, and conventions.

L2 · Stage 02 · Guided
L3 · Stage 03 · Systematic
Capability 03 · Development

Code Review & Quality

How AI-generated code is reviewed, validated, and approved before merging. AI code has 1.7x more issues and 2.74x more security vulnerabilities - review is now critical infrastructure.

L2 · Stage 02 · Guided
Criteria - what to measure
  1. AI-assisted review tool (CodeRabbit, Qodo, or equivalent) is active on all repositories
  2. Linter rules are configured and run in CI on every PR
  3. PRs clearly indicate whether code is AI-generated or AI-assisted (labels, tags, or commit metadata)
  4. AI review suggestions are triaged (accepted/rejected) rather than ignored
  5. Linter configuration is committed to the repository and versioned
L3 · Stage 03 · Systematic
Criteria - what to measure
  1. AI review agent runs as a first-pass reviewer on every PR before human review
  2. Lint rules enforce architectural standards (not just style) - the "Bug to Codify to Lint Rule" pipeline is active
  3. At least 3 architectural guardrail rules have been created from past bugs or incidents
  4. AI review agent findings are categorized by severity (info, warning, blocking)
  5. New lint rules are proposed automatically when recurring review comments are detected
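
The three severity levels named in the criteria above map naturally onto an ordered enum and a merge gate. This is a sketch under the assumption that the review agent emits findings as `(severity, message)` pairs; `Severity` and `review_gate` are illustrative names, not the API of any review tool.

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    BLOCKING = 2


def review_gate(findings):
    """Return (merge_blocked, counts per severity) for a list of (Severity, message) pairs."""
    counts = Counter(severity for severity, _ in findings)
    blocked = counts[Severity.BLOCKING] > 0
    summary = {level.name.lower(): counts[level] for level in Severity}
    return blocked, summary
```

Keeping the counts alongside the boolean makes it possible to report warning trends per repository even when nothing blocks the merge.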
Capability 04 · Development

Testing Strategy

How tests are written, maintained, and validated in an AI-assisted workflow. A test "oracle" is the source of truth for what a test's correct result should be.

L2 · Stage 02 · Guided
Criteria - what to measure
  1. Agents generate unit tests; humans write acceptance tests
  2. Flaky test quarantine process is active (flaky tests are isolated, not deleted)
  3. Humans define the expected results for important paths (not just snapshotting current output)
  4. Flaky test count is tracked and reported weekly
  5. Quarantined tests have a resolution SLA (e.g., fix or delete within 30 days)
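
The quarantine SLA above is easy to enforce once quarantine dates are recorded. The sketch below assumes a mapping from test ID to the date the test was quarantined; the 30-day default mirrors the example in the criterion, and the test IDs are made up.

```python
from datetime import date, timedelta


def overdue_quarantined(quarantined, today, sla_days=30):
    """Return sorted IDs of tests that have sat in quarantine longer than the SLA."""
    limit = timedelta(days=sla_days)
    return sorted(
        test_id
        for test_id, since in quarantined.items()
        if today - since > limit
    )


# Hypothetical quarantine register: test ID -> date it was quarantined.
register = {
    "test_checkout_retry": date(2026, 4, 1),
    "test_search_paging": date(2026, 5, 20),
}
overdue = overdue_quarantined(register, today=date(2026, 5, 31))
```

The weekly flaky-test report could include both the size of the register and the overdue list.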
L3 · Stage 03 · Systematic
L4 · Stage 04 · Optimized · Most teams aim here
L5 · Stage 05 · Autonomous
Criteria - what to measure
  1. Test suite is self-healing (agent detects broken tests, diagnoses root cause, fixes without human input)
  2. Production logs automatically generate regression tests for observed failures
  3. Agents detect edge cases, write tests, fix bugs, and ship - full autonomous loop
  4. Self-healing test updates are validated by mutation testing before merge
  5. Production-to-test pipeline latency is under 1 hour (failure observed to regression test committed)
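
Two of the criteria above are numeric gates. A self-healing change that "fixes" a test by weakening its assertions tends to kill fewer mutants, so one plausible check is that the mutation score must not drop. The sketch assumes your mutation-testing tool reports killed and total mutant counts; the function names and the no-drop policy are illustrative choices, not something the matrix prescribes.

```python
from datetime import datetime, timedelta


def mutation_score(killed, total):
    """Fraction of mutants the test suite detects."""
    return killed / total if total else 0.0


def accept_self_healed_update(before, after):
    """Accept an agent's test fix only if the mutation score does not drop.

    `before` and `after` are (killed, total) pairs from the mutation run.
    """
    return mutation_score(*after) >= mutation_score(*before)


def pipeline_latency_ok(failure_seen, test_committed, limit=timedelta(hours=1)):
    """True if the regression test landed strictly under the limit after the failure."""
    return test_committed - failure_seen < limit
```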
Climb the matrix

You don't have to figure this out alone.

Every level in this matrix has a path. Read the playbooks written by teams that have already climbed it. Run the assessment with our consultants. Start where you are.

Live with Visdom

Book an AI Maturity Assessment session with your team.

We walk you through all four perspectives, score where you actually are, and leave you with a 90-day plan to climb in the dimensions that matter most.

Author Commentary

The May 2026 zeitgeist is trust, but audit.

April delivered three uncomfortable proofs: Stella Laurenzo's audit of 6,852 Claude Code sessions measured a 73% collapse in median thinking length (2,200 -> 600 chars) and a drop from 6.6 to 2.0 files read before edit; Anthropic's April 23 postmortem confirmed that harness and system-prompt changes - not the model - caused weeks of regression; UC Berkeley showed all eight major agent benchmarks are reward-hackable to ~100%. The takeaway is not "AI got worse." The takeaway is that capability lives in the harness, the prompt and the permission layer - and you need to measure it like any other production system.

This shifts the bar at every level. L2 review now means tracking thinking-length and files-read-before-edit per session, not just diff awareness. L3 AI-review agents must self-verify their outputs (Opus 4.7 makes this mainstream). L4 auto-approval policies must NOT be derived from benchmark scores - they need behavioural signals. The model is better than ever (Opus 4.7: 87.6% SWE-bench Verified, Mythos Preview: 93.9%). The tooling is better than ever (Cursor 3.2 /multitask, Kairos/Dream Mode in Claude Code). The new gap is observability. Start there.
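
The two behavioural signals named above, thinking length and files read before edit, can be tracked as per-session medians and compared against a baseline. This is a rough sketch, assuming session logs are already parsed into dicts with hypothetical `thinking_chars` and `files_read_before_edit` fields; the 50% drop threshold is an arbitrary starting point, not a recommendation from the commentary.

```python
from statistics import median


def session_signals(sessions):
    """Median thinking length and files-read-before-edit across a list of session dicts."""
    return {
        "median_thinking_chars": median([s["thinking_chars"] for s in sessions]),
        "median_files_read": median([s["files_read_before_edit"] for s in sessions]),
    }


def regressed_signals(baseline, current, max_drop=0.5):
    """Names of signals that fell by more than `max_drop` relative to the baseline."""
    return [
        name for name in baseline
        if current[name] < baseline[name] * (1 - max_drop)
    ]
```

Running this per week and per harness version would make a regression visible as a change in the agent's behaviour rather than only as a change in diff quality.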
