Platform Intelligence Dashboard — making agent-readiness visible

Situation

As AI coding agents became genuinely useful, our platform org faced a question it had no way to answer: which of our systems can agents (and junior engineers) safely make changes to, and which can’t? “Agent-readiness” was an abstract, hand-wavy idea, and any assessment of it risked dying the usual death — a one-off slide deck argued over in a committee and never looked at again. The deeper goal underneath the question: an environment where agents can be trusted with low-level coding tasks and reliably produce production-ready code, instead of every change needing a human review.

Decision / Approach

I made the abstract concrete in two moves. First, I defined agent-readiness as a scored benchmark — a small set of criteria (clear domain boundaries, documented APIs, standardised data models, observable state, deployment safety, test coverage, codebase simplicity), each mapped to a signal that can be measured rather than argued.

Second, I shipped the results as a live, auto-updating dashboard rather than a document — so the picture stays continuously current and shared, instead of a snapshot that’s stale the day it’s written. It surfaces (a) per-system benchmark scores across the core backend systems on a 0–5 maturity scale with green / amber / red bands, and (b) operational stats for the maintenance agents (a dead-code cleaner, an incident fixer, a dependency updater).

The dashboard is deliberately the measure layer of something larger: the same criteria are meant to be enforced through agents and extended through skills, so agent-readiness isn’t just observed — it’s established and held. That’s the path to trusting agents with low-level coding without reviewing every change.

What I deliberately didn’t do: wait for a polished, all-green launch. I shipped with one system fully scored and the rest visibly marked “not yet scored.” Honest scaffolding beats both over-promising (“all systems scored by date X”) and delaying until everything’s perfect.

Risk / What could break

The dashboard’s value rests entirely on trust in the numbers. If a criterion can be gamed, or a score goes stale, the signal degrades silently — and auto-update makes a broken upstream metric look authoritative. It’s also politically charged: a red score reads as a finger pointed at a team. Mitigations: every score traces back to a raw, re-runnable measurement; the scoring methodology is explicit and reviewable; and the framing is “where to invest next,” not “who is failing.”

Change / Outcome & learning

It became a shared, current view that aligns engineering leadership on where to invest in the platform — one picture instead of competing anecdotes — and the foundation for the larger arc from measuring agent-readiness to enforcing it. The lesson I’d carry forward: measurement only changes behaviour when it’s visible and effortless to check. A spreadsheet someone has to open and interpret dies; a URL that is always current gets used in the conversations that actually allocate work. If I did it again, I’d wire score-staleness alerts from day one rather than adding them after a number first drifted.