Log
The Agent Benchmark Log is the performance governance record for every deployed AI agent. Each agent has a baseline set at deployment, and periodic log entries are added over time. The tool compares each logged metric against its baseline and rates it green (holding), amber (degrading), or red (far off baseline), and it automatically raises a retirement flag on agents whose overall status is RED on two consecutive entries.
The log answers one question: is this agent still worth running? It does not pull live data; that is the Agent Metrics Dashboard's job. It is the structured record that makes performance decisions auditable.
- Resolution Rate (%): handled without escalation — higher is better
- Accuracy (%): correct or acceptable output rate — higher is better
- Time Saved / Week (hrs): vs manual baseline — higher is better
- Token Cost ($ per interaction): API or seat cost — lower is better
- Escalation Rate (%): routed to human after AI touch — lower is better
- User Satisfaction (1–5): average score — higher is better
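- Green: within normal variance of baseline (no amber threshold crossed)
- Amber: any one of, measured against baseline (pp = percentage points): −5pp accuracy/resolution, −10% time saved, +20% cost, +10pp escalation, −0.5 satisfaction
- Red: any one of, measured against baseline: −15pp accuracy/resolution, −25% time saved, +50% cost, +25pp escalation, −1.0 satisfaction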
- Overall RAG = worst single metric across all six
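The thresholds above can be sketched as a small lookup table plus one classification function. This is a minimal illustration, not the tool's actual code: the metric keys, the THRESHOLDS table, and the metricRag function are hypothetical names, and it assumes that "pp" thresholds are absolute differences, "%" thresholds are relative to the baseline value, and a threshold counts as crossed when the change is equal to or beyond it.

```javascript
// Hypothetical threshold table; names and structure are illustrative only.
// kind "pp"  = absolute change in percentage points
// kind "rel" = change as a percentage of the baseline value
// kind "abs" = absolute change in score points
// sign -1: a drop is adverse (higher is better); sign +1: a rise is adverse.
const THRESHOLDS = {
  resolution:   { kind: "pp",  sign: -1, amber: 5,   red: 15 },
  accuracy:     { kind: "pp",  sign: -1, amber: 5,   red: 15 },
  timeSaved:    { kind: "rel", sign: -1, amber: 10,  red: 25 },
  cost:         { kind: "rel", sign: +1, amber: 20,  red: 50 },
  escalation:   { kind: "pp",  sign: +1, amber: 10,  red: 25 },
  satisfaction: { kind: "abs", sign: -1, amber: 0.5, red: 1.0 },
};

// Returns "green", "amber", "red", or null when there is nothing to compare.
function metricRag(metric, baseline, value) {
  const t = THRESHOLDS[metric];
  if (!t || baseline == null || value == null) return null;
  if (t.kind === "rel" && baseline === 0) return null; // relative change undefined

  const delta = value - baseline;
  const change = t.kind === "rel" ? (delta / baseline) * 100 : delta;
  const adverse = change * t.sign; // positive = moved in the bad direction

  if (adverse >= t.red) return "red";     // assumption: thresholds are inclusive
  if (adverse >= t.amber) return "amber";
  return "green";
}
```

Under these assumptions, an accuracy of 85 against a baseline of 92 is a 7pp drop and rates amber, while a cost that rises from 0.04 to 0.07 per interaction is a 75% increase and rates red. Improvements never trigger a threshold because only adverse movement is measured.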
Email-client layout: fixed left sidebar listing agents, detail panel on the right for the selected agent.
The agent header shows the agent name, description, phase, owner, deploy date, model, and entry count, with the overall RAG badge on the right and the Edit and Delete actions.
The six baseline values set at deployment are the benchmarks every log entry is measured against. They are read-only after the initial set; if re-calibration is needed, click Edit on the agent header to revise them. Baseline changes affect all existing RAG calculations retroactively.
Six metric tiles show the values from the most recent log entry, each with its delta vs baseline, a trend arrow, and an individual RAG color. The tiles are only visible once at least one log entry exists.
The performance log table lists all log entries newest-first. Columns: RAG dot (overall status for that entry), date, all six metric values, notes (truncated, with full text on hover), logged-by, and a delete button per row.
- Click "+ Add Agent" in the top bar
- Fill in: name, phase, owner, deploy date, model, description
- Enter the six baseline metric values (can be left blank and added later via Edit)
- Click Save — agent appears in sidebar and is selected automatically
Four seed agents load on first open to demonstrate the interface.
Clear seed agents by clicking Delete on each, or clear all at once by removing the localStorage key "msp-benchmark-v1" in the browser console.
- Select the agent in the sidebar
- Click "+ Log Entry" above the performance log table
- Set date and logged-by initials
- Enter values for metrics you have data for — fields can be left blank
- Add notes: prompt changes, volume changes, incidents, or measurement context
- Click "Log Entry" — saved immediately, RAG tiles update
Not all six metrics need to be populated. Blank fields are excluded from RAG calculation — they do not drag the RAG down or pull it up.
No fixed cadence is enforced. Suggested starting point: monthly for stable agents, after every significant prompt or model change, and immediately after any incident. Retirement flag logic requires at least two entries to activate.
Overall RAG for an entry = worst single metric. One red metric makes the whole entry red.
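The "worst single metric" rule, together with the exclusion of blank fields, can be sketched as follows. This is an illustrative sketch, not the tool's implementation: the SEVERITY map and the overallRag function are hypothetical names, blank metrics are represented as null, and returning null when every metric is blank is an assumption about a case the text does not cover.

```javascript
// Hypothetical severity ranking used to pick the worst status.
const SEVERITY = { green: 0, amber: 1, red: 2 };

// statuses: per-metric results for one entry; null or undefined marks a blank field.
function overallRag(statuses) {
  let worst = null;
  for (const s of statuses) {
    if (s == null) continue; // blank metric: excluded, neither helps nor hurts
    if (worst === null || SEVERITY[s] > SEVERITY[worst]) worst = s;
  }
  return worst; // assumption: null when every metric is blank
}
```

With five green or blank metrics and one red metric the result is red, which is why a single badly degraded metric is enough to mark the whole entry.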
The retirement flag auto-triggers when the two most recent log entries both have overall RED status. A red banner appears on the agent detail panel.
- Click "Acknowledge" on the retirement banner to clear the flag
- If the next entry is also overall RED, the flag re-triggers
- If the next entry is amber or green, the flag stays cleared
- Sidebar shows ⚠ next to the agent name while flag is active
The retirement flag can also be set manually from the agent header actions. Useful for flagging an agent for qualitative concerns before two RED entries accumulate.
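One way to model the auto-trigger and Acknowledge behaviour described above is to remember how many entries existed when the flag was last acknowledged. This is a sketch under stated assumptions rather than the tool's actual logic: the function name retirementFlagActive and the ackCount field are hypothetical, entries are assumed to be ordered oldest first, and the manual flag is not modelled.

```javascript
// entries: overall RAG per log entry, oldest first (assumed ordering for this sketch).
// ackCount: hypothetical field holding entries.length at the moment of the last Acknowledge.
function retirementFlagActive(entries, ackCount = 0) {
  if (entries.length < 2) return false; // the flag needs at least two entries
  const [prev, last] = entries.slice(-2);
  const twoReds = prev === "red" && last === "red";
  // After an Acknowledge the flag stays cleared until a newer entry is logged.
  return twoReds && entries.length > ackCount;
}
```

With this model, acknowledging after two RED entries clears the flag, a third RED entry re-triggers it, and an amber or green entry leaves it cleared, matching the bullets above.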
When the Agent Metrics Dashboard syncs, this tool detects the payload on load and shows a green banner at the bottom of the page.
- Agent Metrics Dashboard writes payload to localStorage["msp-benchmark-sync"]
- Open this tool in the same browser — green banner appears immediately
- Banner shows: agent hint, source, brief description of what is pre-filled
- Click "Import and Pre-fill Entry" — tool matches agent hint to a loaded agent, selects it, opens Log Entry modal
- Resolution rate, escalation rate, time saved, cost are pre-filled
- Accuracy and satisfaction remain blank — enter from spot-check and survey
- Notes field pre-filled with sync source and date
- Click "Log Entry" — the sync key is cleared from localStorage after save
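The import flow above can be sketched as three small functions. Only the storage key "msp-benchmark-sync" comes from this document; everything else is an assumption made for illustration: the payload is taken to be JSON, the field names (agentHint, source, date, resolution, escalation, timeSaved, cost) are hypothetical, and the case-insensitive substring match on the agent name is only one plausible way to match the hint. The storage argument is any object with getItem and removeItem, such as window.localStorage.

```javascript
const SYNC_KEY = "msp-benchmark-sync"; // key named in the steps above

// Reads the pending sync payload, or returns null if none exists or it is not valid JSON.
function readSyncPayload(storage) {
  const raw = storage.getItem(SYNC_KEY);
  if (raw === null) return null;
  try {
    return JSON.parse(raw); // assumption: the payload is stored as JSON
  } catch {
    return null;
  }
}

// Matches the hint to a loaded agent and builds the pre-filled log entry.
function buildPrefill(payload, agents) {
  const hint = String(payload.agentHint ?? "").toLowerCase();
  const agent = hint
    ? agents.find((a) => a.name.toLowerCase().includes(hint)) ?? null
    : null;
  return {
    agentId: agent ? agent.id : null,
    entry: {
      resolution: payload.resolution ?? null,
      escalation: payload.escalation ?? null,
      timeSaved: payload.timeSaved ?? null,
      cost: payload.cost ?? null,
      accuracy: null,     // left blank: entered from spot-check
      satisfaction: null, // left blank: entered from survey
      notes: `Synced from ${payload.source} on ${payload.date}`,
    },
  };
}

// Called after the entry is saved (or the banner is dismissed).
function clearSync(storage) {
  storage.removeItem(SYNC_KEY);
}
```

Passing the storage object in, instead of reading the global localStorage directly, keeps the sketch testable outside a browser.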
To discard a pending sync without importing it, click "Dismiss" on the banner. The sync key is cleared; re-sync from the Agent Metrics Dashboard to get the payload back.
localStorage.removeItem("msp-benchmark-v1")