KB // Agent Benchmark

Knowledge Base · MSP Operations

Agent Benchmark
Log

Complete reference for the Agent Benchmark Log — the governance record for deployed AI agent performance. Tracks six benchmark metrics against deployment baselines, calculates RAG status automatically, surfaces retirement flags for underperforming agents, and accepts metric data synced from the Agent Metrics Dashboard.

PERFORMANCE TRACKING RAG STATUS RETIREMENT SIGNALS BENCHMARK SYNC

01What Is This Tool

The Agent Benchmark Log is the performance governance record for every deployed AI agent. Each agent has a baseline set at deployment and periodic log entries added over time. The tool calculates whether each metric is holding, degrading, or well below baseline — and automatically flags agents that hit RED status on two consecutive entries.

This answers the question: is this agent still worth running? It does not pull live data — that is the Agent Metrics Dashboard's job. This is the structured log that makes performance decisions auditable.

Six Tracked Metrics

Resolution Rate (%): handled without escalation — higher is better
Accuracy (%): correct or acceptable output rate — higher is better
Time Saved / Week (hrs): vs manual baseline — higher is better
Token Cost ($ per interaction): API or seat cost — lower is better
Escalation Rate (%): routed to human after AI touch — lower is better
User Satisfaction (1–5): average score — higher is better

RAG Thresholds

Green: within normal variance of baseline
Amber: −5pp accuracy/resolution, −10% time saved, +20% cost, +10pp escalation, −0.5 satisfaction
Red: −15pp accuracy/resolution, −25% time saved, +50% cost, +25pp escalation, −1.0 satisfaction
Overall RAG = worst single metric across all six

02Page Layout

Email-client layout: fixed left sidebar listing agents, detail panel on the right for the selected agent.

Left Sidebar

RAG DotGreen/amber/red circle showing overall current status. Grey = no entries yet.

Agent NameAppended with ⚠ if a retirement flag is active.

Phase ChipDeployment phase (01–04) associated with the agent.

Last DateMost recent log entry date, or "No entries".

SearchFilters agents by name as you type.

Detail Panel — Agent Header

Agent name, description, phase, owner, deploy date, model, and entry count. Overall RAG badge on the right. Edit and Delete actions.

⚠

Delete is permanent — removes the agent and all log entries. For end-of-life agents use the Agent Version Tracker archive workflow instead.

Detail Panel — Baseline Block

The six baseline values set at deployment. These are the benchmarks all log entries are measured against. Read-only after initial set — click Edit on the agent header to revise if re-calibration is needed. Baseline changes affect all existing RAG calculations retroactively.

Detail Panel — Current vs Baseline Tiles

Six metric tiles showing the most recent log entry values with delta vs baseline, trend arrow direction, and individual RAG color. Only visible once at least one log entry exists.

Detail Panel — Performance Log Table

All log entries newest-first. Columns: RAG dot (overall status for that entry), date, all six metric values, notes (truncated with full text on hover), logged-by, and a delete button per row.

03Managing Agents

Adding a New Agent

Click "+ Add Agent" in the top bar
Fill in: name, phase, owner, deploy date, model, description
Enter the six baseline metric values (can be left blank and added later via Edit)
Click Save — agent appears in sidebar and is selected automatically

ℹ

Baselines can be set later. Metric tiles show "No baseline set" and RAG shows grey until baselines are entered.

Seed Data

Four seed agents load on first open to demonstrate the interface:

IVR Triage AgentGreen. Three months of steady improvement.

NOC Alert SummarizerRed. Two consecutive red entries — retirement flag active.

Ticket Intake ClassifierGreen. Best performer. Cost flat, satisfaction highest.

Copilot Prompt AssistantAmber trending to red. Low adoption, rising cost.

Clear seed agents by clicking Delete on each, or clear all by removing localStorage["msp-benchmark-v1"] from the browser console.

04Logging Performance Entries

Adding an Entry

Select the agent in the sidebar
Click "+ Log Entry" above the performance log table
Set date and logged-by initials
Enter values for metrics you have data for — fields can be left blank
Add notes: prompt changes, volume changes, incidents, or measurement context
Click "Log Entry" — saved immediately, RAG tiles update

ℹ

Entry modal pre-fills placeholder text with the baseline value (e.g. "Baseline: 82.0") so you can see what you are comparing against as you type.

Partial Entries

Not all six metrics need to be populated. Blank fields are excluded from RAG calculation — they do not drag the RAG down or pull it up.

Recommended Cadence

No fixed cadence is enforced. Suggested starting point: monthly for stable agents, after every significant prompt or model change, and immediately after any incident. Retirement flag logic requires at least two entries to activate.

05RAG Calculation Logic

Resolution Rate (higher better)Amber if −5pp from baseline. Red if −15pp.

Accuracy (higher better)Amber if −5pp from baseline. Red if −15pp.

Time Saved (higher better)Amber if −10% from baseline. Red if −25%. Percentage drop, not pp.

Token Cost (lower better)Amber if +20% above baseline. Red if +50%. Percentage rise.

Escalation Rate (lower better)Amber if +10pp above baseline. Red if +25pp.

Satisfaction (higher better)Amber if −0.5 points from baseline. Red if −1.0 points.

Overall RAG for an entry = worst single metric. One red metric makes the whole entry red.

ℹ

Thresholds are adjustable in the METRICS array in the script block. If the baseline was set during a peak-performance period, amber may trigger too aggressively — lower the thA values to widen the green band.

06Retirement Flag

Auto-triggers when the last two consecutive log entries both have overall RED status. A red banner appears on the agent detail panel.

Acknowledging

Click "Acknowledge" on the retirement banner to clear the flag
If the next entry is also overall RED, the flag re-triggers
If the next entry is amber or green, the flag stays cleared
Sidebar shows ⚠ next to the agent name while flag is active

⚠

Acknowledging records that the flag was reviewed — it does not decommission the agent. Formal end-of-life requires the Agent Version Tracker archive workflow.

Manual Flag

The retirement flag can also be set manually from the agent header actions. Useful for flagging an agent for qualitative concerns before two RED entries accumulate.

07Sync from Agent Metrics Dashboard

When the Agent Metrics Dashboard syncs, this tool detects the payload on load and shows a green banner at the bottom of the page.

Import Flow

Agent Metrics Dashboard writes payload to localStorage["msp-benchmark-sync"]
Open this tool in the same browser — green banner appears immediately
Banner shows: agent hint, source, brief description of what is pre-filled
Click "Import and Pre-fill Entry" — tool matches agent hint to a loaded agent, selects it, opens Log Entry modal
Resolution rate, escalation rate, time saved, cost are pre-filled
Accuracy and satisfaction remain blank — enter from spot-check and survey
Notes field pre-filled with sync source and date
Click "Log Entry" — the sync key is cleared from localStorage after save

ℹ

If the agent name in the sync payload does not match any loaded agent, the modal opens without a pre-selection. Select the agent manually before saving.

Dismissing Without Importing

Click "Dismiss" on the banner. The sync key is cleared. Re-sync from the Agent Metrics Dashboard to get the payload back.

08Data Storage and Export

Storage KeylocalStorage["msp-benchmark-v1"]. Persists across browser sessions unless storage is cleared.

Export JSON"↓ Export" in the top bar downloads the full agents array as a dated JSON file. Use for archiving.

Clear All DataRemove the key from browser console: localStorage.removeItem("msp-benchmark-v1")

Origin-SpecificlocalStorage is tied to the file origin. Moving files to a different domain will not show existing data.

⚠

Export JSON periodically as a backup. localStorage can be cleared by browser cache clearing, private browsing mode, or OS-level storage limits.

09FAQ & Troubleshooting

RAG tiles show greyNo log entries added yet. Add the first entry to activate RAG.

Retirement flag not clearingRe-triggers if the next entry after acknowledgement is also overall RED. Resolve the underlying performance issue first.

Sync banner not appearing1. Confirm Sync button was clicked on the Agent Metrics Dashboard. 2. Confirm both tools are in the same browser profile. 3. Check localStorage is not blocked by extensions or private browsing.

Agent hint does not match on importImport does a case-insensitive partial name match. If no match found, select the agent manually after the modal opens.

Baseline values appear missing after EditFields retain saved values on edit. If values appear missing after save, confirm the baseline inputs were not accidentally cleared before clicking Save.

nav.js 404 in consolePlace nav.js one directory above tools/ or update the script src to an absolute URL.