Knowledge Base · MSP Operations
Agent Benchmark
Log
Complete reference for the Agent Benchmark Log — the governance record for deployed AI agent performance. Tracks six benchmark metrics against deployment baselines, calculates RAG status automatically, surfaces retirement flags for underperforming agents, and accepts metric data synced from the Agent Metrics Dashboard.
PERFORMANCE TRACKING RAG STATUS RETIREMENT SIGNALS BENCHMARK SYNC

The Agent Benchmark Log is the performance governance record for every deployed AI agent. Each agent has a baseline set at deployment and periodic log entries added over time. The tool calculates whether each metric is holding, degrading, or well below baseline — and automatically flags agents that hit RED status on two consecutive entries.

This answers the question: is this agent still worth running? It does not pull live data — that is the Agent Metrics Dashboard's job. This is the structured log that makes performance decisions auditable.

Six Tracked Metrics
  • Resolution Rate (%): handled without escalation — higher is better
  • Accuracy (%): correct or acceptable output rate — higher is better
  • Time Saved / Week (hrs): vs manual baseline — higher is better
  • Token Cost ($ per interaction): API or seat cost — lower is better
  • Escalation Rate (%): routed to human after AI touch — lower is better
  • User Satisfaction (1–5): average score — higher is better
RAG Thresholds
  • Green: within normal variance of baseline
  • Amber: −5pp accuracy/resolution, −10% time saved, +20% cost, +10pp escalation, −0.5 satisfaction
  • Red: −15pp accuracy/resolution, −25% time saved, +50% cost, +25pp escalation, −1.0 satisfaction
  • Overall RAG = worst single metric across all six

Email-client layout: fixed left sidebar listing agents, detail panel on the right for the selected agent.

Left Sidebar
RAG DotGreen/amber/red circle showing overall current status. Grey = no entries yet.
Agent NameAppended with ⚠ if a retirement flag is active.
Phase ChipDeployment phase (01–04) associated with the agent.
Last DateMost recent log entry date, or "No entries".
SearchFilters agents by name as you type.
Detail Panel — Agent Header

Agent name, description, phase, owner, deploy date, model, and entry count. Overall RAG badge on the right. Edit and Delete actions.

Delete is permanent — removes the agent and all log entries. For end-of-life agents use the Agent Version Tracker archive workflow instead.
Detail Panel — Baseline Block

The six baseline values set at deployment. These are the benchmarks all log entries are measured against. Read-only after initial set — click Edit on the agent header to revise if re-calibration is needed. Baseline changes affect all existing RAG calculations retroactively.

Detail Panel — Current vs Baseline Tiles

Six metric tiles showing the most recent log entry values with delta vs baseline, trend arrow direction, and individual RAG color. Only visible once at least one log entry exists.

Detail Panel — Performance Log Table

All log entries newest-first. Columns: RAG dot (overall status for that entry), date, all six metric values, notes (truncated with full text on hover), logged-by, and a delete button per row.

Adding a New Agent
  • Click "+ Add Agent" in the top bar
  • Fill in: name, phase, owner, deploy date, model, description
  • Enter the six baseline metric values (can be left blank and added later via Edit)
  • Click Save — agent appears in sidebar and is selected automatically
Baselines can be set later. Metric tiles show "No baseline set" and RAG shows grey until baselines are entered.
Seed Data

Four seed agents load on first open to demonstrate the interface:

IVR Triage AgentGreen. Three months of steady improvement.
NOC Alert SummarizerRed. Two consecutive red entries — retirement flag active.
Ticket Intake ClassifierGreen. Best performer. Cost flat, satisfaction highest.
Copilot Prompt AssistantAmber trending to red. Low adoption, rising cost.

Clear seed agents by clicking Delete on each, or clear all by removing localStorage["msp-benchmark-v1"] from the browser console.

Adding an Entry
  • Select the agent in the sidebar
  • Click "+ Log Entry" above the performance log table
  • Set date and logged-by initials
  • Enter values for metrics you have data for — fields can be left blank
  • Add notes: prompt changes, volume changes, incidents, or measurement context
  • Click "Log Entry" — saved immediately, RAG tiles update
Entry modal pre-fills placeholder text with the baseline value (e.g. "Baseline: 82.0") so you can see what you are comparing against as you type.
Partial Entries

Not all six metrics need to be populated. Blank fields are excluded from RAG calculation — they do not drag the RAG down or pull it up.

Recommended Cadence

No fixed cadence is enforced. Suggested starting point: monthly for stable agents, after every significant prompt or model change, and immediately after any incident. Retirement flag logic requires at least two entries to activate.

Resolution Rate (higher better)Amber if −5pp from baseline. Red if −15pp.
Accuracy (higher better)Amber if −5pp from baseline. Red if −15pp.
Time Saved (higher better)Amber if −10% from baseline. Red if −25%. Percentage drop, not pp.
Token Cost (lower better)Amber if +20% above baseline. Red if +50%. Percentage rise.
Escalation Rate (lower better)Amber if +10pp above baseline. Red if +25pp.
Satisfaction (higher better)Amber if −0.5 points from baseline. Red if −1.0 points.

Overall RAG for an entry = worst single metric. One red metric makes the whole entry red.

Thresholds are adjustable in the METRICS array in the script block. If the baseline was set during a peak-performance period, amber may trigger too aggressively — lower the thA values to widen the green band.

Auto-triggers when the last two consecutive log entries both have overall RED status. A red banner appears on the agent detail panel.

Acknowledging
  • Click "Acknowledge" on the retirement banner to clear the flag
  • If the next entry is also overall RED, the flag re-triggers
  • If the next entry is amber or green, the flag stays cleared
  • Sidebar shows ⚠ next to the agent name while flag is active
Acknowledging records that the flag was reviewed — it does not decommission the agent. Formal end-of-life requires the Agent Version Tracker archive workflow.
Manual Flag

The retirement flag can also be set manually from the agent header actions. Useful for flagging an agent for qualitative concerns before two RED entries accumulate.

When the Agent Metrics Dashboard syncs, this tool detects the payload on load and shows a green banner at the bottom of the page.

Import Flow
  • Agent Metrics Dashboard writes payload to localStorage["msp-benchmark-sync"]
  • Open this tool in the same browser — green banner appears immediately
  • Banner shows: agent hint, source, brief description of what is pre-filled
  • Click "Import and Pre-fill Entry" — tool matches agent hint to a loaded agent, selects it, opens Log Entry modal
  • Resolution rate, escalation rate, time saved, cost are pre-filled
  • Accuracy and satisfaction remain blank — enter from spot-check and survey
  • Notes field pre-filled with sync source and date
  • Click "Log Entry" — the sync key is cleared from localStorage after save
If the agent name in the sync payload does not match any loaded agent, the modal opens without a pre-selection. Select the agent manually before saving.
Dismissing Without Importing

Click "Dismiss" on the banner. The sync key is cleared. Re-sync from the Agent Metrics Dashboard to get the payload back.

Storage KeylocalStorage["msp-benchmark-v1"]. Persists across browser sessions unless storage is cleared.
Export JSON"↓ Export" in the top bar downloads the full agents array as a dated JSON file. Use for archiving.
Clear All DataRemove the key from browser console: localStorage.removeItem("msp-benchmark-v1")
Origin-SpecificlocalStorage is tied to the file origin. Moving files to a different domain will not show existing data.
Export JSON periodically as a backup. localStorage can be cleared by browser cache clearing, private browsing mode, or OS-level storage limits.
RAG tiles show greyNo log entries added yet. Add the first entry to activate RAG.
Retirement flag not clearingRe-triggers if the next entry after acknowledgement is also overall RED. Resolve the underlying performance issue first.
Sync banner not appearing1. Confirm Sync button was clicked on the Agent Metrics Dashboard. 2. Confirm both tools are in the same browser profile. 3. Check localStorage is not blocked by extensions or private browsing.
Agent hint does not match on importImport does a case-insensitive partial name match. If no match found, select the agent manually after the modal opens.
Baseline values appear missing after EditFields retain saved values on edit. If values appear missing after save, confirm the baseline inputs were not accidentally cleared before clicking Save.
nav.js 404 in consolePlace nav.js one directory above tools/ or update the script src to an absolute URL.