Evaluation and Metrics · Faysal Ahmed

What to Measure

Evaluation of AI agents is different from evaluation of traditional software. Features either work or they do not. Agent behaviour is probabilistic, context-dependent, and changes over time as models update and user patterns shift.

A good evaluation framework captures three dimensions: task completion (did it work?), interaction quality (was the experience good?), and safety (did anything go wrong?).

Task Completion Metrics

The primary measure of an agent is whether it accomplishes the user’s goal:

Metric	Definition	How to measure
Task success rate	% of tasks completed without user intervention	Automated from action logs
First-attempt success	% of tasks completed on the first try	Track retry and escalation events
Time to completion	Time from request to finished action	Compare to manual baseline
User correction rate	% of tasks where user modified the result	Log override and undo events
Escalation rate	% of tasks escalated to human	Count escalation triggers

Table 8.1 — Core task completion metrics for agent evaluation.

Interaction Quality Metrics

Beyond task completion, measure how the interaction feels:

Feedback submission rate — are users bothering to rate or correct the agent? Low submission may indicate poor feedback design, not satisfaction.
Feedback sentiment — aggregate thumbs up/down or star ratings.
Intervention effort — how many clicks or keystrokes does it take to correct an error? Lower is better.
Abandonment rate — how often do users give up on the agent and do the task manually?
Adoption persistence — does usage increase, decrease, or plateau over weeks and months?

Design pattern

Build instrumentation into the agent from day one. If you cannot tell whether the agent is getting better or worse over time, you cannot improve it systematically.

Safety and Harm Metrics

Safety metrics track negative outcomes:

Harm incidents — number of events where the agent caused financial, reputational, or emotional harm.
Guardrail violations — how often the agent attempted an action outside its defined scope.
Bias incidents — systematically different treatment of user groups.
Recovery time — time from incident detection to remediation.

These metrics should be reviewed at the organisational governance level, not just by the product team.

User Satisfaction Surveys

Quantitative metrics miss what users feel. Supplement with qualitative instruments:

Trust survey — “I trust this agent to act on my behalf” (1–5 scale).
Control survey — “I feel in control of what this agent does” (1–5 scale).
Transparency survey — “I understand why this agent makes the decisions it does” (1–5 scale).
Net Promoter Score — “Would you recommend this agent to a colleague?”

Anti-pattern

Measuring only task success rate and ignoring user sentiment. An agent that completes 95% of tasks but makes users feel anxious or out of control will eventually be abandoned.

A/B Testing Agent Behaviour

Changes to agent behaviour should be tested like any product change:

Shadow comparison — new agent version runs in parallel with the current version; compare outcomes without user impact.
User segment rollout — roll out changes to a small user segment first, measure all metrics, then expand.
Holdout groups — keep a control group on the old version to measure relative improvement.

Statistical significance matters. Agent improvements can be small per-interaction but compound across thousands of interactions.

Continuous Evaluation Pipeline

Evaluation should not be a point-in-time activity. Build a continuous pipeline:

Production logs → feature extraction → metric computation → dashboard → alerting

Daily metrics — task success, error rates, latency.
Weekly reviews — trend analysis, incident review, feedback sentiment.
Monthly deep dives — cohort analysis, bias checks, survey results.
Quarterly governance review — overall performance against goals, decisions about scope changes.

Key Takeaways

Measure three dimensions: task completion, interaction quality, and safety.
Instrument the agent from day one — you cannot improve what you do not measure.
Supplement quantitative metrics with qualitative surveys about trust, control, and transparency.
Test behaviour changes with shadow comparisons, user segments, and holdout groups.
Build a continuous evaluation pipeline with daily, weekly, monthly, and quarterly rhythms.

Next: Chapter 9 — Case Studies