Why Safety Is a UX Concern
Safety in AI systems is often framed as a model-level problem — alignment, guardrails, red-teaming. But from the user’s perspective, safety is an interaction problem. The user needs to know: can I trust this system not to harm me when something goes wrong?
A safe AI agent is not one that never fails — it is one that fails in predictable, recoverable, and transparent ways.
The Failure Taxonomy
Not all failures are alike. Different failure modes require different mitigation strategies:
| Failure mode | Description | Example | Severity |
|---|---|---|---|
| Hallucination | Model generates false information confidently | Agent states a meeting was confirmed when it was not | High |
| Misinterpretation | Model misunderstands user intent | User says "delete the old version" — agent deletes the current one | High |
| Ambiguity | Multiple valid interpretations | "Schedule it next week" — Monday or Friday? | Medium |
| Context drift | Model loses track of earlier context | Forgets a constraint set earlier in the conversation | Medium |
| Bias | Model produces systematically skewed output | Prioritises certain senders over others without justification | Medium |
| Timeout / resource limit | Task takes too long or exceeds capacity | Agent stalls while processing a large email thread | Low |
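
One way to make the taxonomy actionable is to encode it as data, so that handling code can branch on failure mode and severity. The following is a minimal sketch: the `FailureMode` and `Severity` names are illustrative, the severity ratings are copied from the table, and the response policy in `default_response` is an assumption made for the example rather than a recommendation from any framework.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class FailureMode(Enum):
    HALLUCINATION = "hallucination"
    MISINTERPRETATION = "misinterpretation"
    AMBIGUITY = "ambiguity"
    CONTEXT_DRIFT = "context drift"
    BIAS = "bias"
    TIMEOUT = "timeout / resource limit"


# Severity ratings taken from the failure taxonomy table.
SEVERITY = {
    FailureMode.HALLUCINATION: Severity.HIGH,
    FailureMode.MISINTERPRETATION: Severity.HIGH,
    FailureMode.AMBIGUITY: Severity.MEDIUM,
    FailureMode.CONTEXT_DRIFT: Severity.MEDIUM,
    FailureMode.BIAS: Severity.MEDIUM,
    FailureMode.TIMEOUT: Severity.LOW,
}


def default_response(mode: FailureMode) -> str:
    """Pick a coarse response by severity (assumed policy, for illustration only)."""
    severity = SEVERITY[mode]
    if severity is Severity.HIGH:
        return "stop and confirm with the user"
    if severity is Severity.MEDIUM:
        return "flag the issue and ask for clarification"
    return "retry, then report"
```

Keeping the mapping in one place means a change in how a failure mode is rated only has to be made once.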
Designing for Graceful Degradation
When an agent cannot complete a task, it should degrade along a predictable path:
- Attempt — try the primary path.
- Fall back — try a simpler approach or reduced scope.
- Escalate — ask the user for guidance.
- Abort — stop and clearly communicate what was not done.
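```python
def schedule_meeting(request, confidence):
    # confidence: the agent's estimated certainty (0 to 1) that it understood
    # the request; the thresholds below are illustrative.
    # auto_schedule, propose_times and escalate are defined elsewhere.
    if confidence > 0.9:
        auto_schedule(request)  # attempt the primary path
    elif confidence > 0.6:
        propose_times(request)  # fall back to suggestion
    else:
        escalate(request, reason="cannot determine availability")
```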
Always tell the user which degradation level the system is operating at. A user who sees "I could not resolve this automatically. Here are three options." is more likely to trust the system than one who receives a silent partial completion.
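
The full ladder, including the reporting of which level was reached, might be sketched as follows. This is a hedged illustration, not a definitive implementation: `Outcome`, `run_with_degradation`, and the three callables are hypothetical names, and catching a broad `Exception` is a simplification for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    level: str    # "attempt", "fall back", "escalate", or "abort"
    message: str  # what the user is told


def run_with_degradation(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    ask_user: Callable[[], Optional[str]],
) -> Outcome:
    """Walk attempt -> fall back -> escalate -> abort and report the level reached."""
    try:
        return Outcome("attempt", primary())
    except Exception as primary_error:  # broad catch keeps the sketch short
        try:
            suggestion = fallback()
            return Outcome(
                "fall back",
                f"I could not resolve this automatically ({primary_error}). {suggestion}",
            )
        except Exception:
            answer = ask_user()  # assumed to return None if the user gives no guidance
            if answer is not None:
                return Outcome("escalate", f"Proceeding with your guidance: {answer}")
            return Outcome("abort", "Stopped. The task was not completed.")
```

For example, if `primary` raises and `fallback` returns "Here are three options.", the returned `Outcome` has `level == "fall back"` and a message that says why the automatic path failed.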
Guardrails and Constraints
Guardrails prevent the agent from operating outside safe bounds:
- Scope guardrails — limit which systems, data, or actions the agent can access.
- Value guardrails — enforce business rules (“never approve invoices over $1,000 without manager approval”).
- Temporal guardrails — prevent actions outside allowed time windows.
- Rate guardrails — cap the number or frequency of actions.
Guardrails should be transparent and user-configurable. A hard-coded guardrail that the user only discovers when it blocks an action erodes trust.
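
The four guardrail types above can be expressed as a single check that returns a user-facing reason whenever it blocks an action. The sketch below is illustrative only: the `Action` and `Guardrails` classes, their field names, and the default values (an 08:00 to 18:00 window, 20 actions per hour) are assumptions; only the 1,000 approval threshold comes from the example above.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class Action:
    system: str                     # target system, e.g. "calendar"
    amount: float                   # monetary value, 0 if not applicable
    at: time                        # local time at which the action would run
    manager_approved: bool = False


@dataclass
class Guardrails:
    allowed_systems: set            # scope guardrail
    approval_threshold: float = 1000.0   # value guardrail
    window_start: time = time(8, 0)      # temporal guardrail (assumed default)
    window_end: time = time(18, 0)
    max_actions_per_hour: int = 20       # rate guardrail (assumed default)


def check(action: Action, rails: Guardrails, actions_this_hour: int) -> Optional[str]:
    """Return a reason to show the user if a guardrail blocks the action, else None."""
    if action.system not in rails.allowed_systems:
        return f"scope: '{action.system}' is not an allowed system"
    if action.amount > rails.approval_threshold and not action.manager_approved:
        return f"value: amounts over {rails.approval_threshold:.0f} need manager approval"
    if not (rails.window_start <= action.at <= rails.window_end):
        return "temporal: outside the allowed time window"
    if actions_this_hour >= rails.max_actions_per_hour:
        return "rate: hourly action limit reached"
    return None
```

Because the limits live in a plain `Guardrails` object rather than in the logic, they can be shown to the user and adjusted, which supports the transparency goal described above.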
The Confirmation Boundary
Not every action needs confirmation, but some do. Define a confirmation boundary based on:
- Irreversibility — can the action be undone? If not, require confirmation.
- Impact — does the action affect other people or systems? If yes, confirm.
- Cost — does the action have financial or reputational cost? If yes, confirm.
- Novelty — has the agent performed this action before? If not, confirm.
Asking for confirmation on every action defeats the purpose of an agent. Users will develop confirmation fatigue and blindly approve. Reserve confirmations for genuinely high-stakes decisions.
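
A confirmation boundary built on the four criteria above could look like the following sketch. The `ProposedAction` fields and function names are hypothetical; how an agent actually determines reversibility, impact, cost, and novelty is outside the scope of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProposedAction:
    reversible: bool       # can the action be undone?
    affects_others: bool   # does it touch other people or systems?
    has_cost: bool         # financial or reputational cost?
    seen_before: bool      # has the agent performed this action before?


def confirmation_reasons(action: ProposedAction) -> List[str]:
    """List every criterion that pushes the action over the confirmation boundary."""
    reasons = []
    if not action.reversible:
        reasons.append("irreversibility")
    if action.affects_others:
        reasons.append("impact")
    if action.has_cost:
        reasons.append("cost")
    if not action.seen_before:
        reasons.append("novelty")
    return reasons


def needs_confirmation(action: ProposedAction) -> bool:
    """Confirm only when at least one criterion applies."""
    return bool(confirmation_reasons(action))
```

Returning the reasons, not just a yes or no, lets the confirmation prompt explain why the agent is asking, and a routine, reversible, private, zero-cost action passes through without interrupting the user.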
Safety Nets
Beyond guardrails, design safety nets that catch failures after they occur:
- Activity journal — immutable log of every action the agent takes, with before/after state where possible.
- Time-travel undo — ability to roll back to a previous state (e.g., restore calendar to yesterday’s state).
- Kill switch — immediate suspension of all agent activity with a single action.
- Human escalation path — clear way to involve a human operator when the agent cannot resolve an issue.
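
The journal, undo, and kill switch can be illustrated together in a small in-memory sketch. Everything here is an assumption made for the example: the `AgentSession` class, its flat dictionary state, and the shallow copies it takes. A production journal would need storage-level immutability guarantees that this sketch does not provide.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class JournalEntry:
    action: str
    before: Dict[str, Any]  # state snapshot before the action
    after: Dict[str, Any]   # state snapshot after the action


class AgentSession:
    def __init__(self, state: Dict[str, Any]):
        self.state = dict(state)
        self._journal: List[JournalEntry] = []
        self.suspended = False

    @property
    def journal(self) -> Tuple[JournalEntry, ...]:
        """Read-only view of the append-only log."""
        return tuple(self._journal)

    def apply(self, action: str, key: str, value: Any) -> None:
        if self.suspended:
            raise RuntimeError("agent is suspended")
        before = dict(self.state)  # shallow copy; enough for flat state
        self.state[key] = value
        self._journal.append(JournalEntry(action, before, dict(self.state)))

    def roll_back(self, steps: int = 1) -> None:
        """Restore the state recorded before the last `steps` journal entries."""
        if not 1 <= steps <= len(self._journal):
            raise ValueError("cannot roll back that many steps")
        target = self._journal[-steps].before
        # The rollback is itself journaled, so the log is never rewritten.
        self._journal.append(
            JournalEntry(f"roll back {steps} step(s)", dict(self.state), dict(target))
        )
        self.state = dict(target)

    def kill(self) -> None:
        """Kill switch: refuse all further actions until a human intervenes."""
        self.suspended = True
```

Note that rollbacks are appended to the journal rather than erasing entries, so the record of what the agent did, and what was undone, stays complete.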
Testing for Safety
Test agent interactions against these scenarios:
- Adversarial input — does the user’s phrasing cause the agent to bypass guardrails?
- Out-of-distribution requests — what happens when asked to do something clearly outside scope?
- Cascade failures — if one dependent service fails, does the agent handle it gracefully or compound the error?
- Long-running tasks — does the agent maintain context and safety constraints over hours or days?
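
Two of these scenarios, out-of-scope requests and cascade failures, can be written as ordinary unit tests. The sketch below uses Python's standard `unittest` module against a deliberately tiny stand-in agent; `ToyAgent`, `CalendarDown`, and the intent names are hypothetical and exist only to show the shape of such tests.

```python
import unittest


class CalendarDown(Exception):
    """Raised by the fake calendar service to simulate an outage."""


class ToyAgent:
    ALLOWED_INTENTS = {"schedule"}

    def __init__(self, calendar):
        self.calendar = calendar   # callable standing in for a dependent service
        self.actions_taken = []

    def handle(self, intent, payload):
        if intent not in self.ALLOWED_INTENTS:
            return "refused: out of scope"
        try:
            self.calendar(payload)
        except CalendarDown:
            return "aborted: calendar unavailable, nothing was scheduled"
        self.actions_taken.append((intent, payload))
        return "done"


def broken_calendar(payload):
    raise CalendarDown()


class SafetyScenarios(unittest.TestCase):
    def test_out_of_scope_request_is_refused(self):
        agent = ToyAgent(calendar=lambda payload: None)
        self.assertEqual(agent.handle("wire_money", {}), "refused: out of scope")
        self.assertEqual(agent.actions_taken, [])

    def test_dependency_failure_does_not_compound(self):
        agent = ToyAgent(calendar=broken_calendar)
        reply = agent.handle("schedule", {"when": "next week"})
        self.assertTrue(reply.startswith("aborted"))
        self.assertEqual(agent.actions_taken, [])
```

Each test asserts two things: what the agent told the user, and that no action was recorded. Checking both guards against the silent partial completions that undermine trust.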
Key Takeaways
- Safe agents are not agents that never fail; they fail predictably, recoverably, and transparently.
- Design degradation paths: attempt → fall back → escalate → abort.
- Define clear confirmation boundaries based on irreversibility, impact, cost, and novelty.
- Provide safety nets: journal, undo, kill switch, escalation path.
- Test adversarial, out-of-distribution, cascade, and long-running scenarios explicitly.