Framework + Tool · AI Ethics Consulting

AI Agent Evaluation Tool

A systematic framework for evaluating AI agents across trust, usability, accessibility, and compliance dimensions, combining NIST AI RMF, the EU AI Act, Microsoft HAX, Nielsen's heuristics, and WCAG into one actionable toolkit.

6 dimensions · 40 criteria · 12+ source frameworks
πŸ” Trust & Transparency
✨ Usability
πŸ”§ Error Recovery
β™Ώ Accessibility
πŸ›‘οΈ Safety
πŸŽ›οΈ Autonomy

The Challenge

How do you evaluate AI when the rules keep changing?

Traditional UX evaluation methods weren't designed for systems that learn, produce probabilistic outputs, and operate autonomously. When an AI agent's behavior isn't deterministic, how do you test for clarity? When the system adapts over time, how do you audit trust?

Existing frameworks address pieces of the puzzle: Microsoft HAX covers human-AI interaction, Nielsen's heuristics assess usability, and WCAG ensures accessibility. But none provides a unified evaluation approach for modern AI agents.

Framework Methodology

This framework synthesizes 12+ established guidelines into 6 evaluation dimensions. Each dimension carries a weight and a set of criteria, and each finding is assigned a severity level, so recommendations come out prioritized rather than as a flat checklist.

Source Frameworks

AI-Specific
  • NIST AI Risk Management Framework
  • EU AI Act Compliance
  • Microsoft HAX Guidelines (18 principles)
  • Google PAIR Guidelines
  • Anthropic Constitutional AI
  • IBM AI Ethics
Foundational UX
  • Nielsen's 10 Usability Heuristics
  • Google Conversational Design
  • Apple Human Interface Guidelines
  • WCAG 2.1 (AA level)
  • WAI-ARIA Best Practices
  • Stanford HAI Guidelines

6 Evaluation Dimensions

🔍 Trust & Transparency
Weight: 1.2× · 8 criteria
Confidence calibration, source attribution, capability disclosure, decision explanation

✨ Usability & Learnability
Weight: 1.0× · 7 criteria
System status, recognition over recall, consistency, conversational turn-taking

🔧 Error Recovery
Weight: 1.1× · 6 criteria
Error clarity, repair strategies, human escalation, graceful degradation

♿ Accessibility
Weight: 1.0× · 7 criteria
Screen reader support, keyboard navigation, cognitive load, multi-modal I/O

🛡️ Safety & Compliance
Weight: 1.3× · 6 criteria
Risk classification, harm prevention, data privacy, audit trails, bias mitigation

🎛️ Autonomy & Control
Weight: 1.2× · 6 criteria
Human-in-the-loop, override mechanisms, scope boundaries, feedback loops
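As a rough illustration, the dimension weights above can roll up into a single overall score. A minimal Python sketch, assuming per-dimension scores on a 0–100 scale and a simple weighted average; the tool's actual aggregation formula may differ:

```python
# Dimension weights as listed above; the aggregation below is an
# illustrative assumption, not the tool's published formula.
DIMENSION_WEIGHTS = {
    "Trust & Transparency": 1.2,
    "Usability & Learnability": 1.0,
    "Error Recovery": 1.1,
    "Accessibility": 1.0,
    "Safety & Compliance": 1.3,
    "Autonomy & Control": 1.2,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each 0-100)."""
    weighted_sum = sum(
        DIMENSION_WEIGHTS[dim] * score
        for dim, score in dimension_scores.items()
    )
    weight_total = sum(DIMENSION_WEIGHTS[dim] for dim in dimension_scores)
    return weighted_sum / weight_total
```

Because Safety & Compliance carries the highest weight (1.3×), a weak safety score pulls the overall result down more than an equally weak usability score would.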

Sample Evaluation: Dust.tt

Dust is an enterprise AI agent platform for building custom agents that connect to company data.

Multi-Agent System · Enterprise · RAG-based

Evaluation Tool

Use this interactive tool to evaluate any AI agent. Your progress is saved automatically.

Need help evaluating your AI products?

I help teams audit AI experiences using structured frameworks, turning complex requirements into actionable design improvements.