
Designing Transparency into Agentic AI

PR→PO Copilot is an AI-assisted procurement prototype where an LLM-powered copilot guides requisition creation, while deterministic tools and rules handle lookups, routing, and audit logging. The challenge: how do you make AI decisions visible, auditable, and trustworthy in enterprise workflows?

Design Exploration · 4 AI Agents Mapped · 4 Transparency Patterns · Full Working Prototype

Project
Design exploration prototype
Timeline
2025–2026 (Design exploration)
Platform
Web (Next.js)
My role
UX Design + Prototyping (AI-assisted development)
Overview

From "AI magic" to enterprise clarity

This is a spending management copilot that transforms how enterprises handle procurement requests. The system combines LLM-powered natural language understanding with deterministic business rules — the AI interprets what users want to buy, while rule-based agents handle catalog lookups, supplier validation, approval routing, and audit logging. This separation ensures every decision is traceable and compliant.

Three user roles, one unified experience

Employees (Requesters)

Describe what they need in plain language — "I need a laptop for the new hire starting Monday" — and the copilot handles catalog matching, supplier selection, and form completion automatically.

Managers (Approvers)

See AI-generated summaries with risk signals, budget impact, and policy compliance status at a glance. Approve, reject, or request more information without switching between ERP screens.

Buyers (Procurement)

Convert approved PRs into POs with copilot suggestions for consolidation, alternate suppliers, and contract optimization. Override AI recommendations with mandatory reason capture for continuous learning.

Core Design Challenge

How do you make AI decisions visible, auditable, and trustworthy in enterprise workflows? Most AI chat interfaces feel like a "black box" — users can't tell what the AI actually did, where the data came from, or why it made certain recommendations. This case study explores transparency patterns that build justified trust without overwhelming users.

LLM Boundary (Critical Architecture Decision)

The LLM never selects supplier IDs, calculates totals, or makes approval decisions. It only interprets user intent and generates human-readable summaries. All "decisions" flow through deterministic tool agents with explicit rule IDs — making every action auditable and reproducible.
My Role: UX Design + Prototyping

What I designed:

  • Agent orchestration architecture (LLM boundaries, tool separation)
  • Transparency pattern system (Source Chips, Agent Actions, Rule IDs)
  • Information architecture and conversation flow design
  • Trust measurement framework and KPIs

AI-assisted execution: The working prototype was built with Claude Code, allowing rapid iteration. I directed the design decisions; AI accelerated the implementation.

Full 3-minute demo walkthrough
Process

How I approached this project

This design exploration followed a diverge-converge pattern, balancing rapid prototyping with structured validation.

Weeks 1–2

Research & Framing
  • Competitive analysis of AI procurement tools
  • Enterprise AI trust pattern research
  • Defined core design questions

Weeks 3–4

Architecture Design
  • Agent orchestration model
  • Service blueprint mapping
  • LLM boundary definition

Weeks 5–8

Prototyping & Iteration
  • AI-assisted UI development
  • Transparency pattern exploration
  • 3 major design iterations

Ongoing

Documentation
  • Pattern library creation
  • Case study writing
  • Measurement framework

Collaboration

While this is a solo design exploration, I simulated a cross-functional process: I wore multiple hats as product thinker (defining scope and priorities), domain researcher (procurement workflow analysis), and technical architect (API constraints and AI capabilities). In a production setting, this work would involve collaboration with procurement SMEs for domain validation, engineering leads for feasibility review, and compliance teams for audit requirements.

Key design decisions & trade-offs

Why separate LLM from deterministic tools?

Considered: End-to-end LLM with function calling for everything.
Chose: LLM for interpretation only; rule-based agents for decisions.
Rationale: Enterprise auditors need reproducible decision paths. "The AI decided" isn't acceptable for procurement compliance.
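This boundary can be sketched in a few lines — a minimal Python sketch, where the catalog contents and `Intent` shape are assumptions for illustration: the LLM's only output is a structured intent, and everything downstream is a deterministic, reproducible lookup.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    """The only thing the LLM produces: interpreted intent, never a decision."""
    item_query: str
    quantity: int = 1

# Assumed sample catalog data for illustration.
APPROVED_CATALOG = {
    "laptop": {"sku": "LAPTOP-001", "price": 1200.0, "vendor": "V-ACME"},
}

def lookup_catalog(intent: Intent) -> dict:
    """Deterministic tool: exact-key lookup, no model in the loop."""
    item = APPROVED_CATALOG.get(intent.item_query)
    if item is None:
        raise LookupError(f"No catalog match for {intent.item_query!r}")
    return {**item, "line_total": item["price"] * intent.quantity}

# The LLM would turn "I need two laptops" into Intent("laptop", 2);
# the resulting supplier, SKU, and total are fully reproducible.
line = lookup_catalog(Intent("laptop", 2))
```

Rerunning the same intent always yields the same supplier and total, which is exactly what an auditor needs.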

Why collapsed-by-default for Agent Actions?

Considered: Always-visible action log (like terminal output).
Chose: Progressive disclosure — collapsed by default, expandable on demand.
Rationale: 80% of users trust the summary; 20% want to verify. Don't punish the majority with information overload.

Why Source Chips instead of footnotes?

Considered: Traditional citations at bottom of messages.
Chose: Inline colored chips next to each data point.
Rationale: In high-frequency workflows, glanceability beats thoroughness. Users scan, not read.

Each decision was documented with alternatives considered — making it easier to revisit choices as requirements evolve.

System Design

Orchestration model

Before designing the UI, I mapped out how different agents coordinate in this workflow. The principle: "Make the agent's reasoning available on demand."

User Intent
Copilot (LLM)
Interpret intent · Generate summary · Ask clarifying questions
Tool Agents (Deterministic)
Catalog lookup · Totals calculation · Vendor validation · Policy checks
Rules Engine
Routing decisions · Rule IDs · Thresholds
Risk Signal Agent
Confidence flags · Missing info · Anomaly detection
Human Handoff
Low confidence · Policy conflict · Exception request

Agent responsibilities

Copilot (LLM)

Interprets user intent, generates human-readable summaries, and asks clarifying questions when input is ambiguous. Never makes decisions.

Tool Agents

Deterministic functions: catalog lookup, price totals, vendor validation. Each call logged with inputs/outputs for audit.

Rules Engine

Evaluates business policies and returns routing decisions with explicit rule IDs (e.g., AMOUNT-001). No ML uncertainty.

Risk Signal Agent

Flags low-confidence matches, missing required fields, and unusual patterns. Triggers human review when thresholds exceeded.

This layered architecture ensures the LLM handles interpretation while deterministic systems handle decisions—making the workflow auditable and enterprise-safe.
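The "each call logged with inputs/outputs" requirement for tool agents can be sketched as a thin wrapper — a hypothetical log shape, not the prototype's actual implementation:

```python
import time

# Assumed audit-log shape: every deterministic tool call is recorded
# with its inputs and outputs so the trail is reproducible.
AUDIT_LOG: list[dict] = []

def audited(tool_name: str):
    """Decorator that logs a tool's inputs/outputs to the audit trail."""
    def wrap(fn):
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            AUDIT_LOG.append({
                "tool": tool_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "ts": time.time(),
            })
            return result
        return inner
    return wrap

@audited("calc_totals")
def calc_totals(unit_price: float, qty: int) -> float:
    """Deterministic totals calculation; no LLM involvement."""
    return round(unit_price * qty, 2)

calc_totals(1199.99, 2)
```

Because the wrapper sits at the tool boundary, the log records what actually executed rather than what a model claims happened.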

System flowchart: Happy Path vs. Conflict Path

This diagram maps the complete workflow, showing how the system handles both successful requests and edge cases requiring human intervention.

PR-PO Copilot system flowchart showing Happy Path (left) and Conflict Path (right). Happy Path: User intent flows through LLM Orchestrator to deterministic agents (Vendor Lookup, Item Search, Approval Router, Compliance), then to PR Draft and ERP submission. Conflict Path: When multiple matches or budget conflicts occur, the system triggers Human-in-the-Loop controls for user selection or manager escalation.
Service Design

From agent diagram to service blueprint

Before prototyping, I mapped the end-to-end journey showing frontstage interactions (user touchpoints), backstage processes (agent actions), and support systems (rules engine, audit log).

User Actions: Describe need → Select items → Fill details → Submit PR
Frontstage: Copilot response → Item cards → Validation UI → Routing banner
Backstage: Intent parsing → Catalog lookup → Rules check → Risk scoring
Support: LLM API · Catalog DB · Rules Engine · Audit Log

This blueprint informed the 4-layer agent model and identified where AI transparency patterns (Agent Actions, Source Chips) should appear in the flow.

The Problem

Users don't trust what they can't see

In most AI chat interfaces, the assistant's response looks like "LLM talking"—users can't tell:

  • What the AI actually did (Did it look up the catalog? Check inventory?)
  • Where the data came from (Is this from our approved vendor list?)
  • Why decisions were made (Why does this need Finance approval?)

"The AI just says 'I found 2 items'—but I can't verify if it checked our preferred vendors."

— Common user concern (pattern)

"If this goes to audit, how do I prove the AI followed our procurement policies?"

— Common governance concern (pattern)

This is the "black box" problem of agentic AI: powerful capabilities hidden behind opaque responses.

Who feels the pain?

Requesters (Employees)

Fear of procurement forms — unsure which supplier to select, which cost center applies, or what "commodity code" means. They end up emailing procurement for help, creating bottlenecks and delays for simple requests.

Approvers (Managers)

Overwhelmed by dense forms with 20+ fields. Can't quickly identify if a request exceeds budget, violates policy, or uses a non-preferred vendor. Often approve without full understanding due to time pressure.

Buyers (Procurement)

Time wasted on preventable issues: rejected POs due to incomplete data, returns from wrong supplier selection, chasing requesters for missing information. Strategic sourcing takes a back seat to firefighting.

Transformation

Before vs After

The AI Copilot fundamentally changes how procurement requests flow through the organization. Here's the comparison:

Before: Traditional Flow

1 User opens complex procurement form (20+ fields)
2 Emails procurement team for help with vendor codes
3 Waits 1-2 days for response
4 Submits form, rejected due to missing cost center
5 Resubmits, waits in approval queue
6 Approver skims form, approves without full review
Average time: 3-5 days

After: AI-Assisted Flow

1 User says "I need a laptop for new hire"
2 Copilot searches catalog, shows options with Evidence Chips
3 User selects, Copilot asks clarifying questions
4 Auto-validates against policies, flags issues
5 Approver sees summary card with risk highlights
6 One-click approval with full Audit Trail
Average time: ~15 minutes

The transformation isn't just about speed—it's about shifting cognitive load from users to the system while maintaining human control over decisions.

Conversational System

Conversation patterns

Beyond single-turn interactions, I mapped out multi-turn flows and designed explicit fallback and escalation patterns for high-risk scenarios.

Simplified dialogue flow

User: "I need a laptop"
Tool: Catalog lookup
Copilot: "Found 2 options..."
User: Select item
Missing data?
Clarifying Q
Rules check
✓ Submit PR

High-risk scenarios & escalation patterns

For each scenario, I designed specific fallback and recovery flows:

SCENARIO 1

Missing required data

Example: User doesn't specify cost center

Detect missing field → Clarifying question → Smart UI selector → Evidence logged
SCENARIO 2

Low confidence match

Example: Ambiguous supplier or SKU selection

Show source chips → Display confidence % → Force user confirm → Selection logged
SCENARIO 3

Policy conflict / Exception needed

Example: Request exceeds budget or requires special vendor

Show rule ID → Explain conflict → Offer "Request Exception" → Human handoff

Each pattern maps UI components I've already designed—validation checklist for missing data, source chips for low confidence, routing banner for policy conflicts—into a coherent conversational system.

Design Challenge

Show the agent's work without overwhelming users

The core design tension in agentic AI interfaces:

  • Too little visibility → users don't trust the AI, hesitate to adopt, and escalate to humans unnecessarily
  • Too much information → cognitive overload, slower workflows, and users start ignoring the transparency features entirely

After exploring several approaches (always-visible agent logs, expandable side panels, inline annotations), I framed my design goal as:

"Make the agent's reasoning available on demand, not in the way."

This led to a principle: progressive disclosure of agent actions. The key insight is that trust comes from knowing you can verify, not from being forced to verify every time. Show evidence when users need it (verification, audit, debugging), hide complexity when they don't (routine tasks, trusted flows).

Solution 1

Agent Actions Panel

Under each assistant message, I added a collapsible "Agent Actions" panel that shows:

  • Tool name — what the agent did (e.g., "Lookup Catalog Items")
  • Result summary — outcome (e.g., "Matched 2 items from Engineering standard kit")
  • Evidence refs — data IDs for audit trail (e.g., SKU:LAPTOP-001)
Collapsed by default · Expandable on demand · Shows tool execution sequence

Design decisions

  • Collapsed by default — doesn't interrupt the conversation flow
  • Count badge — "Agent Actions (3)" signals there's content without revealing it
  • Status icons — green checkmarks for success, red X for errors

This panel surfaces deterministic tool execution logs (function calls + outputs) rather than free-form LLM explanations.
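The collapsed-by-default behavior can be sketched as a tiny render function — a minimal Python sketch; the log-entry fields and the second SKU are assumptions:

```python
def render_agent_actions(actions: list[dict], expanded: bool = False) -> list[str]:
    """Collapsed by default: only a count badge; the full log on demand."""
    header = f"Agent Actions ({len(actions)})"
    if not expanded:
        return [header]
    lines = [header]
    for action in actions:
        icon = "✓" if action["ok"] else "✗"  # status icon per tool call
        refs = ", ".join(action["evidence"])  # evidence refs for audit
        lines.append(f"{icon} {action['tool']}: {action['summary']} [{refs}]")
    return lines

# Sample log entry (SKU:DOCK-014 is an assumed example).
actions = [
    {"tool": "Lookup Catalog Items", "ok": True,
     "summary": "Matched 2 items from Engineering standard kit",
     "evidence": ["SKU:LAPTOP-001", "SKU:DOCK-014"]},
]
```

The count badge signals there is content to verify without forcing anyone to read it.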

Solution 2

Source Chips

For line items, I designed source chips — color-coded badges that show where each piece of data came from. This addresses a common concern: "The AI says this supplier is good, but how do I know it actually checked our approved vendor list?"

  • Catalog — item found in the company's approved catalog (with SKU reference)
  • Standard kit — matches department standard equipment list (e.g., "Engineering standard laptop")
  • Preferred — vendor is on the preferred/contracted supplier list (with contract ID)
  • In stock — available in company warehouse (with location and quantity)
Source chips with hover tooltips showing reference IDs

Why this matters

Instead of text like "Matches Engineering standard kit" (which could be LLM hallucination), chips provide:

  • Scannable at a glance — color-coded chips are faster to scan than reading full sentences, allowing users to quickly assess compliance status for multiple line items
  • Audit-ready evidence — hovering reveals reference IDs (SKU, contract number, vendor ID) that can be traced back to source systems for audit verification
  • Honest attribution — users know exactly where each piece of data originated, building justified trust rather than blind faith in AI outputs
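A chip, as data, is just a provenance kind plus a traceable reference ID — a minimal sketch; the color mapping and the `CONTRACT-042` reference are assumptions:

```python
# Chip kinds from the design above; the colors are placeholder assumptions.
CHIP_COLORS = {"catalog": "blue", "standard-kit": "green",
               "preferred": "purple", "in-stock": "teal"}

def make_chip(kind: str, label: str, ref_id: str) -> dict:
    """Every chip must carry a reference ID traceable to a source system."""
    if kind not in CHIP_COLORS:
        raise ValueError(f"Unknown chip kind: {kind}")
    return {"kind": kind, "color": CHIP_COLORS[kind],
            "label": label, "tooltip": f"Ref: {ref_id}"}

chip = make_chip("preferred", "Preferred vendor", "CONTRACT-042")
```

Requiring `ref_id` at construction time is what separates a verifiable chip from a plausible-sounding sentence.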
Solution 3

Routing Banner with Reasons

After submission, a routing banner appears at the top of the PR, showing the complete approval path and explaining why each approver is required. This addresses a common frustration: "Why does this simple request need Finance approval?"

Approval path: Manager → Finance · Each routing reason has a traceable rule ID

Rule-based transparency

Each routing reason includes a rule ID (e.g., AMOUNT-001, VENDOR-001) that maps directly to documented business policies. This design choice has several important implications:

  • Users understand "why" — instead of opaque routing, users see exactly which policy triggered each approval requirement (e.g., "AMOUNT-001: Orders over $5,000 require Finance approval")
  • Auditability by design — every routing decision can be traced back to a specific rule, making compliance verification straightforward for auditors
  • No "AI magic" — the system is fundamentally a rules-driven workflow surfaced through a conversational UI. The LLM helps users navigate the workflow, but deterministic rules make the actual decisions
Solution 4

Validation Checklist

Traditional form validation shows "40% complete" — but users don't know what's actually missing or what information they need to gather. This creates a frustrating guessing game that slows down submissions.

I designed a validation checklist that explicitly lists all required fields with their current status. The copilot can also help fill in missing information through conversation:

  • ✓ At least one line item
  • ○ Cost center — "Ask your department admin or check your previous PRs"
  • ○ Need-by date — "When do you actually need this?"
  • ○ Delivery location — "Office address or building code"
  • ○ Business reason — "Brief justification for the purchase"
Validation checklist shows exactly what's missing — click any item for guidance

This transforms "Submit disabled" from a frustrating dead-end into actionable guidance. Users know exactly what they need to provide, and the copilot can help them find or fill in the information through natural conversation.
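The checklist logic itself is simple — a minimal sketch, assuming the field names and draft shape shown here: each required field gets an explicit status and a hint, instead of an opaque percent-complete.

```python
# Required fields and hints mirror the checklist above.
REQUIRED_FIELDS = {
    "cost_center": "Ask your department admin or check your previous PRs",
    "need_by_date": "When do you actually need this?",
    "delivery_location": "Office address or building code",
    "business_reason": "Brief justification for the purchase",
}

def checklist(draft: dict) -> list[dict]:
    """Per-field status instead of '40% complete'."""
    items = [{"label": "At least one line item",
              "done": len(draft.get("lines", [])) > 0, "hint": None}]
    for name, hint in REQUIRED_FIELDS.items():
        items.append({"label": name, "done": bool(draft.get(name)), "hint": hint})
    return items

def can_submit(draft: dict) -> bool:
    return all(item["done"] for item in checklist(draft))
```

The copilot can walk the same list, asking one clarifying question per unfinished item.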

Information Architecture

5 screens for the full workflow

Beyond the chat interface, I designed four additional screens to cover the complete PR→PO lifecycle. Each screen is tailored to a specific user role and task, ensuring the right information is surfaced at the right moment.

Chat + Draft Entry (Requester)

Split-panel layout: left side for natural language conversation with the copilot, right side shows the real-time PR draft with validation checklist, policy compliance hints, and remaining budget. Users can chat or directly edit fields — both paths sync automatically.

Manager Approval Card (Approver)

AI-generated summary card with risk tags (budget impact, policy flags, unusual patterns), key facts at a glance (requester, total amount, need-by date), and clear CTAs: Approve / Reject / Request Info. Designed for mobile-first approval workflows.

Buyer PO Workbench (Procurement)

Batch processing view for procurement professionals. Select multiple approved PRs, receive copilot suggestions for order consolidation, alternate suppliers with better pricing, or contract-compliant substitutions. Override AI recommendations with mandatory reason capture.

Audit Timeline (Compliance)

Complete audit trail: every action logged with timestamps, user IDs, and AI involvement markers. Shows who did what, when, and what AI assistance was used. Searchable by rule ID, user, date range, or action type. Exportable for compliance reporting.

Workflow Admin (Backoffice)

Canvas-based workflow builder for configuring agent orchestration. Administrators can drag-and-drop agent nodes, connect them with conditional logic, set confidence thresholds for human escalation, and configure business rules without code changes. View interactive prototype →

Design Principles

Honesty in prototyping

When building AI prototypes, it's tempting to hardcode impressive results. I set constraints for myself:

Hardcode rules, not results

The routing logic checks actual business rules (amount thresholds, supplier status), not a pre-determined path. This makes the prototype honest about what's deterministic vs. what would need real AI.

Mark assumptions clearly

Where I used simulated data, I added "Prototype rule" badges. This prevents stakeholders from overestimating what the demo can do.

Show evidence, not claims

Agent Actions shows actual function execution, not LLM-generated descriptions. The difference matters for trust and audit.

UX decisions for enterprise adoption

Reduce "form fear"

Use conversational guidance instead of overwhelming forms. Instead of presenting 20+ fields at once, the copilot asks targeted questions and auto-fills where possible. Users see only decision-relevant information at each step, with the full form available for power users.

Decisions happen in-context

Approvers see budget impact, risk signals, and policy flags on the same card — no switching between ERP screens to gather context. AI-generated summaries highlight what matters, with full details expandable on demand.

Clear AI boundaries

The LLM interprets user intent and generates human-readable summaries; deterministic rules engines make decisions with explicit rule IDs. This separation makes the system auditable, reproducible, and reduces the risk of LLM hallucination affecting critical business outcomes.

Enablement

Client-Ready Playbook

To enable clients and internal teams to adopt these patterns, I documented decision criteria and templates that can be used in discovery workshops, pitch decks, and implementation guides.

When to use LLM vs deterministic

  • LLM: Intent parsing, summarization, natural language generation
  • Deterministic: ID lookups, calculations, policy enforcement

Confidence thresholds

  • >90%: Auto-proceed with evidence
  • 70–90%: Show options, require selection
  • <70%: Force user confirmation
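The three bands above reduce to one small function — a sketch using the playbook's thresholds, with the return values as assumed action names:

```python
def act_on_confidence(confidence: float) -> str:
    """Map match confidence to UI behavior per the playbook thresholds."""
    if confidence > 0.90:
        return "auto_proceed_with_evidence"
    if confidence >= 0.70:
        return "show_options_require_selection"
    return "force_user_confirmation"
```

Keeping the thresholds in one place makes them easy to tune per client during discovery workshops.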

Standard escalation triggers

  • Policy rule conflict detected
  • Multiple low-confidence matches
  • User explicitly requests exception
  • Amount exceeds auto-approval threshold

Transparency UI patterns

  • Agent Actions Panel (collapsible)
  • Source Chips (color-coded provenance)
  • Rule IDs in routing banners
  • Validation checklists (not %)

Sample prompt patterns

// Clarifying question template
"I found {count} matching items. To proceed, I need:
  • {missing_field_1}
  • {missing_field_2}
You can select from the options below or type your answer."

// Policy conflict template
"This request triggers rule {rule_id}: {rule_description}.
You can:
  1. Modify the request to comply
  2. Request an exception (requires additional approval)"

Responsible AI checklist

  • ✓ No PII in prompts
  • ✓ Audit trail for all decisions
  • ✓ Human-in-the-loop for high-risk actions
  • ✓ Keyboard-navigable UI
  • ✓ Screen reader labels
  • ✓ Color not the sole indicator
Estimated Impact

Hypothetical business outcomes

Note: These are hypothetical projections based on industry benchmarks, not measured data from a production deployment. The purpose is to illustrate the types of outcomes this design aims to enable.

  • PR→PO cycle time: 5 days → 2 days
  • Rejection rate: 30% → 10%
  • Maverick spending: ~20% reduction

How these estimates are derived

  • Cycle time reduction: Conversational UI + auto-filled fields reduces form friction; AI-surfaced risk signals enable faster approvals.
  • Rejection rate: Validation checklist catches missing fields before submission; source chips ensure correct supplier/SKU selection.
  • Maverick spending: Industry data suggests tail spend can represent up to 20% of procurement costs. Policy enforcement at creation time reduces off-contract purchases.
Edge Cases

Failure modes & fallbacks

AI systems fail. The design question isn't if but how gracefully. I mapped out failure scenarios and designed explicit fallback patterns for each:

STALE DATA

Price or inventory data outdated (>30 days)

Detection

Timestamp check on vendor data

UI Signal

"Price may have changed" warning chip

Fallback

Require re-quote before PO generation
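The stale-data detection is a plain timestamp comparison — a minimal sketch, with the 30-day threshold taken from the scenario above:

```python
from datetime import datetime, timedelta

def is_stale(price_timestamp: datetime, now: datetime,
             max_age_days: int = 30) -> bool:
    """True when vendor price/inventory data is older than the threshold."""
    return now - price_timestamp > timedelta(days=max_age_days)

# Older than 30 days: trigger the "Price may have changed" chip
# and require a re-quote before PO generation.
stale = is_stale(datetime(2025, 1, 1), now=datetime(2025, 3, 1))
```

The same pattern (detect → UI signal → fallback) applies to each failure mode in this section.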

POLICY CONFLICT

Multiple rules match with conflicting outcomes

Detection

Rules engine returns >1 blocking rule

UI Signal

Show both policies with rule IDs

Fallback

Escalate to compliance officer for resolution

NO MATCH

No approved vendor found for item category

Detection

Empty results from vendor lookup tool

UI Signal

"No preferred vendors" message + suggestions

Fallback

Offer "Expand search" or manual vendor entry

LOW CONFIDENCE

LLM uncertainty exceeds threshold

Detection

Intent classification confidence <70%

UI Signal

Don't auto-fill; show "I'm not sure" prompt

Fallback

Ask clarifying question or offer options

The principle: when the system doesn't know, it should say so — and provide a clear path forward rather than guessing.

Governance

Human-in-the-loop controls

In agentic workflows, humans need control without friction. I designed specific mechanisms to balance AI efficiency with human oversight:

Override mechanism design

Buyer Override

Change AI-suggested vendor with mandatory reason selection: Better price, Existing relationship, Faster delivery, Other

Evidence Capture

Override reason + original suggestion stored in audit log. Enables pattern detection across users.

Learning Signal

High override rate for a vendor → flag for catalog review. Human corrections inform system improvement.
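The override mechanism can be sketched as a guarded write to the audit log — a minimal Python sketch; the reason codes follow the list above, and the log shape is an assumption:

```python
# Reason codes from the override design above.
OVERRIDE_REASONS = {"better_price", "existing_relationship",
                    "faster_delivery", "other"}
OVERRIDE_LOG: list[dict] = []

def override_vendor(pr_id: str, suggested: str, chosen: str, reason: str) -> None:
    """Mandatory reason capture: an override without a valid reason is rejected."""
    if reason not in OVERRIDE_REASONS:
        raise ValueError(f"Override requires a reason from {sorted(OVERRIDE_REASONS)}")
    OVERRIDE_LOG.append({"pr": pr_id, "suggested": suggested,
                         "chosen": chosen, "reason": reason})

def overrides_against(vendor: str) -> int:
    """Learning signal: how often users override away from this vendor."""
    return sum(1 for e in OVERRIDE_LOG if e["suggested"] == vendor)

override_vendor("PR-1", "V-ACME", "V-OTHER", "better_price")
```

A high `overrides_against` count for one vendor is the flag for catalog review described above.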

When humans must intervene

REQUIRED
Exception requests — When PR violates policy and user requests override
REQUIRED
Non-catalog items — Manual vendor entry requires procurement review
RECOMMENDED
High-value orders — Above threshold, system recommends (doesn't require) review
OPTIONAL
Any time — User can always click "Talk to procurement" to escalate

The goal: humans can intervene at any point, but are required to intervene only in the high-risk scenarios where the system explicitly demands it.

Measurement

Trust Scorecard

To validate that transparency patterns actually build trust, I defined a measurement framework with leading indicators (behavior) and lagging outcomes (business impact).

North Star Metric

Target

90%+

PO Approval Rate without Buyer Override

Trust metrics

  • Evidence Coverage: baseline ~60%, target 95%+ (every line item has source attribution)
  • Override Rate: baseline ~30%, target <10% (low override means AI suggestions are trusted)
  • Source Chip Click Rate: baseline n/a, target 40%+ (users verifying sources means transparency works)
  • Time to Approve: baseline 15 min, target <5 min (trust plus clarity means faster decisions)
  • PO Return Rate: baseline ~20%, target <5% (validation catches errors before submission)

Event tracking plan

To measure these, I defined specific events to instrument:

  • expand_agent_actions
  • click_source_chip
  • override_vendor
  • request_requote
  • escalate_to_human
  • view_routing_reason
  • submit_with_warning

Success threshold: If evidence coverage reaches 95%+ and override rate drops below 10%, we can conclude that transparency patterns are building justified trust — not blind trust.
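A minimal tracker sketch that enforces this event vocabulary — the event names come from the plan above; the payload fields are assumptions:

```python
# Allowed event names from the tracking plan.
TRACKED = {"expand_agent_actions", "click_source_chip", "override_vendor",
           "request_requote", "escalate_to_human", "view_routing_reason",
           "submit_with_warning"}
EVENTS: list[dict] = []

def track(name: str, **props) -> None:
    """Reject unknown event names so the instrumentation stays consistent."""
    if name not in TRACKED:
        raise ValueError(f"Untracked event: {name}")
    EVENTS.append({"event": name, **props})

track("click_source_chip", chip_kind="preferred")
```

Rejecting unlisted names keeps dashboards honest: every metric maps to a deliberately designed event.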

Evaluation

KPIs & continuous learning

To continuously improve conversational AI experiences, I defined metrics that track both task success and user trust signals.

KPIs for prototype evaluation

  • Task success rate: PR submitted → PO generated (higher is better)
  • Correction loop rate: clarifying exchanges per session (monitor the trend)
  • Escalation rate: human handoffs with reason breakdown (investigate spikes)
  • Policy conflict rate: top 5 rules triggered (informs policy review)
  • Time-to-complete: intent → submit duration (lower is better)
  • User trust signals: Actions panel expand rate, source chip hover rate (qualitative insight)

Learning from conversation logs

Structured logs enable continuous improvement:

  • Top intents: What do users ask for most? Are prompts optimized for these?
  • Failure points: Where do sessions drop off? Which fields cause most clarifying questions?
  • Prompt iteration: A/B test response templates based on correction loop data
  • Policy friction: Which rules trigger exceptions most often? Flag for policy review

Even in a prototype, designing for measurability creates a foundation for production iteration.

Reflection

What I learned

Key insights

Progressive disclosure is key for AI transparency

Users don't need to see every tool call—but they need to know they can see it. The collapsed-by-default pattern respects attention while enabling verification.

Color coding beats text for scanability

Source chips communicate data provenance faster than sentences. In high-frequency workflows, visual differentiation reduces cognitive load.

Rule IDs make AI auditable

Connecting every AI decision to a numbered business rule transforms "the AI decided" into "rule AMOUNT-001 triggered." This is the difference between magic and accountability.

Challenges encountered

Balancing transparency with overwhelm

Early iterations showed too much — every API call, every rule check. Users felt bombarded. The breakthrough was realizing transparency is about access to information, not forcing information. Progressive disclosure solved this.

Defining the LLM boundary

Initial designs let the LLM do more — including supplier selection. But every "AI decision" created an audit problem. The hardest part was convincing myself to restrict the AI's role, even though it could technically do more.

Designing without real users

As a design exploration, I couldn't validate with actual procurement teams. I compensated by studying published enterprise AI research and designing explicit measurement frameworks for future validation.

If I did this again

  • Start with user research: Even informal interviews with procurement professionals would have grounded assumptions earlier.
  • Build a simpler first version: I over-invested in the prototype UI before validating core patterns. A lower-fidelity test would have been more efficient.
  • Document failures earlier: Some of the most valuable learnings came from abandoned approaches, but I only documented them late in the process.

Accessibility & responsive considerations

Accessibility

  • Color coding supplemented with icons/labels (not color-only)
  • Progressive disclosure respects screen reader flow
  • High contrast ratios for Source Chips and status indicators
  • Keyboard-navigable expandable sections

Responsive Design

  • Chat interface scales for tablet/desktop
  • Collapsible panels for smaller viewports
  • Touch-friendly action buttons and chips
  • Mobile-first approach for future iteration

This project taught me that designing AI interfaces isn't about showing what the AI can do — it's about showing what the AI did, and giving users the confidence to act on it.
