Designing Transparency into Agentic AI
PR→PO Copilot is an AI-assisted procurement prototype where an LLM-powered copilot guides requisition creation, while deterministic tools and rules handle lookups, routing, and audit logging. The challenge: how do you make AI decisions visible, auditable, and trustworthy in enterprise workflows?
Design Exploration

4 AI agents mapped · 4 transparency patterns · Full working prototype
This is a spending management copilot that transforms how enterprises handle procurement requests. The system combines LLM-powered natural language understanding with deterministic business rules — the AI interprets what users want to buy, while rule-based agents handle catalog lookups, supplier validation, approval routing, and audit logging. This separation ensures every decision is traceable and compliant.
Three user roles, one unified experience
Requesters: Describe what they need in plain language — "I need a laptop for the new hire starting Monday" — and the copilot handles catalog matching, supplier selection, and form completion automatically.
Approvers: See AI-generated summaries with risk signals, budget impact, and policy compliance status at a glance. Approve, reject, or request more information without switching between ERP screens.
Buyers: Convert approved PRs into POs with copilot suggestions for consolidation, alternate suppliers, and contract optimization. Override AI recommendations with mandatory reason capture for continuous learning.
What I designed:
- Agent orchestration architecture (LLM boundaries, tool separation)
- Transparency pattern system (Source Chips, Agent Actions, Rule IDs)
- Information architecture and conversation flow design
- Trust measurement framework and KPIs
AI-assisted execution: The working prototype was built with Claude Code, allowing rapid iteration. I directed the design decisions; AI accelerated the implementation.
This design exploration followed a diverge-converge pattern, balancing rapid prototyping with structured validation.
Week 1-2
Research & Framing
- Competitive analysis of AI procurement tools
- Enterprise AI trust pattern research
- Defined core design questions
Week 3-4
Architecture Design
- Agent orchestration model
- Service blueprint mapping
- LLM boundary definition
Week 5-8
Prototyping & Iteration
- AI-assisted UI development
- Transparency pattern exploration
- 3 major design iterations
Ongoing
Documentation
- Pattern library creation
- Case study writing
- Measurement framework
Collaboration
While this is a solo design exploration, I simulated a cross-functional process: I wore multiple hats as product thinker (defining scope and priorities), domain researcher (procurement workflow analysis), and technical architect (API constraints and AI capabilities). In a production setting, this work would involve collaboration with procurement SMEs for domain validation, engineering leads for feasibility review, and compliance teams for audit requirements.
Key design decisions & trade-offs
Why separate LLM from deterministic tools?
Considered: End-to-end LLM with function calling for everything.
Chose: LLM for interpretation only; rule-based agents for decisions.
Rationale: Enterprise auditors need reproducible decision paths. "The AI decided" isn't acceptable for procurement compliance.
Why collapsed-by-default for Agent Actions?
Considered: Always-visible action log (like terminal output).
Chose: Progressive disclosure — collapsed by default, expandable on demand.
Rationale: Roughly 80% of users trust the summary; the remaining 20% want to verify. Don't punish the majority with information overload.
Why Source Chips instead of footnotes?
Considered: Traditional citations at bottom of messages.
Chose: Inline colored chips next to each data point.
Rationale: In high-frequency workflows, glanceability beats thoroughness. Users scan, not read.
Each decision was documented with alternatives considered — making it easier to revisit choices as requirements evolve.
Before designing the UI, I mapped out how different agents coordinate in this workflow. The principle: "Make the agent's reasoning available on demand."
Interpret intent · Generate summary · Ask clarifying questions
Catalog lookup · Totals calculation · Vendor validation · Policy checks
Routing decisions · Rule IDs · Thresholds
Confidence flags · Missing info · Anomaly detection
Agent responsibilities
Copilot (LLM)
Interprets user intent, generates human-readable summaries, and asks clarifying questions when input is ambiguous. Never makes decisions.
Tool Agents
Deterministic functions: catalog lookup, price totals, vendor validation. Each call logged with inputs/outputs for audit.
Rules Engine
Evaluates business policies and returns routing decisions with explicit rule IDs (e.g., AMOUNT-001). No ML uncertainty.
Risk Signal Agent
Flags low-confidence matches, missing required fields, and unusual patterns. Triggers human review when thresholds exceeded.
This layered architecture ensures the LLM handles interpretation while deterministic systems handle decisions—making the workflow auditable and enterprise-safe.
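To make the "no ML uncertainty" claim concrete, here is a minimal sketch of how a deterministic rules engine with explicit rule IDs could work. The rule shapes, field names, and thresholds below are illustrative assumptions, not the prototype's actual implementation; only the rule IDs AMOUNT-001 and VENDOR-001 come from the case study itself.

```typescript
// Hypothetical rule shape: every rule carries an explicit ID so each
// routing decision can be traced back to a documented policy.
interface PurchaseRequest {
  totalAmount: number;
  vendorPreferred: boolean;
}

interface Rule {
  id: string; // e.g. "AMOUNT-001"
  description: string;
  applies: (pr: PurchaseRequest) => boolean;
  requiredApprover: string;
}

const rules: Rule[] = [
  {
    id: "AMOUNT-001",
    description: "Orders over $5,000 require Finance approval",
    applies: (pr) => pr.totalAmount > 5000,
    requiredApprover: "Finance",
  },
  {
    id: "VENDOR-001",
    description: "Non-preferred vendors require Procurement review",
    applies: (pr) => !pr.vendorPreferred,
    requiredApprover: "Procurement",
  },
];

// Deterministic evaluation: the same input always yields the same
// routing, and every approver in the path is justified by a rule ID.
function route(pr: PurchaseRequest): { ruleId: string; approver: string }[] {
  return rules
    .filter((r) => r.applies(pr))
    .map((r) => ({ ruleId: r.id, approver: r.requiredApprover }));
}
```

The key property is that `route` is a pure function over the PR and the rule set: an auditor can replay any historical decision by re-running it with the same inputs.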
System flowchart: Happy Path vs. Conflict Path
This diagram maps the complete workflow, showing how the system handles both successful requests and edge cases requiring human intervention.
Before prototyping, I mapped the end-to-end journey showing frontstage interactions (user touchpoints), backstage processes (agent actions), and support systems (rules engine, audit log).
This blueprint informed the 4-layer agent model and identified where AI transparency patterns (Agent Actions, Source Chips) should appear in the flow.
In most AI chat interfaces, the assistant's response looks like "LLM talking"—users can't tell:
- What the AI actually did (Did it look up the catalog? Check inventory?)
- Where the data came from (Is this from our approved vendor list?)
- Why decisions were made (Why does this need Finance approval?)
"The AI just says 'I found 2 items'—but I can't verify if it checked our preferred vendors."
— Common user concern (pattern)
"If this goes to audit, how do I prove the AI followed our procurement policies?"
— Common governance concern (pattern)
This is the "black box" problem of agentic AI: powerful capabilities hidden behind opaque responses.
Who feels the pain?
Requesters (Employees)
Fear of procurement forms — unsure which supplier to select, which cost center applies, or what "commodity code" means. They end up emailing procurement for help, creating bottlenecks and delays for simple requests.
Approvers (Managers)
Overwhelmed by dense forms with 20+ fields. Can't quickly identify if a request exceeds budget, violates policy, or uses a non-preferred vendor. Often approve without full understanding due to time pressure.
Buyers (Procurement)
Time wasted on preventable issues: rejected POs due to incomplete data, returns from wrong supplier selection, chasing requesters for missing information. Strategic sourcing takes a back seat to firefighting.
The AI Copilot fundamentally changes how procurement requests flow through the organization. Here's the comparison:
Before: Traditional Flow
After: AI-Assisted Flow
The transformation isn't just about speed—it's about shifting cognitive load from users to the system while maintaining human control over decisions.
Beyond single-turn interactions, I mapped out multi-turn flows and designed explicit fallback and escalation patterns for high-risk scenarios.
Simplified dialogue flow
High-risk scenarios & escalation patterns
For each scenario, I designed specific fallback and recovery flows:
Missing required data
Example: User doesn't specify cost center
Low confidence match
Example: Ambiguous supplier or SKU selection
Policy conflict / Exception needed
Example: Request exceeds budget or requires special vendor
Each pattern maps UI components I've already designed—validation checklist for missing data, source chips for low confidence, routing banner for policy conflicts—into a coherent conversational system.
The core design tension in agentic AI interfaces:
- Too little visibility → users don't trust the AI, hesitate to adopt, and escalate to humans unnecessarily
- Too much information → cognitive overload, slower workflows, and users start ignoring the transparency features entirely
After exploring several approaches (always-visible agent logs, expandable side panels, inline annotations), I settled on a guiding principle: progressive disclosure of agent actions. The key insight is that trust comes from knowing you can verify, not from being forced to verify every time. Show evidence when users need it (verification, audit, debugging); hide complexity when they don't (routine tasks, trusted flows).
Under each assistant message, I added a collapsible "Agent Actions" panel that shows:
- Tool name — what the agent did (e.g., "Lookup Catalog Items")
- Result summary — outcome (e.g., "Matched 2 items from Engineering standard kit")
- Evidence refs — data IDs for audit trail (e.g., SKU:LAPTOP-001)
Design decisions
- Collapsed by default — doesn't interrupt the conversation flow
- Count badge — "Agent Actions (3)" signals there's content without revealing it
- Status icons — green checkmarks for success, red X for errors
This panel surfaces deterministic tool execution logs (function calls + outputs) rather than free-form LLM explanations.
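A sketch of the record behind one row of the panel might look like the following. The field names are my assumptions; the point is that the panel renders logged tool executions with evidence references, not free-form LLM prose.

```typescript
// Hypothetical shape of one logged tool execution, as rendered in the
// Agent Actions panel. Each record is written by the tool layer, not
// generated by the LLM.
interface AgentAction {
  tool: string;           // e.g. "Lookup Catalog Items"
  status: "success" | "error";
  resultSummary: string;  // e.g. "Matched 2 items from Engineering standard kit"
  evidenceRefs: string[]; // e.g. ["SKU:LAPTOP-001"] — traceable to source systems
  timestamp: string;      // ISO 8601, for the audit trail
}

// Collapsed-by-default: only a count badge is visible until the user
// expands the panel.
function panelBadge(actions: AgentAction[]): string {
  return `Agent Actions (${actions.length})`;
}
```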
For line items, I designed source chips — color-coded badges that show where each piece of data came from. This addresses a common concern: "The AI says this supplier is good, but how do I know it actually checked our approved vendor list?"
- Catalog — item found in the company's approved catalog (with SKU reference)
- Standard kit — matches department standard equipment list (e.g., "Engineering standard laptop")
- Preferred — vendor is on the preferred/contracted supplier list (with contract ID)
- In stock — available in company warehouse (with location and quantity)
Why this matters
Instead of text like "Matches Engineering standard kit" (which could be LLM hallucination), chips provide:
- Scannable at a glance — color-coded chips are faster to scan than reading full sentences, allowing users to quickly assess compliance status for multiple line items
- Audit-ready evidence — hovering reveals reference IDs (SKU, contract number, vendor ID) that can be traced back to source systems for audit verification
- Honest attribution — users know exactly where each piece of data originated, building justified trust rather than blind faith in AI outputs
After submission, a routing banner appears at the top of the PR, showing the complete approval path and explaining why each approver is required. This addresses a common frustration: "Why does this simple request need Finance approval?"
Rule-based transparency
Each routing reason includes a rule ID (e.g., AMOUNT-001, VENDOR-001) that maps directly to documented business policies. This design choice has several important implications:
- Users understand "why" — instead of opaque routing, users see exactly which policy triggered each approval requirement (e.g., "AMOUNT-001: Orders over $5,000 require Finance approval")
- Auditability by design — every routing decision can be traced back to a specific rule, making compliance verification straightforward for auditors
- No "AI magic" — the system is fundamentally a rules-driven workflow surfaced through a conversational UI. The LLM helps users navigate the workflow, but deterministic rules make the actual decisions
Traditional form validation shows "40% complete" — but users don't know what's actually missing or what information they need to gather. This creates a frustrating guessing game that slows down submissions.
I designed a validation checklist that explicitly lists all required fields with their current status. The copilot can also help fill in missing information through conversation:
- ✓ At least one line item
- ○ Cost center — "Ask your department admin or check your previous PRs"
- ○ Need-by date — "When do you actually need this?"
- ○ Delivery location — "Office address or building code"
- ○ Business reason — "Brief justification for the purchase"
This transforms "Submit disabled" from a frustrating dead-end into actionable guidance. Users know exactly what they need to provide, and the copilot can help them find or fill in the information through natural conversation.
Beyond the chat interface, I designed four additional screens to cover the complete PR→PO lifecycle. Each screen is tailored to a specific user role and task, ensuring the right information is surfaced at the right moment.
Chat + Draft Entry (Requester)
Split-panel layout: left side for natural language conversation with the copilot, right side shows the real-time PR draft with validation checklist, policy compliance hints, and remaining budget. Users can chat or directly edit fields — both paths sync automatically.
Manager Approval Card (Approver)
AI-generated summary card with risk tags (budget impact, policy flags, unusual patterns), key facts at a glance (requester, total amount, need-by date), and clear CTAs: Approve / Reject / Request Info. Designed for mobile-first approval workflows.
Buyer PO Workbench (Procurement)
Batch processing view for procurement professionals. Select multiple approved PRs, receive copilot suggestions for order consolidation, alternate suppliers with better pricing, or contract-compliant substitutions. Override AI recommendations with mandatory reason capture.
Audit Timeline (Compliance)
Complete audit trail: every action logged with timestamps, user IDs, and AI involvement markers. Shows who did what, when, and what AI assistance was used. Searchable by rule ID, user, date range, or action type. Exportable for compliance reporting.
Workflow Admin (Backoffice)
Canvas-based workflow builder for configuring agent orchestration. Administrators can drag-and-drop agent nodes, connect them with conditional logic, set confidence thresholds for human escalation, and configure business rules without code changes.
When building AI prototypes, it's tempting to hardcode impressive results. I set constraints for myself:
Hardcode rules, not results
The routing logic checks actual business rules (amount thresholds, supplier status), not a pre-determined path. This makes the prototype honest about what's deterministic vs. what would need real AI.
Mark assumptions clearly
Where I used simulated data, I added "Prototype rule" badges. This prevents stakeholders from overestimating what the demo can do.
Show evidence, not claims
Agent Actions shows actual function execution, not LLM-generated descriptions. The difference matters for trust and audit.
UX decisions for enterprise adoption
Reduce "form fear"
Use conversational guidance instead of overwhelming forms. Instead of presenting 20+ fields at once, the copilot asks targeted questions and auto-fills where possible. Users see only decision-relevant information at each step, with the full form available for power users.
Decisions happen in-context
Approvers see budget impact, risk signals, and policy flags on the same card — no switching between ERP screens to gather context. AI-generated summaries highlight what matters, with full details expandable on demand.
Clear AI boundaries
The LLM interprets user intent and generates human-readable summaries; deterministic rules engines make decisions with explicit rule IDs. This separation makes the system auditable, reproducible, and reduces the risk of LLM hallucination affecting critical business outcomes.
To enable clients and internal teams to adopt these patterns, I documented decision criteria and templates that can be used in discovery workshops, pitch decks, and implementation guides.
When to use LLM vs deterministic
- LLM: Intent parsing, summarization, natural language generation
- Deterministic: ID lookups, calculations, policy enforcement
Confidence thresholds
- >90%: Auto-proceed with evidence
- 70–90%: Show options, require selection
- <70%: Force user confirmation
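The three-band policy above can be sketched as a single dispatch function. The band boundaries come from the playbook; the function and type names are my assumptions.

```typescript
// Confidence-band dispatch for a single AI suggestion, following the
// playbook's three thresholds.
type Disposition =
  | "auto-proceed"      // >90%: proceed, attaching evidence
  | "show-options"      // 70–90%: present options, require a selection
  | "confirm-required"; // <70%: force explicit user confirmation

function dispositionFor(confidence: number): Disposition {
  if (confidence > 0.9) return "auto-proceed";
  if (confidence >= 0.7) return "show-options";
  return "confirm-required";
}
```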
Standard escalation triggers
- Policy rule conflict detected
- Multiple low-confidence matches
- User explicitly requests exception
- Amount exceeds auto-approval threshold
Transparency UI patterns
- Agent Actions Panel (collapsible)
- Source Chips (color-coded provenance)
- Rule IDs in routing banners
- Validation checklists (not %)
Sample prompt patterns
"I found {count} matching items. To proceed, I need:
• {missing_field_1}
• {missing_field_2}
You can select from the options below or type your answer."
// Policy conflict template
"This request triggers rule {rule_id}: {rule_description}.
You can:
1. Modify the request to comply
2. Request an exception (requires additional approval)"
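The `{placeholder}` syntax in these templates can be resolved with a small interpolation helper. This sketch is mine, not part of the prototype; unresolved placeholders are left intact so missing variables surface visibly rather than silently.

```typescript
// Fill {placeholder} slots in a prompt template from a variable map.
// Unknown placeholders are preserved as-is for easy debugging.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match, key) => vars[key] ?? match);
}

const msg = fillTemplate(
  "This request triggers rule {rule_id}: {rule_description}.",
  {
    rule_id: "AMOUNT-001",
    rule_description: "Orders over $5,000 require Finance approval",
  }
);
```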
Responsible AI checklist
Note: These are hypothetical projections based on industry benchmarks, not measured data from a production deployment. The purpose is to illustrate the types of outcomes this design aims to enable.
- PR→PO cycle time: 5 days → 2 days
- Rejection rate: 30% → 10%
- Maverick spending: ~20% reduction
How these estimates are derived
- Cycle time reduction: Conversational UI + auto-filled fields reduces form friction; AI-surfaced risk signals enable faster approvals.
- Rejection rate: Validation checklist catches missing fields before submission; source chips ensure correct supplier/SKU selection.
- Maverick spending: Industry data suggests tail spend can represent up to 20% of procurement costs. Policy enforcement at creation time reduces off-contract purchases.
AI systems fail. The design question isn't if but how gracefully. I mapped out failure scenarios and designed explicit fallback patterns for each:
| Scenario | Detection trigger | UI response | Fallback |
|---|---|---|---|
| Price or inventory data outdated (>30 days) | Timestamp check on vendor data | "Price may have changed" warning chip | Require re-quote before PO generation |
| Multiple rules match with conflicting outcomes | Rules engine returns >1 blocking rule | Show both policies with rule IDs | Escalate to compliance officer for resolution |
| No approved vendor found for item category | Empty results from vendor lookup tool | "No preferred vendors" message + suggestions | Offer "Expand search" or manual vendor entry |
| LLM uncertainty exceeds threshold | Intent classification confidence <70% | Don't auto-fill; show "I'm not sure" prompt | Ask clarifying question or offer options |
The principle: when the system doesn't know, it should say so — and provide a clear path forward rather than guessing.
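The stale-data fallback is simple enough to sketch directly. The 30-day window comes from the scenario above; the function names and the exact warning copy are my assumptions.

```typescript
// Flag vendor pricing data older than a fixed window and require a
// re-quote before PO generation, per the stale-data fallback.
const STALE_AFTER_DAYS = 30;

function isStale(lastUpdatedIso: string, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - new Date(lastUpdatedIso).getTime();
  return ageMs > STALE_AFTER_DAYS * 24 * 60 * 60 * 1000;
}

// Returns a warning chip label when data is stale, or null when fresh.
function priceWarning(lastUpdatedIso: string): string | null {
  return isStale(lastUpdatedIso)
    ? "Price may have changed: re-quote required before PO generation"
    : null;
}
```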
In agentic workflows, humans need control without friction. I designed specific mechanisms to balance AI efficiency with human oversight:
Override mechanism design
- Buyer Override: Change the AI-suggested vendor with mandatory reason selection: Better price, Existing relationship, Faster delivery, Other
- Evidence Capture: Override reason + original suggestion stored in the audit log, enabling pattern detection across users
- Learning Signal: High override rate for a vendor → flag for catalog review; human corrections inform system improvement
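The override record and the learning signal can be sketched together. The constrained reason list comes from the design above; the record fields and rate computation are illustrative assumptions.

```typescript
// Override audit record: the reason is mandatory and constrained to a
// fixed list, so overrides can be aggregated for pattern detection.
type OverrideReason =
  | "Better price"
  | "Existing relationship"
  | "Faster delivery"
  | "Other";

interface OverrideEvent {
  suggestedVendor: string;
  chosenVendor: string;
  reason: OverrideReason;
  buyerId: string;
  timestamp: string;
}

// Learning signal: fraction of suggestions for a vendor that buyers
// overrode. A high rate flags the vendor's catalog entry for review.
function overrideRate(events: OverrideEvent[], vendor: string): number {
  const suggested = events.filter((e) => e.suggestedVendor === vendor);
  if (suggested.length === 0) return 0;
  const overridden = suggested.filter((e) => e.chosenVendor !== vendor);
  return overridden.length / suggested.length;
}
```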
When humans must intervene
The goal: humans can override at any point, but are required to intervene only when the system explicitly escalates a high-risk scenario.
To validate that transparency patterns actually build trust, I defined a measurement framework with leading indicators (behavior) and lagging outcomes (business impact).
North Star Metric: PO Approval Rate without Buyer Override (target: 90%+)
Trust metrics
| Metric | Baseline | Target | Why it matters |
|---|---|---|---|
| Evidence Coverage | ~60% | 95%+ | Every line item has source attribution |
| Override Rate | ~30% | <10% | Low override = AI suggestions are trusted |
| Source Chip Click Rate | n/a | 40%+ | Users verify sources = transparency works |
| Time to Approve | 15 min | <5 min | Trust + clarity = faster decisions |
| PO Return Rate | ~20% | <5% | Validation catches errors before submission |
Event tracking plan
To measure these, I defined specific events to instrument:
Success threshold: If evidence coverage reaches 95%+ and override rate drops below 10%, we can conclude that transparency patterns are building justified trust — not blind trust.
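The instrumented event list itself isn't reproduced above, so the following is a hypothetical payload shape consistent with the metrics table (source chip clicks, Agent Actions expands, overrides, approvals). All names here are assumptions.

```typescript
// Hypothetical tagged-union event schema for trust instrumentation.
type TrustEvent =
  | { type: "source_chip_click"; lineItemId: string; chipKind: string }
  | { type: "agent_actions_expand"; messageId: string }
  | { type: "buyer_override"; prId: string; reason: string }
  | { type: "po_approved"; prId: string; overridden: boolean };

// Source Chip Click Rate: chip clicks per session, per the trust
// metrics table (target 40%+).
function sourceChipClickRate(events: TrustEvent[], sessions: number): number {
  if (sessions === 0) return 0;
  const clicks = events.filter((e) => e.type === "source_chip_click").length;
  return clicks / sessions;
}
```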
To continuously improve conversational AI experiences, I defined metrics that track both task success and user trust signals.
KPIs for prototype evaluation
| Metric | What it measures | Signal |
|---|---|---|
| Task success rate | PR submitted → PO generated | Higher = better |
| Correction loop rate | Clarifying exchanges per session | Monitor trend |
| Escalation rate | Human handoffs + reason breakdown | Investigate high |
| Policy conflict rate | Top 5 rules triggered | Inform policy review |
| Time-to-complete | Intent → Submit duration | Lower = better |
| User trust signals | Actions panel expand rate, source chip hover rate | Qualitative insight |
Learning from conversation logs
Structured logs enable continuous improvement:
- Top intents: What do users ask for most? Are prompts optimized for these?
- Failure points: Where do sessions drop off? Which fields cause most clarifying questions?
- Prompt iteration: A/B test response templates based on correction loop data
- Policy friction: Which rules trigger exceptions most often? Flag for policy review
Even in a prototype, designing for measurability creates a foundation for production iteration.
Key insights
Progressive disclosure is key for AI transparency
Users don't need to see every tool call—but they need to know they can see it. The collapsed-by-default pattern respects attention while enabling verification.
Color coding beats text for scanability
Source chips communicate data provenance faster than sentences. In high-frequency workflows, visual differentiation reduces cognitive load.
Rule IDs make AI auditable
Connecting every AI decision to a numbered business rule transforms "the AI decided" into "rule AMOUNT-001 triggered." This is the difference between magic and accountability.
Challenges encountered
Balancing transparency with overwhelm
Early iterations showed too much — every API call, every rule check. Users felt bombarded. The breakthrough was realizing transparency is about access to information, not forcing information. Progressive disclosure solved this.
Defining the LLM boundary
Initial designs let the LLM do more — including supplier selection. But every "AI decision" created an audit problem. The hardest part was convincing myself to restrict the AI's role, even though it could technically do more.
Designing without real users
As a design exploration, I couldn't validate with actual procurement teams. I compensated by studying published enterprise AI research and designing explicit measurement frameworks for future validation.
If I did this again
- Start with user research: Even informal interviews with procurement professionals would have grounded assumptions earlier.
- Build a simpler first version: I over-invested in the prototype UI before validating core patterns. A lower-fidelity test would have been more efficient.
- Document failures earlier: Some of the most valuable learnings came from abandoned approaches, but I only documented them late in the process.
Accessibility & responsive considerations
Accessibility
- Color coding supplemented with icons/labels (not color-only)
- Progressive disclosure respects screen reader flow
- High contrast ratios for Source Chips and status indicators
- Keyboard-navigable expandable sections
Responsive Design
- Chat interface scales for tablet/desktop
- Collapsible panels for smaller viewports
- Touch-friendly action buttons and chips
- Mobile-first approach for future iteration
This project taught me that designing AI interfaces isn't about showing what the AI can do — it's about showing what the AI did, and giving users the confidence to act on it.