Work

Experience

Evatt AI - Legal Research Interface

Designing trust into AI for sceptical lawyers

AI Product Design · Design System · Chat Interface · Legal Tech

“Lawyers do not trust black boxes. When an AI gives you a legal answer that could win or lose a case, a spinner is not enough. I redesigned the entire experience around one principle: trust must be built before the answer appears.”

The Trust Problem in Legal AI

Most AI chat interfaces are built for speed. Ask a question, get an answer, move on. That model works for consumer products. It fails completely in a legal context.

Role: Lead Product Designer
Timeline: Rapid Prototype Sprint
Platform: Web App
Type: AI-Native Product

A senior partner at a law firm has spent 30 years building professional judgment. An AI tool that cannot show its work will be ignored, regardless of how accurate its output is. Worse, if a lawyer relies on an AI-generated legal position that turns out to be based on an amended statute or an overruled case, the consequences are not a bad recommendation. They are professional liability, client harm, and reputational damage that cannot be undone.

The brief was to design a chat interface for Evatt AI, an Australian legal research tool. The real design challenge was not the interface. It was trust architecture.

What legal AI looks like today

Most current legal AI tools follow the same pattern: a chat input, a response, and a list of sources appended at the bottom. The AI appears confident. The sources are often unchecked. The lawyer has no way to see how the answer was reached without manually verifying every citation.

For a junior associate working fast, this is acceptable. For a senior partner with a client's case on the line, it is unusable.

“The problem is not whether the AI is right. The problem is whether the lawyer can trust that it is right.”

Who I was designing for

Two users with fundamentally different relationships to AI trust. Their conflict shaped every decision in this interface.

From two people to three decisions

James and Priya represent opposite ends of the same spectrum. James will not use a tool he cannot explain to a client. Priya will not use a tool that slows her down. Designing for both simultaneously is the tension that shaped every decision in this interface.

James's scepticism revealed the first problem: AI thinking must be visible before the answer appears. Not a spinner. A transparent research process he can watch and evaluate in real time.

Priya's need for speed revealed the second problem: verification cannot require extra clicks. The evidence panel must always be present, always populated, always one eye movement away. She scans it fast. James reads it carefully. Neither compromises.

Together they revealed the third problem: the visual language must feel like the best law firm in Sydney, not a consumer app. A premium, restrained interface that communicates calm authority.

“Three trust problems. Three design decisions. Every pixel in this interface exists to answer one of them.”

85% built by AI. 100% directed by me.

This project is an honest demonstration of how I work in 2026. I did not use AI to generate a few layout suggestions. I used it to build the entire system layer.

AI handled:

Full primitive token generation across navy, neutral, green, amber, and red scales
Semantic token mapping from primitives to component-level aliases
Variable collection architecture in Figma, 66 variables total
Initial component scaffolding with auto-layout
Spacing scale generation on 4px base grid
Radius token sequence
Accessibility ratio calculations for all color pairings

I handled:

Design philosophy and visual language direction
All three core product decisions
Every semantic token override where AI got it wrong
Quality validation of every component against the design principles
WCAG compliance decisions and failure documentation
Typography system selection and rationale
The streaming state interaction model
Developer handoff rules and documentation

“AI is fast at generating systems. It is slow at understanding people. I used it for the former and kept the latter entirely for myself.”

Where I overrode the AI

One specific example. AI mapped the warning color directly to text tokens. Amber text on a white background fails WCAG AA at normal text sizes. I overrode the entire semantic layer for warning states. Amber is reserved for icon and border only. All text inside warning states uses slate-900. This distinction is invisible to a user who never encounters an accessibility failure. It is the difference between a system that is accessible and one that merely looks accessible.
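The amber-on-white failure can be checked with the WCAG 2.1 contrast formula. The sketch below is illustrative only: the hex values for amber and slate-900 are assumptions (the case study does not list the actual token values), and the function names are hypothetical.

```typescript
// WCAG 2.1 relative luminance and contrast ratio.
// ASSUMPTION: the hex values below are placeholders, not the project's real tokens.

// Convert one 8-bit sRGB channel to its linear-light value.
function channelToLinear(channel8bit: number): number {
  const c = channel8bit / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a "#RRGGBB" colour.
function relativeLuminance(hex: string): number {
  const value = parseInt(hex.replace("#", ""), 16);
  const r = (value >> 16) & 0xff;
  const g = (value >> 8) & 0xff;
  const b = value & 0xff;
  return (
    0.2126 * channelToLinear(r) +
    0.7152 * channelToLinear(g) +
    0.0722 * channelToLinear(b)
  );
}

// Contrast ratio between two colours, always >= 1.
function contrastRatio(foreground: string, background: string): number {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  const lighter = Math.max(l1, l2);
  const darker = Math.min(l1, l2);
  return (lighter + 0.05) / (darker + 0.05);
}

const AA_NORMAL_TEXT = 4.5; // WCAG AA minimum for normal-size text

const ASSUMED_AMBER = "#F59E0B"; // placeholder amber
const ASSUMED_SLATE_900 = "#0F172A"; // placeholder slate-900
const WHITE = "#FFFFFF";

const amberOnWhitePasses = contrastRatio(ASSUMED_AMBER, WHITE) >= AA_NORMAL_TEXT;
const slateOnWhitePasses = contrastRatio(ASSUMED_SLATE_900, WHITE) >= AA_NORMAL_TEXT;
```

With these placeholder values, amber text on white falls below the 4.5:1 threshold while slate-900 clears it, which is the pattern behind the override: amber carries the signal on icons and borders, slate-900 carries the words.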

Claude Code & MCP connector

Automated the entire Figma design system build: tokens, variables, semantic layers, and component scaffolding. Used Figma MCP to create and wire 66 variables programmatically, eliminating hours of manual token setup.

Figma

Primary design canvas for all UI work. Used to build components, screens, and the complete design system with auto-layout and variable collections.

ChatGPT

Used for research synthesis, writing assistance, and generating initial content frameworks. Helped draft persona narratives, design rationale, and iteration notes throughout the project.

Building the foundation first

Before designing a single screen I built the token architecture. Every color, spacing value, radius, and typography decision encoded as a variable before any component was touched. This means the system scales without breaking consistency and a developer never needs to ask a designer what color a border should be.
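A minimal sketch of what a two-layer token structure like this can look like in code. Everything here is an assumption for illustration: the primitive names, the hex values, and the `resolveToken` helper are hypothetical, and only `Color/status/warning` is a token name that appears in this case study.

```typescript
// Primitive layer: raw values only.
// ASSUMPTION: names and hex codes are placeholders, not the real Evatt tokens.
const primitives = {
  "navy/900": "#0B1F3A",
  "neutral/0": "#FFFFFF",
  "neutral/200": "#E2E8F0",
  "amber/500": "#F59E0B",
} as const;

type PrimitiveName = keyof typeof primitives;

// Semantic layer: the aliases components consume.
// Each alias points at a primitive by name, never at a raw value.
const semantic: Record<string, PrimitiveName> = {
  "Color/surface/default": "neutral/0",
  "Color/border/default": "neutral/200",
  "Color/text/default": "navy/900",
  "Color/status/warning": "amber/500",
};

// Resolve a semantic alias to its concrete value.
function resolveToken(semanticName: string): string {
  const primitiveName = semantic[semanticName];
  if (primitiveName === undefined) {
    throw new Error(`Unknown semantic token: ${semanticName}`);
  }
  return primitives[primitiveName];
}

// A developer looks the border colour up instead of asking for it.
const borderColor = resolveToken("Color/border/default");
```

Because components reference only the semantic layer, changing a primitive value updates every alias that points at it without touching a component.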

Three metric callouts

66 Figma Variables - Primitive and semantic layers fully separated

4px Base Grid - All spacing values are multiples of 4

WCAG 2.1 Full Documentation - Every color pairing ratio calculated and documented

The accessibility decision I am most proud of

Most design systems document their passes. I documented the failures too.

Color/text/muted on any surface fails WCAG AA. Color/status/warning on white fails. These are not oversights. They are honest constraints with explicit usage rules: muted text is for placeholder copy only, never body content. Warning color is for icons and borders only, never standalone text.
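One way such documented failures could be made machine-checkable is sketched below. The two token names and their usage rules come from the paragraph above; the `PairingRule` shape, the `Usage` labels, and `isUsageAllowed` are hypothetical.

```typescript
// Where a colour token may be applied.
type Usage = "body-text" | "placeholder-text" | "icon" | "border";

// A documented contrast failure together with its explicit usage rule.
interface PairingRule {
  token: string;
  passesAAForText: boolean;
  allowedUses: Usage[];
}

const documentedFailures: PairingRule[] = [
  // Muted text: placeholder copy only, never body content.
  { token: "Color/text/muted", passesAAForText: false, allowedUses: ["placeholder-text"] },
  // Warning colour: icons and borders only, never standalone text.
  { token: "Color/status/warning", passesAAForText: false, allowedUses: ["icon", "border"] },
];

// Returns true when the token may be used in the given way.
// ASSUMPTION: tokens without a documented failure are treated as unrestricted.
function isUsageAllowed(token: string, usage: Usage): boolean {
  for (const rule of documentedFailures) {
    if (rule.token === token) {
      return rule.allowedUses.indexOf(usage) !== -1;
    }
  }
  return true;
}
```

A check like this could run in a lint step so that a warning-coloured body paragraph is caught before review rather than after release.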

A design system that hides its failures is a liability. One that documents them with clear rules is a tool engineers can trust.

Design System

Components

Three decisions that shaped everything

Each decision connects directly back to a trust problem identified in the personas. None of them are visual decisions. They are product decisions expressed through visual design.

01 - Making AI thinking visible

A spinner communicates one thing: wait. It tells you nothing about what is happening, how long it will take, or whether the system is working correctly. For a senior partner about to rely on AI-generated legal research, a spinner is a black box. And a black box destroys trust before the answer appears.

I looked at how other high-stakes domains handle process transparency. Operating theatres added observation windows not because families needed to watch surgery but because being able to watch made the process feel trustworthy. The same principle applies here.

The solution: 5-step visible research

1. Reading your query
2. Identifying jurisdiction and practice area
3. Searching case databases: AustLII, HCA, FCA
4. Cross-referencing and verifying: LexisNexis, legislation.gov.au, ASIC
5. Preparing memorandum

Each active step expands to show the exact databases being searched, legal terms identified as chips, and citations found in real time highlighted in green. James watches the AI work. Trust is built before the answer appears.
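The five-step streaming state can be modelled as a small state machine. This is a sketch under assumptions: the step labels and database names come from the list above, while the `ResearchStep` shape, the status names, and the `advance` function are hypothetical and not taken from the Evatt codebase.

```typescript
type StepStatus = "pending" | "active" | "done";

interface ResearchStep {
  label: string;
  sources: string[]; // databases shown when the step expands
  status: StepStatus;
}

// Initial state: the first step is active, the rest are waiting.
function createResearchSteps(): ResearchStep[] {
  return [
    { label: "Reading your query", sources: [], status: "active" },
    { label: "Identifying jurisdiction and practice area", sources: [], status: "pending" },
    { label: "Searching case databases", sources: ["AustLII", "HCA", "FCA"], status: "pending" },
    {
      label: "Cross-referencing and verifying",
      sources: ["LexisNexis", "legislation.gov.au", "ASIC"],
      status: "pending",
    },
    { label: "Preparing memorandum", sources: [], status: "pending" },
  ];
}

// Marks the active step as done and activates the next one.
// Returns a new array so the UI can re-render from immutable state.
function advance(steps: ResearchStep[]): ResearchStep[] {
  let activeIndex = -1;
  steps.forEach((step, i) => {
    if (activeIndex === -1 && step.status === "active") {
      activeIndex = i;
    }
  });
  return steps.map((step, i): ResearchStep => {
    if (i === activeIndex) {
      return { ...step, status: "done" };
    }
    if (activeIndex !== -1 && i === activeIndex + 1) {
      return { ...step, status: "active" };
    }
    return step;
  });
}
```

Exactly one step is active at a time, so the interface always has a single honest answer to "what is the AI doing right now?"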

Image: Streaming state screen, large.

What I rejected: A confidence percentage shown during research. Something like 87% confident. I rejected it because a number implies false precision. A lawyer seeing 87% confident will ask: confident about what, measured how? The step-by-step process answers a more useful question: what is the AI actually doing right now?

Success metric: How quickly does a lawyer click their first source after the answer appears? Target: under 30 seconds in week 1. Faster means the streaming state built trust before the answer arrived.

02 - Evidence before everything

Legal research is not complete when the answer appears. It is complete when the answer has been verified. Most AI interfaces append sources at the bottom of a response. This treats verification as optional. For lawyers, verification is the work.

The solution: right panel always open, never collapsed

Sources tab: Court decisions with three verification states. Accent green means verified against AustLII. Amber means the citation exists but needs review. Red means not found. The left border encodes trust state so a lawyer scans the panel and understands reliability without reading a word.

Statutes tab: Acts of Parliament with a currency badge showing Current or Amended. No competitor surfaces this visibly at the point of research. A lawyer relying on a section amended in 2022 without knowing it is exposed to professional liability. The currency badge exists specifically to prevent that.

History tab: Every question asked in this matter, with a one-line conclusion and a Re-run button. When new authority is handed down, a lawyer can re-run every question in a matter and see if conclusions have changed.
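A sketch of how the panel's trust states might be typed so that a source can never render without one. The three verification states and the Current/Amended badge come from the description above; the type names, the success and error token names, and both helper functions are hypothetical.

```typescript
// The three verification states shown on the Sources tab.
type VerificationState = "verified" | "needs-review" | "not-found";

interface SourceCard {
  citation: string;
  state: VerificationState;
}

// Left-border token per state: green, amber, red.
// ASSUMPTION: only Color/status/warning is a name from the case study.
const leftBorderToken: Record<VerificationState, string> = {
  verified: "Color/status/success",
  "needs-review": "Color/status/warning",
  "not-found": "Color/status/error",
};

// Tally of states, the same summary a lawyer gets by scanning the borders.
function countByState(cards: SourceCard[]): Record<VerificationState, number> {
  const counts: Record<VerificationState, number> = {
    verified: 0,
    "needs-review": 0,
    "not-found": 0,
  };
  cards.forEach((card) => {
    counts[card.state] += 1;
  });
  return counts;
}

// Currency badge shown on the Statutes tab.
type CurrencyBadge = "Current" | "Amended";

function currencyBadge(hasBeenAmended: boolean): CurrencyBadge {
  return hasBeenAmended ? "Amended" : "Current";
}
```

Typing the mapping as `Record<VerificationState, string>` means the compiler rejects a new state that has no border colour, which keeps the visual encoding complete.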

Image: Right panel showing all three tabs. Place all three states side by side if possible.

What I rejected: Making the right panel collapsible to give more reading space. I rejected it because collapsible means optional. If the evidence panel is optional, lawyers will close it. Evidence must be ambient, always present, never requiring a deliberate action to access.

Success metric: How many lawyers come back to open a past matter within 4 weeks? Target: 4 in 10 by week 4. More returns means the History tab is useful enough to come back to.

03 - Nordic restraint

Legal work is conservative, high-stakes, and document-heavy. The interface should feel like the best law firm in Sydney, not a consumer app. Premium but not flashy. Calm but not boring.

Three typefaces, three jobs

IBM Plex Sans for all UI text. Designed for enterprise software. Highly legible at small sizes. Signals technical credibility without feeling cold.

Libre Baskerville for display moments only. A serif with legal tradition. The empty state greeting "What's on your desk, Counsellor?" is the one moment of warmth in an otherwise restrained interface. It tells the lawyer this tool understands their world.

DM Mono for all citations. Monospace communicates precision and immutability. A citation in a proportional font looks like editorial copy. In DM Mono it looks like a reference number in a court document.

Message hierarchy without aggression

User messages are navy, right-aligned, bottom-right corner squared to 2px. AI responses are borderless prose with a 2px accent blue left rule. Identical font size throughout. Spatial position and the left rule encode the conversational role, not typography weight. This avoids the visual aggression of heavy bold AI responses dominating the conversation.

Image: Empty state screen, full width. Then full chat interface, full width.

What I rejected: A dark theme. The case for it is real: legal work happens late at night and a dark interface would reduce eye strain. I rejected it because standard dark themes in legal products feel like security software. They communicate surveillance and risk. The Nordic light palette communicates clarity, precision, and calm authority.

What I would explore instead of dark mode: a reading mode

I rejected a full dark theme for the primary interface because it sends the wrong signal for a trust-first legal product. But I acknowledge the tension. Legal work happens late at night. Long reading sessions under artificial light cause eye strain. A sceptical senior partner working at 11pm on a complex matter has a legitimate reason to want a darker environment.

The right answer is probably not a dark theme. It is a reading mode. A toggle that shifts the interface to a low-contrast, warm-toned reading surface specifically for long-form AI responses and document review, while keeping the navigation, status indicators, and verification states in the standard Nordic light palette where trust signals need maximum contrast.

This is meaningfully different from a dark theme. It is a context-aware surface switch, not a cosmetic preference.

Success metric: SUS (System Usability Scale) score target of 80 or above. Industry average is 68. Above 80 means genuinely easy to use.
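For reference, the standard SUS scoring procedure can be sketched as follows: ten items answered on a 1 to 5 scale, odd-numbered items contributing the response minus 1, even-numbered items contributing 5 minus the response, and the sum multiplied by 2.5. The function names and the `SUS_TARGET` constant are hypothetical.

```typescript
const SUS_TARGET = 80; // target from this case study

// Score one respondent's ten SUS answers (each 1 to 5) on a 0 to 100 scale.
function susScore(responses: number[]): number {
  if (responses.length !== 10) {
    throw new Error("SUS requires exactly 10 responses");
  }
  let total = 0;
  responses.forEach((response, index) => {
    if (response < 1 || response > 5) {
      throw new Error("Each response must be between 1 and 5");
    }
    // index 0, 2, 4... are the odd-numbered (positively worded) items
    const isOddItem = index % 2 === 0;
    total += isOddItem ? response - 1 : 5 - response;
  });
  return total * 2.5;
}

// Average score across respondents, compared against the target.
function meetsSusTarget(allResponses: number[][]): boolean {
  if (allResponses.length === 0) {
    return false;
  }
  let sum = 0;
  allResponses.forEach((responses) => {
    sum += susScore(responses);
  });
  return sum / allResponses.length >= SUS_TARGET;
}
```

The score is a usability index on a 0 to 100 scale, not a percentage, which is why the benchmark average of 68 matters more than the raw number.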

Three directions I rejected

Rejected Direction 1: Inline source verification instead of a right panel. Sources placed directly beneath each AI response in the chat worked well for a single exchange. It broke down completely across a multi-turn research session where sources became scattered and impossible to reference as a set. The right panel solves this by aggregating all sources from the entire session in one place.

Rejected Direction 2: A confidence percentage during streaming. Looked impressive in early mockups. Abandoned because a number like 87% implies precision that does not exist and invites a question the interface cannot answer: confident about what, measured how?

Rejected Direction 3: A dark theme. Made the product feel like security or surveillance software. Wrong register entirely for a legal research tool that needs to communicate calm authority.

Built to hand off from day one

Every component was built with auto-layout and published to the team library. The token system means a developer never needs to ask what color a border should be. They look it up in the token. This removes an entire class of design-engineering miscommunication.

Five handoff rules:

All spacing is a multiple of 4. If a value is not in the token scale it does not exist.
Border width is always 0.5px. Never 1px.
Corner radius follows the token scale: sm=4, md=8, lg=12, xl=16, full=999.
Status colors are never used for text. Icons and borders only. Text always uses slate-900.
Semantic tokens only in components. Primitives are never referenced directly.
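Rules this strict lend themselves to automated checks. The sketch below covers the first three rules only; the `ComponentSpec` shape and `checkHandoffRules` are hypothetical, and because the full spacing scale is not listed here, spacing is checked only for being a multiple of 4.

```typescript
// Radius scale from the handoff rules.
const RADIUS_SCALE: Record<string, number> = { sm: 4, md: 8, lg: 12, xl: 16, full: 999 };
const BORDER_WIDTH_PX = 0.5;

// ASSUMPTION: a simplified description of a component's measured values.
interface ComponentSpec {
  componentName: string;
  spacingPx: number[];
  radiusPx: number;
  borderWidthPx: number;
}

// Returns a list of rule violations; an empty list means the spec complies.
function checkHandoffRules(spec: ComponentSpec): string[] {
  const problems: string[] = [];

  // Rule 1: all spacing is a multiple of 4.
  spec.spacingPx.forEach((value) => {
    if (value % 4 !== 0) {
      problems.push(`${spec.componentName}: spacing ${value}px is not a multiple of 4`);
    }
  });

  // Rule 2: border width is always 0.5px.
  if (spec.borderWidthPx !== BORDER_WIDTH_PX) {
    problems.push(`${spec.componentName}: border width must be ${BORDER_WIDTH_PX}px`);
  }

  // Rule 3: corner radius must be a value from the token scale.
  const allowedRadii = Object.keys(RADIUS_SCALE).map((key) => RADIUS_SCALE[key]);
  if (allowedRadii.indexOf(spec.radiusPx) === -1) {
    problems.push(`${spec.componentName}: radius ${spec.radiusPx}px is not in the token scale`);
  }

  return problems;
}
```

Rules 4 and 5 concern which tokens a component references rather than its measurements, so they would need access to the token bindings and are left out of this sketch.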

How I would measure this

Under 30 seconds: time to first source click after the answer appears, in week 1. Faster means the streaming state built trust before the answer arrived.

4 in 10 lawyers: return to open a past matter within 4 weeks. If lawyers come back, the History tab is doing its job.

80+ SUS score: System Usability Scale. Industry average is 68. Above 80 means genuinely easy to use.
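The first two metrics could be computed from simple event data along these lines. This is a sketch under assumptions: the event shapes and function names are hypothetical, and using the median rather than the mean for time to first click is a choice made here, not something the targets specify.

```typescript
// ASSUMPTION: one record per answered question, timestamps in milliseconds.
interface AnswerSession {
  answerShownAtMs: number;
  firstSourceClickAtMs: number | null; // null if no source was clicked
}

// Median seconds from answer to first source click, ignoring sessions with no click.
function medianSecondsToFirstClick(sessions: AnswerSession[]): number | null {
  const seconds: number[] = [];
  sessions.forEach((session) => {
    if (session.firstSourceClickAtMs !== null) {
      seconds.push((session.firstSourceClickAtMs - session.answerShownAtMs) / 1000);
    }
  });
  if (seconds.length === 0) {
    return null;
  }
  const sorted = seconds.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Share of lawyers who reopened a past matter within 4 weeks.
function returnRate(returnedWithin4Weeks: boolean[]): number {
  if (returnedWithin4Weeks.length === 0) {
    return 0;
  }
  const returned = returnedWithin4Weeks.filter((didReturn) => didReturn).length;
  return returned / returnedWithin4Weeks.length;
}

const FIRST_CLICK_TARGET_SECONDS = 30; // "under 30 seconds"
const RETURN_RATE_TARGET = 0.4; // "4 in 10"
```

Sessions with no source click are excluded from the median here, so the share of sessions with no click at all is worth tracking alongside it.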

Closing reflection: what I would do next

The hardest design problem in AI is not making it look smart. It is making users feel safe enough to trust it. Evatt was an exercise in designing that trust from first principles, with no existing pattern library to borrow from and no competitor doing it well enough to learn from.

The problem I have not solved is mobile. The three-column layout works on desktop. On a smaller screen the right panel collapses and the evidence-first principle breaks. A solo practitioner working on an iPad in a court waiting room loses their evidence panel when they need it most.

If I were to continue this project, solving the mobile evidence experience is where I would start. Not as a responsive breakpoint. As a fundamentally different interaction model for a different context of use.

“Most designers make AI look smart. I was trying to make lawyers feel safe. Those are different problems with different solutions.”

Let's work together!

themansoorahmad95@gmail.com

Schedule a call with Mansoor A.
