Advanced Evals - Traces in AI Evals
How to Debug AI Systems That Think in Steps
You are a product manager at Amazon.
You just shipped Rufus. The AI shopping assistant that lives inside the Amazon app.
A user types: I am looking for running shoes for flat feet under 5000 rupees with good cushioning.
Your system does not just call an LLM and return a response. It runs a chain of operations.
First, it classifies the user’s intent. Is this a product search? A comparison request? A return query?
Then, it extracts structured attributes from the query. Category: running shoes. Foot type: flat feet. Budget: under 5000. Feature: cushioning.
Then, it calls the product search API with those attributes and retrieves 20 candidate products.
Then, it applies a reranking model to sort those 20 products by relevance to the original query.
Then, it feeds the top 5 products and the original query to an LLM, which generates a conversational response with recommendations.
Finally, it applies a safety filter to check for hallucinated claims. Did the LLM say a shoe has orthopaedic certification when the product listing never mentioned it?
Six places where something can go wrong.
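The chain above can be sketched as a sequence of functions, where each step's output feeds the next. Every function name, data shape, and return value here is an illustrative assumption, not Rufus's actual API:

```python
# Hypothetical sketch of a six-step shopping-assistant pipeline.
# All names and data shapes are illustrative assumptions.

def classify_intent(query: str) -> str:
    # Step 1: route the query (product_search, comparison, return_query, ...)
    return "product_search"

def extract_attributes(query: str) -> dict:
    # Step 2: pull structured attributes out of free text
    return {"category": "running_shoes", "foot_type": "flat_feet",
            "max_price": 5000, "feature": "cushioning"}

def search_products(attrs: dict) -> list:
    # Step 3: retrieve candidate products from the catalogue
    return [{"id": f"A{i}", "price": 3000 + i * 100} for i in range(1, 21)]

def rerank(products: list, query: str) -> list:
    # Step 4: sort candidates by relevance, keep the top 5
    return products[:5]

def generate_response(top_products: list, query: str) -> str:
    # Step 5: an LLM turns products into a conversational answer
    return "Here are some options: " + ", ".join(p["id"] for p in top_products)

def safety_filter(response: str) -> str:
    # Step 6: check the draft answer before it reaches the user
    return response

def answer(query: str) -> str:
    intent = classify_intent(query)
    attrs = extract_attributes(query)
    candidates = search_products(attrs)
    top = rerank(candidates, query)
    draft = generate_response(top, query)
    return safety_filter(draft)
```

Any of these six functions can misbehave, and a bug in an early one silently corrupts everything downstream.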
The user sees the final response. It recommends three shoes. One of them is a basketball shoe. The cushioning claim on another is fabricated. The third recommendation is fine, but it costs 7200 rupees, which is above the stated budget.
Your VP asks what happened.
You look at the final output. It looks broken. But you have no idea which step broke it.
—> Was it the intent classifier?
—> The attribute extractor?
—> The search API?
—> The reranker?
—> The LLM?
—> The safety filter?
This is where traces come in.
Why Traditional Evals Cannot Debug Multi-Step AI Systems
In the previous article, we covered RAG evals. Precision, Recall, MRR. Those metrics evaluate one specific component: the retrieval layer. They tell you whether your system pulled the right documents.
But modern AI systems are not single-component systems. They are pipelines. Chains. Agents. Multiple models calling multiple tools in sequence, where the output of one step becomes the input of the next.
Traditional evals look at the final output and ask: Was this answer good? When the answer is bad, they cannot tell you which step broke it.
Traces solve this problem. A trace is a complete record of everything your AI system did to produce a single response. Every step. Every input. Every output. Every decision, in order.
If the final answer is wrong, the trace tells you exactly where the pipeline broke.
What Is a Trace?
The concept of a trace is borrowed from distributed systems. In traditional software engineering, when a user sends a request to a web application, that request might travel through an API gateway, a load balancer, a backend service, a database, and a cache. A distributed trace records each hop, so engineers can see the full journey of a single request.
AI traces do the same thing, but for AI pipelines.
A trace represents the full lifecycle of a single user interaction with your AI system. From the moment the user sends a query to the moment the system returns a response.
A trace is made up of spans.
A span is one unit of work inside the trace. One step. One operation. One model call. One API request. One tool invocation.
Every span records four things.
What went in. The input to that step.
What came out. The output of that step.
How long it took. The latency.
What type of operation it was. An LLM call, a retrieval step, a tool call, a function execution.
Spans are nested. A parent span can contain child spans. This creates a tree structure that shows exactly how your system executed.
This is the anatomy of a trace. A tree of spans, each recording the inputs, outputs, and timing of a single step.
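In code, this anatomy can be modelled with two small classes: a span carrying the four recorded fields plus a list of child spans, and a trace wrapping the root of the tree. This is a minimal sketch, not any particular tracing library's schema:

```python
from dataclasses import dataclass, field
from typing import Any

# Minimal span/trace model: the four fields described above,
# plus child spans to form the tree. Field names are assumptions.

@dataclass
class Span:
    name: str
    op_type: str          # "llm", "retrieval", "tool", "function"
    input: Any
    output: Any
    latency_ms: float
    children: list["Span"] = field(default_factory=list)

@dataclass
class Trace:
    trace_id: str
    root: Span

    def total_latency(self) -> float:
        # Sum latency over the whole span tree (depth-first walk).
        def walk(span: Span) -> float:
            return span.latency_ms + sum(walk(c) for c in span.children)
        return walk(self.root)

root = Span("pipeline", "function", "user query", "final response", 0.0, children=[
    Span("intent_classifier", "llm", "user query", "product_search", 45),
    Span("attribute_extractor", "llm", "user query", {"max_price": 5000}, 120),
])
trace = Trace("t-001", root)
```

Real tracing systems (OpenTelemetry-style) add IDs, timestamps, and status codes, but the tree-of-spans shape is the same.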
Walking Through a Real Trace
Let us go back to Rufus. The user asked:
I am looking for running shoes for flat feet under 5000 rupees with good cushioning.
Here is the trace your system recorded. Six spans, in order.
Span 1: Intent Classifier
Input: I am looking for running shoes for flat feet under 5000 rupees with good cushioning.
Output: intent = product_search
Latency: 45ms
Model: Internal classifier v3
This span worked correctly. The intent is product search. No issues here.
Span 2: Attribute Extractor
Input: I am looking for running shoes for flat feet under 5000 rupees with good cushioning.
Output: {category: "running_shoes", foot_type: "flat_feet", max_price: 5000, feature: "cushioning"}
Latency: 120ms
Model: GPT-4o mini
This span also worked correctly. All four attributes were extracted accurately from the query.
Span 3: Product Search API
Input: {category: "running_shoes", foot_type: "flat_feet", max_price: 5000, feature: "cushioning"}
Output: 20 products returned. Product IDs: [A1, A2, A3, ... A20]
Latency: 230ms
Here is the first problem. When you inspect the 20 products returned, you find that product A7 is a basketball shoe. The search API matched on cushioning but ignored the running_shoes category constraint. The API treated the category as a soft filter instead of a hard filter.
You also find that product A14 costs 7200 rupees. The max_price filter did not work as a strict cutoff.
Two bugs. Both in the search API span. Neither is visible in the final output without the trace.
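One common guard against this class of bug is a post-filter that enforces category and price as hard constraints on whatever the search API returns, before anything reaches the reranker. A minimal sketch, assuming hypothetical product fields:

```python
# Enforce category and max_price as hard constraints on search results.
# The dict field names ("category", "price") are illustrative assumptions.

def enforce_hard_constraints(products, attrs):
    return [
        p for p in products
        if p["category"] == attrs["category"] and p["price"] <= attrs["max_price"]
    ]

candidates = [
    {"id": "A3",  "category": "running_shoes",    "price": 3499},
    {"id": "A7",  "category": "basketball_shoes", "price": 4299},  # wrong category
    {"id": "A14", "category": "running_shoes",    "price": 7200},  # over budget
]
attrs = {"category": "running_shoes", "max_price": 5000}
kept = enforce_hard_constraints(candidates, attrs)
# Only A3 survives; A7 and A14 are dropped before the reranker sees them.
```

Whether the fix belongs in the search API itself or in a wrapper like this is a design choice; the point is that hard constraints should be enforced somewhere before ranking.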
Span 4: Reranker
Input: 20 products from Span 3, original query.
Output: Top 5 ranked products: [A3, A7, A12, A14, A1].
Latency: 180ms
Model: Cross-encoder reranker v2
The reranker promoted A7 (the basketball shoe) to position 2 and A14 (the overpriced shoe) to position 4. The reranker matched on cushioning and boosted both products because they had strong cushioning scores.
The reranker did its job given the inputs it received. But those inputs were already contaminated by the search API.
This is a cascading failure. The search API lets in bad products. The reranker, operating on bad inputs, made the problem worse by promoting them.
Span 5: Response Generator (LLM)
Input: Top 5 products + original query
Output: Based on your requirements, here are three great options for flat feet with excellent cushioning: 1. Nike Revolution 6 (Rs 3,499) with orthopaedic-grade cushioning technology... 2. Adidas CourtSmash (Rs 4,299) with premium arch support... 3. ASICS Gel-Kayano (Rs 7,199) with superior gel cushioning...
Latency: 1,200ms
Model: Claude Sonnet
Tokens: 340 input, 180 output
Multiple problems surfaced here.
The LLM included A14 (ASICS at Rs 7,199) despite the user asking for under 5000.
The LLM fabricated orthopaedic-grade cushioning technology for the Nike shoe.
That phrase does not exist in the product listing.
And the LLM recommended the Adidas CourtSmash, which is the basketball shoe (A7) that the search API should have filtered out.
Span 6: Safety Filter
Input: Generated response.
Output: Response passed. No safety violations detected.
Latency: 85ms
The safety filter checked for toxicity, PII, and explicit content. It did not check for factual accuracy against product listings. It did not catch the hallucinated orthopaedic-grade claim. It did not catch the budget violation.
The safety filter passed a response that contained two factual errors.
What the Trace Reveals
Without the trace, all you know is that the final answer was bad. With the trace, you know exactly what happened.
The search API had two bugs. The category filter was soft instead of hard. The price filter allowed products above the stated maximum.
The reranker amplified the problem. It promoted bad products because it optimised for feature match without respecting hard constraints.
The LLM hallucinated a product claim. It added orthopaedic-grade cushioning technology, which does not exist in any source data.
The LLM ignored a constraint. It recommended a product above the user’s budget.
The safety filter was incomplete. It checked for toxicity but not for factual grounding or constraint adherence.
Six distinct issues. Four different components. Two cascading failures. One root cause (the search API) propagated through the entire pipeline.
You cannot find any of this by evaluating only the final output.
Span-Level Evals: Evaluating Each Step Independently
This is where traces and evals converge.
Traditional evals evaluate the system end-to-end. You compare the final output against a ground truth. That tells you whether the system worked, but not where it failed.
Span-level evals evaluate each span independently. You attach an evaluation metric to each span in the trace. Each step gets its own scorecard.
Let us apply this to our Rufus trace.
Eval for Span 1 (Intent Classifier)
Metric: Classification accuracy.
Ground truth: product_search
System output: product_search
Score: 1.0. Correct.
Eval for Span 2 (Attribute Extractor)
Metric: Attribute extraction F1.
Ground truth: {category: "running_shoes", foot_type: "flat_feet", max_price: 5000, feature: "cushioning"}
System output: Same.
Score: 1.0. All attributes correctly extracted.
Eval for Span 3 (Search API)
Metric 1: Category precision. What fraction of returned products match the requested category? 18 out of 20 products are running shoes. 2 are not. Score: 0.90.
Metric 2: Price constraint adherence. What fraction of returned products are under the stated max price? 17 out of 20 are under 5000. 3 are above. Score: 0.85.
Both scores reveal a leaky filter. Neither score would surface from an end-to-end eval.
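Both metrics are simple fractions over the returned products and can be computed directly from the span's output. A sketch, with product counts mirroring the trace above (18 of 20 in the right category, 17 of 20 within budget):

```python
# Span-level checks for the search span: what fraction of returned
# products satisfy each constraint. Field names are assumptions.

def category_precision(products, category):
    return sum(p["category"] == category for p in products) / len(products)

def price_adherence(products, max_price):
    return sum(p["price"] <= max_price for p in products) / len(products)

# 20 products: 18 in the right category, 17 within budget, as in the trace.
products = (
    [{"category": "running_shoes", "price": 3000}] * 15
    + [{"category": "running_shoes", "price": 7200}] * 3     # over budget
    + [{"category": "basketball_shoes", "price": 4000}] * 2  # wrong category
)
print(category_precision(products, "running_shoes"))  # 0.9
print(price_adherence(products, 5000))                # 0.85
```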
Eval for Span 4 (Reranker)
Metric: NDCG (Normalised Discounted Cumulative Gain). Did the reranker place the most relevant products at the top?
If we define relevance as products that match ALL stated criteria (running shoes, flat feet, under 5000, good cushioning), then positions 2 and 4 in the top 5 contain products that violate at least one constraint.
NDCG@5: 0.72.
The reranker is optimising for partial relevance. It matches on some attributes while ignoring others.
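NDCG can be computed from the ranked list plus per-position relevance judgments. The exact score depends entirely on how you define relevance, so the sketch below shows the mechanics rather than reproducing the 0.72 above; here relevance is binary, with A7 and A14 marked irrelevant:

```python
import math

# NDCG@k with binary relevance. The score depends on the relevance
# judgments you assign; these labels are illustrative assumptions.

def dcg(relevances):
    # Discounted cumulative gain: position i is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    top = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    return dcg(top) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Top 5 = [A3, A7, A12, A14, A1]; A7 and A14 violate a constraint.
rels = [1, 0, 1, 0, 1]
score = ndcg_at_k(rels, 5)
```

A perfectly ordered list scores 1.0; every constraint-violating product promoted into the top positions drags the score down.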
Eval for Span 5 (LLM Response)
Metric 1: Faithfulness. Does every claim in the response have a source in the input products? The orthopaedic-grade cushioning technology claim has no source. Faithfulness score: 0.67.
Metric 2: Constraint adherence. Does the response respect all user-stated constraints? One product exceeds the budget.
Score: 0.67 (2 out of 3 recommendations within budget).
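Constraint adherence is another fraction that can be scored programmatically from the span's output. A sketch, with prices taken from the generated response above and a hypothetical product structure:

```python
# Constraint-adherence check over the generated recommendations.
# The dict structure is an assumption; prices mirror the response above.

def constraint_adherence(recommended, max_price):
    within = [p for p in recommended if p["price"] <= max_price]
    return len(within) / len(recommended)

recommended = [
    {"name": "Nike Revolution 6", "price": 3499},
    {"name": "Adidas CourtSmash", "price": 4299},
    {"name": "ASICS Gel-Kayano", "price": 7199},  # over the stated budget
]
score = constraint_adherence(recommended, 5000)
print(round(score, 2))  # 0.67
```

Faithfulness is harder to score with a one-liner, since it requires comparing each claim in the response against the source product listings, typically with an LLM-as-a-judge.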
Eval for Span 6 (Safety Filter)
Metric: Hallucination detection rate. What fraction of factually unsupported claims were caught? The safety filter caught 0 out of 1 hallucinated claims. Score: 0.0 for factual grounding.
Now look at what you have.
A full diagnostic report. Each component was scored independently. You know that the intent classifier and attribute extractor are working perfectly. You know the search API has a filter leakage problem. You know the reranker needs constraint-aware scoring. You know the LLM has a faithfulness problem. You know the safety filter has a coverage gap.
This is an eval report built on traces. You cannot produce this without tracing your system.
The Trace-to-Eval Pipeline
Traces do not just help you debug individual failures. They create a flywheel for continuous improvement.
Here is how the pipeline works.
Step 1: Your system logs traces from production. Every user query generates a trace with all its spans.
Step 2: You sample traces. Not every trace needs evaluation. You pick a subset. Maybe 1 to 5 percent of production traffic. Maybe all traces where the user gave a thumbs down. Maybe all traces where the response latency exceeded a threshold.
Step 3: You run automated evals on the sampled traces. LLM-as-a-judge scores each span for faithfulness, relevance, constraint adherence, whatever metrics matter for your product. This is called online evaluation.
Step 4: Traces that score poorly get routed to a human review queue. Domain experts look at the trace, examine each span, and annotate where the system failed. These annotated traces become your golden dataset.
Step 5: You use the golden dataset for offline evaluation. Before shipping any change to any component, you run the new version against your golden dataset and compare span-level scores.
Step 6: The improved system goes to production. It generates better traces. Those traces get sampled and evaluated. The cycle repeats.
This is the trace-eval flywheel. Production traces become eval datasets. Eval datasets drive improvements. Improvements generate better traces. The system gets better every cycle.
Without traces, this flywheel does not exist. You cannot build a golden dataset if you do not know what each component did at each step.
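The sampling rules from Step 2 can be expressed as a small routing function over each production trace. The field names and thresholds below are illustrative assumptions, not a standard:

```python
import random

# Which production traces get routed into evaluation (Step 2's rules):
# always keep thumbs-downs and slow responses, sample the rest.
# Field names and thresholds are illustrative assumptions.

def should_evaluate(trace, sample_rate=0.02, latency_threshold_ms=3000):
    if trace.get("user_feedback") == "thumbs_down":
        return True                       # always review explicit negatives
    if trace.get("latency_ms", 0) > latency_threshold_ms:
        return True                       # always review slow responses
    return random.random() < sample_rate  # otherwise sample a small fraction

# A thumbs-down trace is always selected, regardless of latency.
selected = should_evaluate({"user_feedback": "thumbs_down", "latency_ms": 900})
```

Traces that pass this gate flow into automated evals (Step 3) and, if they score poorly, into the human review queue (Step 4).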
End-to-End Evals vs Span-Level Evals
There is a common mistake teams make after they discover span-level evals. They stop running end-to-end evals entirely.
This is wrong. You need both. Here is why.
Span-level evals catch component failures. They tell you which step broke. But they cannot catch emergent failures. Failures that only appear when components interact.
Consider this scenario. The intent classifier outputs “product_search” correctly. The attribute extractor outputs all four attributes correctly. The search API returns 20 relevant products. The reranker ranks them well. The LLM generates a fluent response. Every span passes its individual eval.
But the final response is still bad. The LLM picked three products that are all from the same brand. The user sees no variety. The response feels like a sponsored advertisement.
No individual span failed. The failure is emergent. It exists only in the interaction between the reranker (which promoted similar products) and the LLM (which did not add a diversity constraint).
End-to-end evals catch this. They evaluate the final output as a whole. Diversity, user satisfaction, and task completion.
The framework is simple.
Use span-level evals to catch component failures. Where did the pipeline break?
Use end-to-end evals to catch emergent failures. Does the full pipeline produce good outcomes even when every component looks fine individually?
Use traces to connect the two. When an end-to-end eval catches a failure, walk the trace to find the root cause.
If you want to go deeper on Advanced Evals (Cohen's Kappa, Matthews Correlation Coefficient), Evals for Agentic Architecture, AI Product Sense, AI Strategy, AI Pricing, AI Prototyping, Advanced Prompting, ML Systems, etc., check out my AI PM course (40+ Videos and 25+ Case Studies) [Certification Included]
Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details
About Author
I am Shailesh Sharma. I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass (Here), AI PM Resources