AI Evals for RAG & Agentic Workflows
AI Evals in Detail
We are currently living in a golden age of software production. With Generative AI, the speed at which we can churn out software—and the capability to solve complex use cases like chatbots—is unprecedented.
However, there is a catch. While building an AI demo is easy, making it reliable is incredibly hard. We all know Large Language Models (LLMs) have knowledge cutoffs, but they are also nondeterministic. Ask an LLM “How many A’s are there in the word banana?” 100 times, and it might answer correctly 99 times and fail on the 100th.
If you are trying to move an AI application from a “vibes-based” MVP to a robust production system, you cannot rely on manual spot-checks. You need Evals.
This guide breaks down how to build evaluation frameworks for standard LLM apps, Retrieval-Augmented Generation (RAG) systems, and Agentic workflows.
Why “Unit Tests” Don’t Work for AI
For decades, software engineers have relied on unit tests. These work perfectly for deterministic systems—situations where every input has a single, defined output. For example, if a user enters a wrong password three times, the system must lock the account. This is binary.
AI systems are nondeterministic. A single input can have multiple valid outputs. If you ask an AI to summarize a news article about a “Gujarati test match,” valid summaries could range from “The test will have tea before lunch” to “Early sunsets prompted a schedule change.” Both are correct, but they look completely different.
Because we cannot write simple assertion tests, we use Evals (Evaluations)—structured tests that measure quality, reliability, and correctness across different scenarios.
The 3-Step Framework for Building Evals
To move away from checking outputs manually, you need a systematic approach.
Step 1: Error Analysis with Subject Matter Experts (SMEs)
You cannot automate what you do not understand. The first step is to have a Subject Matter Expert (SME) review real interactions between users and your AI.
The SME should provide detailed commentary on why a response was good or bad.
Good Handling: A user asks about a failed transaction. The AI pinpoints the transaction ID and gives an ETA.
Bad Handling: A user asks for stock recommendations. The AI recommends three specific stocks. This is a failure because, in this specific investment platform context, regulations might prohibit giving financial advice.
Step 2: Axial Coding (Finding the Patterns)
Reviewing logs indefinitely is not scalable. You should stop manual review once you stop making new discoveries—usually around 100 samples.
Next, apply Axial Coding. This is a qualitative method where you group raw observations into broader categories. You can actually use an LLM to do this for you. Feed the SME’s commentary to an LLM and ask it to categorize the failures into buckets (e.g., “Regulatory Non-Compliance,” “Operational Errors,” “Hallucination”).
Create a pivot table to see which categories cause the most failures. If “Regulatory Non-Compliance” is your biggest failure point, that is where you must focus your automated evals first.
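The pivot-table step can be sketched in a few lines of Python. The SME labels below are hypothetical examples standing in for the output of the axial-coding pass:

```python
from collections import Counter

# Hypothetical failure categories assigned during axial coding;
# in practice these come from SME commentary bucketed by an LLM.
labels = [
    "Regulatory Non-Compliance", "Hallucination", "Regulatory Non-Compliance",
    "Operational Errors", "Regulatory Non-Compliance", "Hallucination",
]

# Pivot: count failures per category, most frequent first.
pivot = Counter(labels).most_common()
for category, count in pivot:
    print(f"{category}: {count}")
```

The category at the top of this list is where your first automated eval should focus.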
Step 3: The “LLM-as-a-Judge”
Once you know what to test, you can build a custom Eval using an LLM as the judge.
You write a prompt that defines the judge’s persona and the specific failure framework. For example, if you are testing for “Handoff Deficiency,” you provide the judge with real examples of where the bot failed to hand off to a human.
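A judge prompt for one failure mode might look like the sketch below. The rubric, the example transcript, and the `build_judge_prompt` helper are all illustrative, not a prescribed template:

```python
# Sketch of an LLM-as-a-Judge prompt for a single failure mode
# ("Handoff Deficiency"). The wording and example are hypothetical.
JUDGE_PROMPT = """You are a strict QA reviewer for a support chatbot.

Failure mode under test: Handoff Deficiency. The bot should hand off
to a human when the user is frustrated or the request is out of scope,
but fails to do so.

Example failure (for calibration):
  User: "This is the third time I'm explaining this. Get me a person."
  Bot:  "I understand. Could you describe your issue again?"

Now review the transcript below. Answer with exactly PASS or FAIL,
then one sentence of justification.

Transcript:
{transcript}
"""

def build_judge_prompt(transcript: str) -> str:
    """Fill the judge template with a real conversation transcript."""
    return JUDGE_PROMPT.format(transcript=transcript)
```

Constraining the judge to a binary PASS/FAIL verdict plus a short justification makes its outputs easy to aggregate and audit.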
Calibrating the Judge: To ensure your AI judge is reliable, run it on a fraction of logs and compare its verdict against a human’s verdict using a 2x2 Matrix:
Top Left (Consensus Good): Both Human and AI agree the response was good.
Bottom Right (Consensus Bad): Both agree the response was bad.
Top Right (Judge Harshness): The AI flags good responses as bad (needs fixing).
Bottom Left (Judge Leniency): The AI lets poor responses pass (critical fix needed).
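The four cells above can be computed directly from paired verdicts. This is a minimal sketch, assuming each verdict is a boolean where `True` means the response was good; the sample verdicts are hypothetical:

```python
def calibration_matrix(human, judge):
    """Compare human vs. LLM-judge verdicts (True = response was good).

    Returns the four cells of the 2x2 calibration matrix:
    consensus_good, consensus_bad, judge_harsh, judge_lenient.
    """
    cells = {"consensus_good": 0, "consensus_bad": 0,
             "judge_harsh": 0, "judge_lenient": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["consensus_good"] += 1      # both say good
        elif not h and not j:
            cells["consensus_bad"] += 1       # both say bad
        elif h and not j:
            cells["judge_harsh"] += 1         # judge flags a good response
        else:
            cells["judge_lenient"] += 1       # judge passes a bad response
    return cells

# Hypothetical verdicts on five sampled logs
human = [True, True, False, True, False]
judge = [True, False, False, True, True]
print(calibration_matrix(human, judge))
# {'consensus_good': 2, 'consensus_bad': 1, 'judge_harsh': 1, 'judge_lenient': 1}
```

A high `judge_lenient` count is the most dangerous cell, since bad responses silently pass your eval.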
Evals for RAG Systems
If you are building a RAG (Retrieval Augmented Generation) system, a simple pass/fail eval isn’t enough because you won’t know where the error occurred.
Think of RAG like cooking a pasta dish:
The Retriever is the Shopper: Their job is to buy the ingredients.
The Generator is the Chef: Their job is to cook the meal using those ingredients.
If the Shopper brings the wrong ingredients, the Chef cannot make the right pasta. Similarly, if your Retriever fetches irrelevant documents, the Generator will fail. You must evaluate them separately.
The Metric: Recall@K
To evaluate the Retriever, we use Recall@K. This measures the fraction of all relevant documents that appear within the top K retrieved results.
Why not just set K to 100? You might be tempted to retrieve many documents to ensure you get the right answer. However, feeding an LLM too much information can increase hallucinations and lower accuracy.
The Sweet Spot: A common rule of thumb is to aim for roughly 80% recall at K = 5.
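Recall@K is simple to compute once you have a labeled set of relevant documents per query. A minimal sketch, with hypothetical document IDs:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = len(top_k & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical query: 2 of the 3 relevant docs show up in the top 5.
retrieved = ["d7", "d2", "d9", "d4", "d1", "d8"]  # ranked retriever output
relevant = {"d2", "d1", "d5"}                      # ground-truth labels
print(recall_at_k(retrieved, relevant, k=5))       # ≈ 0.67
```

Averaging this score over a labeled query set gives you the retriever-side metric to track against the ~80% target.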
Evals for Agentic Workflows
Agents are distinct because they execute a sequence of actions. Just knowing the agent failed to book a flight doesn’t tell you how to fix it.
The Metric: Transition Failure Matrix
To debug agents, you need a Transition Failure Matrix. This tracks the movement between states.
For example, your matrix might show that the agent successfully transitions from “Classify Intent” to “Call API,” but fails frequently between “Call API” and “Make Payment”.
This allows you to pinpoint “hot spots.”
Is the Fetch Flight Info step failing? It might be authentication errors or rate limits.
Is the Payment step failing? It might be missing KYC fields or data formatting issues.
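A transition failure matrix can be built by tallying outcomes per state pair. This sketch assumes a hypothetical trace format where each agent run is a list of `(from_state, to_state, succeeded)` tuples; your logging schema will differ:

```python
from collections import defaultdict

def transition_failure_matrix(traces):
    """Count successes and failures for each state-to-state transition.

    `traces` is a list of agent runs; each run is a list of
    (from_state, to_state, succeeded) tuples. Format is illustrative.
    """
    matrix = defaultdict(lambda: {"ok": 0, "fail": 0})
    for trace in traces:
        for src, dst, ok in trace:
            matrix[(src, dst)]["ok" if ok else "fail"] += 1
    return dict(matrix)

# Two hypothetical runs of a flight-booking agent
traces = [
    [("Classify Intent", "Call API", True), ("Call API", "Make Payment", False)],
    [("Classify Intent", "Call API", True), ("Call API", "Make Payment", True)],
]
for (src, dst), counts in transition_failure_matrix(traces).items():
    print(f"{src} -> {dst}: {counts}")
```

Transitions with high `fail` counts are the hot spots where you dig into root causes such as auth errors or missing fields.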
By implementing structured error analysis, axial coding, and architecture-specific metrics like Recall@K or Transition Failure Matrices, you can systematically improve your AI’s reliability and trust.
About Author
Shailesh Sharma | LinkedIn. I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. For more, check out my live cohort course, PM Interview Mastery Course, Cracking Strategy, and other resources.



