Advanced Evals - Evals for RAG
A worked example of Precision@K, Recall@K, and MRR using Google AI Overviews.
You are a product manager at Google.
You just shipped AI Overviews.
The feature that puts an AI-generated answer right at the top of search results.
A user types: “Why does my iPhone battery drain fast after the iOS 26 update?”
Your system does two things.
First, it retrieves five web pages from Google’s index that it thinks are relevant.
Then, it feeds those pages to Gemini and generates a summary answer.
The answer looks clean. The formatting is right. Gemini’s language is fluent. Your VP sees it in a demo and says ship it.
But here is the question nobody in the room asked.
Were those five retrieved pages actually the right ones?
Because if your retrieval pulled garbage, Gemini just summarised garbage.
This is the problem every team building RAG systems runs into. And almost nobody evaluates it correctly.
Why Evaluating RAG Is Different From Evaluating LLMs
Traditional LLM evals test whether the model’s output is good. Did it answer correctly? Was the tone right? Did it hallucinate?
RAG evals test something upstream. They test whether the retrieval system fed the right inputs to the model in the first place.
A RAG pipeline has two components.
The retrieval layer that selects documents.
And the generation layer that synthesises an answer from those documents.
These are two separate systems. They fail in different ways. They need to be evaluated separately.
Most teams skip the retrieval evaluation entirely. They look at the final generated answer and if it is good, they assume the whole pipeline works.
That is a mistake. Because sometimes the model gets lucky. It generates a reasonable answer even from mediocre sources. And sometimes the retrieval is perfect, but the model fumbles the generation.
RAG evals separate these two failure modes. They tell you exactly where the pipeline broke.
And the retrieval layer? That is your job as a PM to get right. Because retrieval quality is a product decision.
—> How many documents to retrieve?
—> Which embedding model to use.
—> What similarity threshold to set.
These are all choices that show up in your PRD, not in a prompt.
The RAG Evaluation
Let us go back to Google. You are evaluating AI Overviews for the query: “Why does my iPhone battery drain fast after the iOS 26 update?”
Your retrieval system pulls five documents. Here is what it returned, in the exact order it ranked them:
Position 1: Apple Support page on iPhone battery health settings.
Position 2: A CNET article titled “Best Android Phones With Long Battery Life in 2025.”
Position 3: A Reddit thread from r/iPhone where users share iOS 26 battery drain fixes.
Position 4: A MacRumors article covering iOS 26 release notes and known battery bugs.
Position 5: An Amazon product listing for an Anker battery case.
Now, you need a ground truth. You need to know which documents in your entire corpus were actually relevant to this query.
Your human evaluators (or your golden dataset) say there are exactly four relevant documents in the whole index for this query:
Relevant Doc A: The Apple Support page on battery health settings.
Relevant Doc B: The Reddit thread with iOS 26 battery drain fixes.
Relevant Doc C: The MacRumors article on iOS 26 release notes and battery bugs.
Relevant Doc D: An Apple Developer Forum post about background app refresh causing battery drain in iOS 26.
So here is the picture. Your system retrieved five documents. Three of them are relevant (positions 1, 3, and 4). Two are irrelevant (positions 2 and 5). And one relevant document (the Developer Forum post) was not retrieved at all.
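If it helps to see that picture as data, here is a minimal sketch. The short document IDs are hypothetical labels for the pages described above, not real identifiers from any index.

```python
# Ranked retrieval output for the example query (position 1 first).
# The IDs are hypothetical labels for the pages described in the text.
retrieved = [
    "apple_support_battery_health",   # position 1
    "cnet_android_battery_life",      # position 2
    "reddit_ios26_battery_fixes",     # position 3
    "macrumors_ios26_release_notes",  # position 4
    "amazon_anker_battery_case",      # position 5
]

# Ground truth: every relevant document in the corpus for this query.
relevant = {
    "apple_support_battery_health",
    "reddit_ios26_battery_fixes",
    "macrumors_ios26_release_notes",
    "apple_dev_forum_background_refresh",
}

# 1-based positions of retrieved documents that are relevant.
hit_positions = [i for i, doc in enumerate(retrieved, start=1) if doc in relevant]
# 1-based positions of retrieved documents that are not relevant.
noise_positions = [i for i, doc in enumerate(retrieved, start=1) if doc not in relevant]
# Relevant documents the retriever never returned.
missed = sorted(relevant - set(retrieved))

print(hit_positions)    # [1, 3, 4]
print(noise_positions)  # [2, 5]
print(missed)           # ['apple_dev_forum_background_refresh']
```

Everything that follows is just counting over these two structures: the ranked list and the ground-truth set.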
Let us now measure exactly how good or bad this retrieval was.
Precision@K in RAG: Are You Retrieving Junk?
Precision answers a simple question. Out of everything you retrieved, how much of it was actually useful?
The formula is:
Precision@K = (Number of relevant documents in the top K results) / K
Let us calculate it at different values of K.
Precision@1.
You look at only the top result.
Position 1 is the Apple Support page. That is relevant.
Precision@1 = 1/1 = 1.0
Perfect. Your top result is a hit.
Precision@3.
You look at the top three results.
Position 1: Apple Support page. — Relevant.
Position 2: CNET Android article. — Not relevant.
Position 3: Reddit iOS 26 thread. — Relevant.
Precision@3 = 2/3 = 0.67
Two out of three were useful. That CNET Android article diluted the quality.
Precision@5.
You look at all five results.
Position 1: Relevant.
Position 2: Not relevant.
Position 3: Relevant.
Position 4: Relevant.
Position 5: Not relevant.
Precision@5 = 3/5 = 0.60
Three out of five. 60%. That means 40% of what you fed to Gemini was noise.
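The same three calculations can be sketched in a few lines of Python. The function name `precision_at_k` and the document IDs are illustrative assumptions, not part of any particular library.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k


# Hypothetical IDs for the five retrieved pages, in ranked order.
retrieved = ["apple_support", "cnet_android", "reddit_thread", "macrumors", "anker_listing"]
# Hypothetical IDs for the four relevant pages in the corpus.
relevant = {"apple_support", "reddit_thread", "macrumors", "dev_forum_post"}

for k in (1, 3, 5):
    print(f"Precision@{k} = {precision_at_k(retrieved, relevant, k):.2f}")
# Precision@1 = 1.00
# Precision@3 = 0.67
# Precision@5 = 0.60
```

Note that this sketch always divides by K, which matches the formula above. If your retriever can return fewer than K documents, you need to decide whether to divide by K or by the number actually returned.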
Here is what this metric tells you as a PM.
A precision of 0.60 at K=5 means your context window is 40% garbage. Gemini has to work harder to ignore the Android article and the Anker battery case listing. Every irrelevant document increases the chance of a confused, diluted, or hallucinated answer.
If your precision is dropping, look at your embedding model and your similarity threshold. A threshold that is too loose lets in documents that are only tangentially related to the query.
Precision is a purity metric. It tells you whether your retrieval has a noise problem.
Recall@K in RAG: Are You Missing Important Documents?
Recall asks the opposite question. Out of everything that should have been retrieved, how much did you actually find?
The formula is:
Recall@K = (Number of relevant documents in the top K results) / (Total number of relevant documents in the corpus)
We said there are four relevant documents total. Let us calculate.
Recall@1
You retrieved one document. It is relevant.
Recall@1 = 1/4 = 0.25
You found one out of four relevant documents, 25%. You are missing 75% of the useful information.
Recall@3
The top three results contain two relevant documents (positions 1 and 3).
Recall@3 = 2/4 = 0.50
You have found half the relevant information. Better. But the user is still missing context about the iOS 26 release notes and the developer forum post.
Recall@5
The top five results contain three relevant documents (positions 1, 3, and 4).
Recall@5 = 3/4 = 0.75
75%. You captured most of the relevant information. But that fourth document, the Developer Forum post about background app refresh, never made it in.
And that missing document? It might have been the most important one. It contains the actual technical fix.
A user reading the AI Overview gets generic advice on battery health settings but misses the specific step to disable background app refresh in iOS 26. That is a coverage gap. And it is invisible if you only look at Precision.
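Here is a minimal sketch of the Recall@K calculation, plus the list of relevant documents that were missed. The function names and document IDs are illustrative assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def missed_at_k(retrieved, relevant, k):
    """Relevant documents that did not make it into the top-k results."""
    return sorted(relevant - set(retrieved[:k]))


# Hypothetical IDs for the five retrieved pages, in ranked order.
retrieved = ["apple_support", "cnet_android", "reddit_thread", "macrumors", "anker_listing"]
# Hypothetical IDs for the four relevant pages in the corpus.
relevant = {"apple_support", "reddit_thread", "macrumors", "dev_forum_post"}

for k in (1, 3, 5):
    print(f"Recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
# Recall@1 = 0.25
# Recall@3 = 0.50
# Recall@5 = 0.75

print(missed_at_k(retrieved, relevant, 5))  # ['dev_forum_post']
```

Printing the missed documents is often more useful than the score itself. It tells you which content your users never saw.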
Here is what Recall tells you as a PM.
Low recall means your users are getting incomplete answers. They are not seeing important perspectives.
In a search product, low recall is how you lose trust. The user tries your AI answer; it does not solve their problem. They scroll past it to the blue links, and eventually, they stop reading AI Overviews entirely.
If your recall is low, you need to retrieve more documents (increase K). Or you need a better embedding model that captures semantic similarity more broadly.
But notice the tension. Increasing K improves recall but can hurt precision.
You pull in more documents, and some of them will be junk.
This is the fundamental tradeoff you manage as a PM. And it is why you need both metrics, not just one.
The Precision-Recall Tradeoff in RAG Systems
Let us put the numbers side by side.
At K=1: Precision is 1.00, Recall is 0.25.
At K=3: Precision is 0.67, Recall is 0.50.
At K=5: Precision is 0.60, Recall is 0.75.
See the pattern? As K goes from 1 to 3 to 5, precision drops and recall rises. You are pulling in more documents, which means you find more relevant ones (recall can only stay flat or go up), but you also let in more noise (precision usually goes down).
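To see the tradeoff at every cutoff, not just K=1, 3, and 5, you can sweep K over the ranked list. This is a sketch with hypothetical document IDs and a hypothetical helper name, `sweep_k`.

```python
def sweep_k(retrieved, relevant):
    """Return (k, precision@k, recall@k) for every cutoff from 1 to len(retrieved)."""
    rows = []
    hits = 0
    for k, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        rows.append((k, round(hits / k, 2), round(hits / len(relevant), 2)))
    return rows


# Hypothetical IDs for the five retrieved pages, in ranked order.
retrieved = ["apple_support", "cnet_android", "reddit_thread", "macrumors", "anker_listing"]
# Hypothetical IDs for the four relevant pages in the corpus.
relevant = {"apple_support", "reddit_thread", "macrumors", "dev_forum_post"}

rows = sweep_k(retrieved, relevant)
for k, precision, recall in rows:
    print(f"K={k}  precision={precision:.2f}  recall={recall:.2f}")
# K=1  precision=1.00  recall=0.25
# K=2  precision=0.50  recall=0.25
# K=3  precision=0.67  recall=0.50
# K=4  precision=0.75  recall=0.75
# K=5  precision=0.60  recall=0.75
```

The sweep shows something the three-row summary hides. Recall never goes down as K grows, but precision can bounce: it climbs back to 0.75 at K=4 because position 4 is a relevant document. For this one query, K=4 gives the same recall as K=5 with less noise.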
This is not a math problem. This is a product problem.
If you are building a medical information feature, you want high recall. Missing a relevant safety warning is unacceptable. You tolerate some noise in exchange for completeness.
If you are building a customer support chatbot where context window tokens are expensive and latency matters, you want high precision. Every irrelevant document wastes tokens and slows response time.
If you are building AI Overviews at Google, you need both to be high. A wrong source embarrasses you publicly. A missing source makes users lose trust. Your job is to find the K that maximises both, and to invest in a retrieval model that pushes the tradeoff curve outward.
This is a classic PM decision. It lives in your PRD. Not in a prompt engineering doc.
Mean Reciprocal Rank (MRR): How Fast Does the User Find What They Need?
Precision and Recall measure quantity. How many relevant documents did you retrieve versus how many exist?
MRR measures something different. It measures speed. How quickly does the first relevant document appear in your ranked results?
MRR stands for Mean Reciprocal Rank. Let us break it down from first principles.
First, Reciprocal Rank.
For a single query, the Reciprocal Rank is 1 divided by the position of the first relevant document.
Go back to our query: “Why does my iPhone battery drain fast after the iOS 26 update?”
The first relevant document is at position 1 (the Apple Support page).
Reciprocal Rank = 1/1 = 1.0
Perfect. The user’s first result was relevant. No scrolling needed.
But MRR is a system-level metric. It averages the Reciprocal Rank across multiple queries. Because one query hitting position 1 does not mean your system is good. You need to see the pattern.
Let us say you are evaluating AI Overviews across three queries.
Query 1: “Why does my iPhone battery drain fast after the iOS 26 update?” Your system retrieves five documents. The first relevant one is at position 1. Reciprocal Rank = 1/1 = 1.0
Query 2: “How to enable dark mode on MacBook Air.” Your system retrieves five documents. The results are: a Windows dark mode guide at position 1, an unrelated MacBook keyboard shortcut article at position 2, and an Apple Support page on macOS dark mode settings at position 3. The first relevant document is at position 3. Reciprocal Rank = 1/3 = 0.33
Query 3: “Is Apple Vision Pro compatible with prescription lenses?” Your system retrieves five documents. Position 1 is a generic VR headset comparison. Position 2 is Apple’s official Vision Pro page, mentioning Zeiss optical inserts. Relevant. Reciprocal Rank = 1/2 = 0.50
Now you calculate MRR:
MRR = (1.0 + 0.33 + 0.50) / 3 = 1.83 / 3 = 0.61
Your MRR is 0.61.
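The calculation above can be sketched as follows. Each query is represented as a list of relevance flags in ranked order. The flags stop at the first relevant result where the text does not describe the later positions, since those positions do not change the Reciprocal Rank. Scoring a query with no relevant result as 0 is a common convention and an assumption here.

```python
def reciprocal_rank(relevance_flags):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of per-query relevance flags."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(flags) for flags in queries) / len(queries)


queries = [
    [True, False, True, True, False],  # Query 1: first relevant at position 1
    [False, False, True],              # Query 2: first relevant at position 3
    [False, True],                     # Query 3: first relevant at position 2
]

mrr = mean_reciprocal_rank(queries)
print(f"MRR = {mrr:.2f}")  # MRR = 0.61
```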
What does this mean?
An MRR of 1.0 means every single query gets its first relevant document at position 1.
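An MRR of 0.5 is what you get when the first relevant document sits at position 2 for every query. Other mixes can produce the same score, because MRR averages reciprocals, not positions.
Your score of 0.61 says users are usually finding a relevant result within the first two positions, even though one query made them wait until position 3.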
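One caution when reading an MRR number: it is an average of reciprocals, so very different rank patterns can land on the same score. A quick sketch makes this concrete. The helper `mrr_from_ranks` is hypothetical, and treating a query with no relevant result as a score of 0 is an assumed convention.

```python
def mrr_from_ranks(first_relevant_ranks):
    """MRR from the 1-based rank of the first relevant result per query.

    None means no relevant result was retrieved; it is scored as 0 here,
    which is an assumed (though common) convention.
    """
    scores = [0.0 if rank is None else 1.0 / rank for rank in first_relevant_ranks]
    return sum(scores) / len(scores)


# Every query finds its first relevant result at position 2.
all_at_two = mrr_from_ranks([2, 2, 2, 2])

# Half the queries hit at position 1, half find nothing relevant at all.
half_perfect_half_empty = mrr_from_ranks([1, 1, None, None])

# The three queries from the worked example: ranks 1, 3, and 2.
worked_example = mrr_from_ranks([1, 3, 2])
mean_rank = sum([1, 3, 2]) / 3

print(all_at_two)                 # 0.5
print(half_perfect_half_empty)    # 0.5
print(round(worked_example, 2))   # 0.61
print(mean_rank)                  # 2.0
```

Two systems with an MRR of 0.5 can feel very different to users. That is why it is worth looking at the distribution of ranks behind the average, not just the single number.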
Here is why MRR matters as a product metric.
In a RAG system, the order of retrieved documents affects the generated answer. Many LLMs do not weight every part of the context equally, and documents that appear first often get more attention.
This is called positional bias.
If your first relevant document is buried at position 3 and positions 1 and 2 are noise, the model might give more weight to the noise.
In a search product, position matters even more directly. Users skim from top to bottom. If the first result is irrelevant, many users bounce. Every position of delay costs you engagement.
MRR captures this. It rewards systems that put relevant results first. Not just systems that retrieve relevant results somewhere in the list.
Precision@K, Recall@K, and MRR Together: A Complete RAG Evaluation
Each metric tells you something different about your retrieval system.
Precision@K tells you: Are you retrieving junk? Is your context window polluted?
Recall@K tells you: Are you missing important documents? Is your coverage complete?
MRR tells you: Are the good results ranked at the top? Is your order right?
You need all three. Here is why.
A system can have high precision and low recall.
It retrieves only two documents, both relevant. Great purity. But it missed eight other relevant documents. The user gets a narrow, incomplete answer.
A system can have high recall and low precision.
It retrieves fifty documents and finds all the relevant ones. Great coverage. But it also pulled in forty irrelevant documents that confuse the model and waste tokens.
A system can have good precision and recall but bad MRR.
It retrieves five documents; four are relevant. But the one irrelevant document sits at position 1. The model anchors on it. The answer starts wrong.
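These three failure patterns can be put into numbers with a short sketch. The first two scenarios use the counts given above (ten relevant documents in the corpus). For the third, the total number of relevant documents is not stated, so the value of four used here is purely an assumption for illustration.

```python
def precision_recall(flags, total_relevant):
    """Precision and recall for a full ranked list of relevance flags."""
    hits = sum(flags)
    return hits / len(flags), hits / total_relevant


def reciprocal_rank(flags):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


# Scenario 1: two documents retrieved, both relevant, ten relevant in the corpus.
narrow = precision_recall([True, True], total_relevant=10)

# Scenario 2: fifty retrieved, all ten relevant ones found, forty irrelevant.
# The ranking order is not given in the text; it does not affect these two metrics.
broad = precision_recall([True] * 10 + [False] * 40, total_relevant=10)

# Scenario 3: five retrieved, four relevant, the irrelevant one at position 1.
# ASSUMPTION: four relevant documents exist in the corpus for this query.
misordered_flags = [False, True, True, True, True]
misordered = precision_recall(misordered_flags, total_relevant=4)
misordered_rr = reciprocal_rank(misordered_flags)

print(narrow)         # (1.0, 0.2)
print(broad)          # (0.2, 1.0)
print(misordered)     # (0.8, 1.0)
print(misordered_rr)  # 0.5
```

Each scenario looks healthy on at least one metric and broken on another, which is why no single number is enough.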
The PM’s job is to optimise all three simultaneously. That means choosing the right K, selecting the right embedding model, tuning the similarity threshold, and potentially reranking results after initial retrieval.
Why Most RAG Teams Skip Retrieval Evaluation
Here is what I see happen repeatedly.
A team builds a RAG pipeline. They use an off-the-shelf embedding model. They set K to 5 because someone saw it in a tutorial. They evaluate only the final generated answer. The answers look good in a demo. They ship.
Three months later, users complain. The AI answers are kind of right, but missing the point.
The team starts debugging the LLM. They try better prompts. They switch models. They add guardrails. Nothing helps consistently.
The problem was never the LLM. The problem was the retrieval. Nobody measured Precision. Nobody measured Recall. Nobody checked MRR. Nobody even built a golden dataset of relevant documents for their queries.
The retrieval layer determines the quality ceiling of your entire RAG system. If retrieval is broken, no amount of prompt engineering or model upgrades will fix the output consistently.
Evaluate the retrieval first. Fix it first. Then evaluate the generation.
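A retrieval evaluation does not need to be elaborate to be useful. Here is a minimal sketch of a harness, assuming a golden dataset that maps each query to its set of relevant document IDs and a `retrieve` function that returns a ranked list of IDs. The names `evaluate_retrieval` and `fake_retrieve`, and all the IDs, are hypothetical. In practice, `retrieve` would call your real retrieval layer.

```python
from statistics import mean


def evaluate_retrieval(golden, retrieve, k):
    """Average Precision@k, Recall@k, and MRR over a golden dataset.

    golden:   dict mapping query -> non-empty set of relevant document IDs
    retrieve: callable taking a query and returning a ranked list of IDs
    """
    per_query = []
    for query, relevant in golden.items():
        top_k = retrieve(query)[:k]
        hits = [doc in relevant for doc in top_k]
        first_hit = next((i for i, hit in enumerate(hits, start=1) if hit), None)
        per_query.append({
            "precision": sum(hits) / k,
            "recall": sum(hits) / len(relevant),
            # Queries with no relevant result score 0 (an assumed convention).
            "reciprocal_rank": 1.0 / first_hit if first_hit else 0.0,
        })
    metric_names = ("precision", "recall", "reciprocal_rank")
    return {name: mean(row[name] for row in per_query) for name in metric_names}


# Golden dataset with the one query worked through in this article.
golden = {
    "iphone battery drain ios 26": {
        "apple_support", "reddit_thread", "macrumors", "dev_forum_post",
    },
}


def fake_retrieve(query):
    """Stand-in for a real retriever: returns the ranked list from the example."""
    return ["apple_support", "cnet_android", "reddit_thread", "macrumors", "anker_listing"]


report = evaluate_retrieval(golden, fake_retrieve, k=5)
print(report)
```

With a single query, the averages are just that query's scores: precision 0.6, recall 0.75, and a reciprocal rank of 1.0. The value comes when you add more queries to the golden dataset and rerun the same harness every time you change K, the embedding model, or the similarity threshold.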
About Author
Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking.




The 'VP sees it in a demo and says ship it' moment is exactly where most AI feature decisions go wrong - the generation step looks polished so everyone assumes the retrieval is fine too. RAG quality problems are almost always retrieval problems wearing a generation costume.