Most RAG evals grade the answer and never check the evidence, so a green dashboard can sit happily on top of a retrieval layer that already failed. If the right context never arrived, the generation scores you are proud of are theater. The only version that tells the truth measures the two halves apart: did retrieval bring back the right material, and then, separately, did the model actually use what it was handed.
Measuring those two halves apart is the part of RAG nobody demos and almost nobody writes down. It first mattered to me the day we moved our matcher from plain vector search to hybrid, when the dashboard said things had improved and I needed to know whether that was real or just different. The architecture piece in this series covers that hybrid search and the filtering around it. This one covers how you know it works, and how you keep it working as the corpus grows.
Two places it breaks, measured apart
A RAG system has two failure points, and they fail independently. Retrieval can fetch the wrong context, in which case the model never had a chance. Or retrieval can be perfect and the model still answers from its training, ignores what it was handed, or invents a detail that was never there. Score the system end to end with one number and you cannot tell which half broke, so you spend a day fixing the half that was fine. Measure the two sides separately, always.
The four numbers that matter
Four metrics cover most of it. On the retrieval side, context precision asks how much of what you fetched was actually relevant, and context recall asks how much of what you needed you actually got. On the generation side, faithfulness asks whether every claim in the answer traces back to the retrieved context, and answer relevancy asks whether the answer addressed the question at all. Faithfulness is the one that catches a confident hallucination, so it is the one I read first. In practice the retrieval pair comes down to one question: did the chunk you needed land in the top handful you actually retrieved? If it did not, no prompt and no larger model downstream can put it back. Protect retrieval recall before anything else.
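The retrieval pair can be sketched in a few lines of Python. This is a simplified, set-based version that assumes each golden question lists the ids of the chunks it needs; evaluation libraries may define these metrics differently, for example with rank weighting or a model deciding relevance. The function names and chunk ids are illustrative, not from any particular library.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of the retrieved chunks that were actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of the needed chunks that retrieval actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def hit_at_k(retrieved_ids, relevant_ids, k):
    """True if any needed chunk landed in the top k results."""
    relevant = set(relevant_ids)
    return any(chunk_id in relevant for chunk_id in retrieved_ids[:k])


# Made-up ids: retrieval returned four chunks, the question needed two.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c5"]

precision = context_precision(retrieved, relevant)  # 1 of 4 retrieved was needed
recall = context_recall(retrieved, relevant)        # 1 of 2 needed was retrieved
```

Here precision is 0.25 and recall is 0.5, and the missing chunk c5 is exactly the kind of loss no downstream prompt can repair.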
The four are the floor, not the ceiling. Two more from the 2026 toolkit earn their place on domain-heavy corpora: context entities recall, which checks that the specific names, numbers, and codes survived retrieval rather than just semantically similar prose, and noise sensitivity, which measures how often an irrelevant retrieved chunk pushes the model into a wrong claim. You do not need them on day one, but the day a legal or medical answer drops the one identifier that mattered, you will wish you had the metric that would have caught it.
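A rough sketch of the entities check, assuming the golden set lists the identifiers each answer depends on by hand. Real implementations usually extract entities with a model; the plain substring match here is a stand-in, and the clause and invoice identifiers are made up for illustration.

```python
def entities_recall(expected_entities, retrieved_chunks):
    """Share of expected identifiers that appear verbatim in the retrieved text."""
    if not expected_entities:
        return 1.0  # nothing specific was required
    haystack = " ".join(retrieved_chunks).casefold()
    found = [e for e in expected_entities if e.casefold() in haystack]
    return len(found) / len(expected_entities)


# Hypothetical retrieval result: the clause survived, the invoice number did not.
chunks = [
    "Payment terms are set out in Clause 14.3 of the agreement.",
    "Invoices are due within the agreed period.",
]
score = entities_recall(["Clause 14.3", "INV-2041"], chunks)
```

The second chunk is semantically about invoices, which is why an embedding-based metric could look fine while this one reports that half the identifiers are gone.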
The golden set is the actual work
The metrics are easy once you have something to score against. Building that something is the job. A golden set is a small file of real questions paired with their known-good answers and the context those answers should come from. Start with thirty to fifty questions drawn from how people actually use the thing, not from what is convenient to ask. A production set grows to a few hundred, which is enough to make a metric move meaningfully without turning every run into an afternoon.
Four kinds of question earn their place. Plain factual ones with a single clear answer. Multi-document questions that force the system to stitch context from more than one place. Ambiguous ones with more than one fair reading. And the kind almost everyone leaves out, the question whose answer is not in your corpus at all.
The question with no answer
That last kind is the one that matters most. A RAG system that always produces a confident answer is not grounded, it is guessing with extra steps. The unanswerable question is the only way to test whether the system will say it does not know. Put a handful in the set and watch what the system does with them. If it keeps declining them while its scores on the answerable questions improve, you have a system that respects its own limits. If it answers them anyway, you have a liability that happens to demo beautifully.
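A crude way to put a number on that behavior is an abstention rate over the unanswerable questions. The sketch below assumes a refusal can be spotted by a few marker phrases, which is a loud simplification: the phrases are hypothetical and should be replaced with whatever your system is actually prompted to say, or with a judge model.

```python
# Hypothetical refusal phrases; adjust to match your system's real wording.
ABSTAIN_MARKERS = (
    "i don't know",
    "i do not know",
    "not in the provided",
    "cannot find",
    "no information",
)


def abstained(answer):
    """True if the answer contains any of the refusal marker phrases."""
    text = answer.casefold()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(answers):
    """Share of answers that declined to answer."""
    if not answers:
        return 0.0
    return sum(abstained(a) for a in answers) / len(answers)


# Made-up outputs for two unanswerable questions.
unanswerable_outputs = [
    "I don't know based on the provided documents.",
    "The founder is Jane Doe.",
]
rate = abstention_rate(unanswerable_outputs)
```

On unanswerable questions you want this rate close to 1.0; on answerable ones you want it close to 0.0, so track the two groups separately.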
A golden set is just a file. Keep it boring and machine-readable, one record per line.
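One possible shape for that file is JSON Lines, sketched here with a small loader. The field names (id, kind, question, expected_answer, expected_chunk_ids) and both sample records are assumptions for illustration, not a standard schema.

```python
import json

# Two made-up records, one per line; an unanswerable question has no expected answer.
GOLDEN_JSONL = """\
{"id": "q001", "kind": "factual", "question": "What is the refund window?", "expected_answer": "30 days", "expected_chunk_ids": ["policy-12"]}
{"id": "q002", "kind": "unanswerable", "question": "Who founded the company?", "expected_answer": null, "expected_chunk_ids": []}
"""

REQUIRED_FIELDS = {"id", "kind", "question", "expected_answer", "expected_chunk_ids"}


def load_golden(text):
    """Parse JSON Lines text into records, failing loudly on a malformed record."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


golden = load_golden(GOLDEN_JSONL)
```

One record per line keeps diffs readable: adding or retiring a question shows up as a single changed line in review.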
Keep it in version control, next to the code
A golden set belongs in the repository, beside the code it tests, because it is a test. When someone changes the chunk size, swaps the embedding model, or retunes the fusion weight, the suite runs and the numbers say whether the change helped. A drop in context recall after a chunking change is not a mystery to debug for a day; it is the chunking change, and you have the receipt in the diff. Start the passing threshold low, around seventy percent, and raise it as the system earns it. Rotate stale questions out and feed real production failures back in, so the set keeps describing the system you actually run, not the one you shipped six months ago. Watch the unanswerable questions especially: as the corpus grows, a question that had no answer last quarter can quietly gain one, and an unanswerable test that is now answerable is silently grading the wrong thing. Re-verify that handful on every big ingestion.
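The gate itself can be very small. A minimal sketch, assuming the eval run produces a dictionary of metric averages: the metric names and the scores below are made up, and the 0.70 floor is the starting threshold suggested above.

```python
# Starting floors; raise them as the system earns it.
THRESHOLDS = {"context_recall": 0.70, "faithfulness": 0.70}


def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return {metric: (score, floor)} for every metric that is missing or below its floor."""
    failures = {}
    for name, floor in thresholds.items():
        score = scores.get(name)
        if score is None or score < floor:
            failures[name] = (score, floor)
    return failures


# Hypothetical run after a chunking change.
run = {"context_recall": 0.64, "faithfulness": 0.81}
failures = failing_metrics(run)
```

In a test suite this would end with an assertion that failures is empty, so a chunking change that drops context recall fails the build instead of surfacing weeks later.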
The judge is useful, and fallible
You are not going to grade a few hundred answers by hand on every commit, so a language model scores most of them for you. The cost is small enough to ignore, well under a dollar for a couple hundred questions, and it scales to thousands of production samples a day. It also fails quietly, and the quiet failure to watch for is that judges tend to reward the better-formatted answer over the more-correct one, so a tidy wrong answer can outscore a plain right one. A judge model also shares blind spots with the model it is grading, and it gets unreliable on exactly the domain-specific language your corpus is full of. Anchor it with a strict rubric that scores facts and not polish, run it every time, and have a human read a sample every so often. Trust the number over the vibe, and still do not trust the number all the way.
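Two of those habits are easy to sketch: a rubric that tells the judge to score facts and ignore polish, and a repeatable sample for the human read. The rubric wording, the 10 percent fraction, and the function names are all assumptions, and the call to the judge model itself is left out because it depends on your provider.

```python
import random

# Hypothetical rubric: binary factual support, formatting explicitly excluded.
RUBRIC = (
    "You are grading an answer against retrieved context.\n"
    "Score 1 if every claim in the answer is supported by the context, otherwise 0.\n"
    "Ignore formatting, length, and tone.\n\n"
    "Question: {question}\n\n"
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Reply with a single digit, 0 or 1."
)


def build_judge_prompt(question, context, answer):
    """Fill the rubric template for one graded answer."""
    return RUBRIC.format(question=question, context=context, answer=answer)


def sample_for_human_review(record_ids, fraction=0.1, seed=0):
    """Pick a reproducible subset of record ids for a person to read."""
    ids = list(record_ids)
    if not ids:
        return []
    count = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, count))


prompt = build_judge_prompt("What is the refund window?", "Refunds within 30 days.", "30 days")
ids = [f"q{n:03d}" for n in range(1, 21)]
to_review = sample_for_human_review(ids)
```

Fixing the seed means two people looking at the same run read the same answers; change it between runs so the human sample does not keep landing on the same questions.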
What 2muchcoffee covers
We build production RAG, and the eval is not the last thing we add, it is the first, because without it every later decision is a guess wearing a confidence interval. If you have a system that demos well and no way to tell whether your last change helped or hurt, that is the gap we close before anything else. The plain path to start is the AI work we do.
One concrete action
Open a file and write twenty questions a real user would ask your system. Make five of them questions your corpus genuinely cannot answer. Run all twenty by hand and read the outputs. You will find the holes inside an hour, before you have built a single metric, and those twenty questions are the seed of the set that tells you the truth from here on.