RAG is now the default architecture for knowledge-intensive LLM applications. But evaluating a RAG system is still where teams lose the most time. This guide covers the metrics that matter, how to build a golden dataset, and how to wire evaluation into your development and production pipeline.
Why RAG evaluation is different from LLM evaluation
A RAG pipeline chains two systems: a retriever and a generator. This creates an important diagnostic problem — a correct answer can come from bad retrieval, and a wrong answer can come from good retrieval.
If your retriever returns noisy, partially-relevant chunks, a capable generative model might still synthesize a correct answer. If your retriever returns perfect chunks but the generator hallucinates, you'll see a wrong answer despite excellent retrieval. Aggregate end-to-end metrics will tell you your pipeline is broken, but not which half to fix.
This is why you need to evaluate retrieval and generation separately before aggregating. Skipping this step means you'll spend weeks tuning prompts when the real problem is your chunking strategy, or overhauling your retriever when the issue is a prompt that ignores the provided context.
The retrieval half: metrics that actually matter
Retrieval metrics measure how well your system finds the right documents for a given query. They divide into two categories: rank-unaware metrics that just check what's present, and rank-aware metrics that penalize finding the right document in position 10 vs position 1.
Recall@K
When to use: You have ground truth chunk IDs (e.g., your golden dataset includes which source documents each question should be answered from).
What it measures: Of all the relevant documents that could answer the question, what fraction did you retrieve in your top K results?
Formula: Recall@K = |relevant ∩ retrieved_K| / |relevant|
A recall of 0.85 at K=5 means you're finding 85% of the relevant documents when you retrieve 5 chunks. If recall is low, your embedding model, chunking strategy, or retrieval index needs work.
Precision@K
When to use: When you care about retrieval noise as much as coverage. Retrieving 10 chunks when only 2 are relevant means your generator is working with a lot of noise.
What it measures: Of the K documents you retrieved, what fraction were actually relevant?
Formula: Precision@K = |relevant ∩ retrieved_K| / K
Low precision is a signal that your similarity threshold is too permissive, or your query expansion is generating noise. It's also a cost issue — irrelevant chunks consume context window and increase latency.
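Both metrics are a few lines of code once you have ground-truth chunk IDs. A minimal sketch (function names are illustrative, not part of any library):

```python
def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(relevant & top_k) / len(relevant)

def precision_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    top_k = set(retrieved[:k])
    return len(relevant & top_k) / k

relevant = {"doc-1", "doc-7"}
retrieved = ["doc-1", "doc-3", "doc-7", "doc-9", "doc-2"]
print(recall_at_k(relevant, retrieved, k=5))     # 1.0 (both relevant docs retrieved)
print(precision_at_k(relevant, retrieved, k=5))  # 0.4 (2 of 5 chunks are relevant)
```

Note the asymmetry: retrieving more chunks can only raise recall, but it usually lowers precision.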
Mean Reciprocal Rank (MRR)
When to use: When the rank of the first relevant document matters most: single-document (K=1) RAG patterns, or when your generator heavily weights the first retrieved chunk.
What it measures: The average of 1/rank for the first relevant document across all queries. If the first relevant document appears at rank 1, score is 1.0. At rank 2, it's 0.5. At rank 5, it's 0.2.
MRR is particularly relevant when your prompt template places the first retrieved chunk in the most prominent position, which some models pay more attention to.
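The computation follows directly from the definition above; a sketch over a batch of queries (a query with no relevant document retrieved contributes 0):

```python
def mean_reciprocal_rank(results: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(results)

queries = [
    (["doc-a", "doc-b"], {"doc-a"}),  # first relevant at rank 1 -> 1.0
    (["doc-c", "doc-d"], {"doc-d"}),  # first relevant at rank 2 -> 0.5
]
print(mean_reciprocal_rank(queries))  # 0.75
```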
Normalized Discounted Cumulative Gain (NDCG)
When to use: When relevance isn't binary — some chunks are more relevant than others. For example, a chunk that directly answers the question should score higher than one that provides tangential context.
NDCG accounts for this graded relevance and also penalizes relevant documents appearing lower in the ranking. It's the most complete retrieval metric but requires more annotation effort (you need graded relevance scores, not just binary relevant/not-relevant labels).
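A sketch of NDCG using the linear-gain form of DCG (some implementations use 2^rel - 1 instead; the grading scale here is illustrative):

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: graded relevance discounted by log2 of position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded labels for retrieved chunks, in retrieval order:
# 2 = directly answers, 1 = tangential context, 0 = irrelevant.
print(ndcg_at_k([1, 2, 0], k=3))  # < 1.0: the best chunk is not ranked first
print(ndcg_at_k([2, 1, 0], k=3))  # 1.0: ranking matches the ideal order
```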
Context relevance (LLM-judged)
When to use: When you don't have ground truth document IDs. This is the most common situation when starting out — you have production traffic but haven't built a labeled evaluation set yet.
An LLM judge reads each retrieved chunk and the original query, then scores whether that chunk is relevant to answering the question. It's noisier than ground truth metrics but requires zero annotation to get started.
curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the refund policy for annual subscriptions?",
    "response": "Annual subscriptions can be refunded within 30 days of purchase; refunds take 5-7 business days.",
    "retrieved_contexts": [
      "Annual subscriptions can be refunded within 30 days of purchase.",
      "Our pricing starts at $99/month for the starter plan.",
      "Refunds are processed within 5-7 business days."
    ],
    "metrics": ["context_relevancy"]
  }'

The response:

{
  "metrics": [
    {
      "name": "context_relevancy",
      "score": 0.67,
      "percentage": 67,
      "interpretation": "Retrieved passages are partially aligned with the query (pricing chunk less pertinent).",
      "error": null
    }
  ],
  "summary": { "...": "..." },
  "request": { "...": "..." }
}

A context relevance score below 0.6 is a strong signal that you're retrieving noisy chunks. Above 0.85, your retrieval is generally clean enough to move focus to generation quality.
The generation half: faithfulness, relevance, completeness
Generation metrics measure whether the model uses the retrieved context well. Even with perfect retrieval, a model can hallucinate, ignore relevant passages, or give technically correct but unhelpful answers.
Four metrics work together here as a diagnostic system: faithfulness, answer relevance, context precision, and context recall. Here's what different combinations tell you:
- Low faithfulness + high context precision → Your model is hallucinating even though the retrieved context is relevant. Try a stricter "only use the provided documents" instruction.
- High faithfulness + low answer relevance → The model is grounded but evasive. It's summarizing context instead of answering the question directly.
- Low context precision + high context recall → The needed content is in the retrieved set, but it's buried among irrelevant or poorly ranked chunks. Usually a reranking or chunk-ordering issue.
- Low context recall → Almost always a retrieval problem. The model can't answer from what it was given.
import os

import requests

# The API key is read from the environment rather than hard-coded.
EVAL_NINJA_KEY = os.environ["EVAL_NINJA_KEY"]

response = requests.post(
    "https://api.eval.ninja/v1/evaluate",
    headers={"Authorization": f"Bearer {EVAL_NINJA_KEY}"},
    json={
        "user_input": "What are the system requirements for running eval.ninja self-hosted?",
        "response": "eval.ninja requires Docker 20.10 or higher and at least 2GB of RAM.",
        "retrieved_contexts": [
            "System requirements: Docker 20.10+, 2GB RAM minimum, 4GB recommended.",
            "Network requirements: outbound HTTPS to your LLM provider's API.",
            "Storage: 500MB for the Docker image, plus space for your evaluation history.",
        ],
        "reference": "The minimum system requirements are Docker 20.10 or higher and 2GB RAM.",
        "metrics": ["faithfulness", "answer_relevancy", "context_precision", "context_recall"],
    },
)
response.raise_for_status()  # fail fast on HTTP errors before parsing the body
for m in response.json()["metrics"]:
    print(m["name"], m["score"])
Building a golden dataset for RAG
A golden dataset is your test suite for the pipeline. Every evaluation framework needs one, and the quality of your dataset determines the quality of your evaluations. You can't catch regressions you haven't defined.
Where to source examples
Production traces (best): Take real questions your users have asked and label the correct answer and source documents. These are the most representative — they reflect the actual distribution of queries your system will handle. Start here if you have any production traffic at all.
Synthetic generation: Use an LLM to generate questions from your document corpus. "Given this document, generate 5 questions a user might ask that can be answered from this text." Useful for getting initial coverage but tends to produce easier-than-real questions.
Manual curation: Domain experts write adversarial cases — ambiguous questions, edge cases, questions that require synthesizing multiple documents. Expensive but catches the failures that matter most.
The shape of each example
Each example in your golden dataset should contain:
- Question: The user query, verbatim.
- Expected answer: A reference answer used for answer relevance and recall scoring.
- Expected source IDs: The chunk or document IDs that should be retrieved (for Recall@K, Precision@K).
- Expected context snippets: The specific passages that should inform the answer (for context recall).
{
  "id": "q-0042",
  "question": "Can I export my evaluation history as CSV?",
  "expected_answer": "Yes, eval history can be exported as CSV from the product dashboard.",
  "expected_source_ids": ["doc-export-guide", "doc-api-reference"],
  "expected_contexts": [
    "Evaluation history can be exported as CSV from Settings > Export.",
    "Export formats may include CSV from the dashboard depending on your plan."
  ],
  "metadata": {
    "category": "product-features",
    "difficulty": "easy",
    "source": "production-trace"
  }
}

How many examples is enough
- 50 examples: Enough to catch major regressions. Use this as your starting CI gate.
- 200+ examples: Meaningful confidence intervals; score changes of roughly ±0.03 can now be distinguished from noise.
- 1000+ examples: Enough to segment by query category (e.g., "pricing questions" vs "technical support") and catch failures that only show up in specific query types. This is the production baseline you want.
Version your dataset like code
Store your golden dataset in version control alongside your application code. Every time you add examples, tag the dataset version in your eval runs. This gives you an accurate picture of whether score changes are due to pipeline changes or dataset changes — a common source of confusion when teams start seeing metric drift.
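One lightweight way to tag runs is a content hash of the dataset file. A sketch, assuming the dataset lives in a single JSON file (the `dataset_version` helper and the metadata field name are hypothetical, not part of any API):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def dataset_version(path: Path) -> str:
    """Short, stable content hash of the golden dataset file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

# Demo: byte-identical datasets hash the same; any edit changes the tag.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "golden.json"
    dataset.write_text(json.dumps([{"id": "q-0042", "question": "..."}]))
    tag = dataset_version(dataset)
    print({"dataset_version": tag})  # attach this to every eval run's metadata
```

A git commit SHA works just as well if the dataset lives in the same repo; the point is that every recorded score is traceable to an exact dataset state.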
The evaluation loop: dev, CI, production
Evaluation is only useful if it's embedded in your workflow. Running a one-off eval script is a good start, but the real value comes from blocking bad changes before they reach production.
Dev: fast feedback on every change
Run a 20–50 example subset of your golden dataset on every prompt change or pipeline modification. This should complete in under 2 minutes. The goal isn't statistical rigor — it's catching obvious regressions fast enough that you don't commit broken changes.
curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d @eval-sample.json \
  | jq '.summary.average_score'

CI: full suite on every PR
Run your full golden dataset on every pull request that touches the RAG pipeline. Set threshold gates — if faithfulness drops below 0.80, or context recall drops more than 0.05 from the main branch baseline, fail the build. This is your production quality gate.
# .github/workflows/eval.yml
name: RAG Evaluation

on:
  pull_request:
    paths: ['src/rag/**', 'prompts/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: |
          curl -X POST https://api.eval.ninja/v1/evaluate \
            -H "Authorization: Bearer ${{ secrets.EVAL_NINJA_KEY }}" \
            -H "Content-Type: application/json" \
            -d @tests/eval-sample.json \
            -o eval-results.json
      - name: Check thresholds
        run: |
          python scripts/check_eval_thresholds.py \
            --results eval-results.json \
            --min-faithfulness 0.80 \
            --max-regression 0.05

Production: sample and alert on drift
Log 1–5% of live traffic with its retrieved context and generated answer. Run evaluations asynchronously on these samples and alert if rolling scores drop significantly from your CI baseline. This catches drift caused by data distribution changes or upstream model updates — things that your static golden dataset won't catch.
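A sketch of the sampling and alerting logic, assuming eval scores arrive asynchronously; all constants are illustrative and the queueing and API-posting plumbing is omitted:

```python
import random
from collections import deque

SAMPLE_RATE = 0.02   # evaluate ~2% of live traffic
WINDOW = 200         # rolling window of recent eval scores
BASELINE = 0.86      # faithfulness baseline from CI (illustrative)
ALERT_DELTA = 0.05   # alert if the rolling mean drops this far below baseline

recent_scores: deque[float] = deque(maxlen=WINDOW)

def should_evaluate() -> bool:
    """Per-request decision: enqueue this trace for async evaluation?"""
    return random.random() < SAMPLE_RATE

def record_score(score: float) -> bool:
    """Record an async eval score; return True if drift crosses the alert threshold."""
    recent_scores.append(score)
    rolling = sum(recent_scores) / len(recent_scores)
    return BASELINE - rolling > ALERT_DELTA

# Simulated drift: production scores slowly degrade from the baseline.
alerted = False
for i in range(200):
    if record_score(0.86 - i * 0.001):
        alerted = True
        break
print(alerted)  # True: the rolling mean has drifted more than 0.05 below baseline
```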
Common pitfalls
Aggregate metrics hiding systematic failures
An overall faithfulness of 0.85 can mask a specific query category scoring 0.40. Always segment your eval results by query type, document category, or any dimension that makes sense for your use case. A 200-example dataset with 10 categories gives you 20 examples per category — enough to surface category-level failures.
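Segmentation is a one-liner's worth of logic over per-example results. A minimal sketch, assuming each eval row carries a category label from your dataset metadata (the row shape is illustrative):

```python
from collections import defaultdict

def scores_by_category(rows: list[dict]) -> dict[str, float]:
    """Mean faithfulness per query category."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row["faithfulness"])
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}

rows = [
    {"category": "pricing", "faithfulness": 0.40},
    {"category": "pricing", "faithfulness": 0.44},
    {"category": "technical-support", "faithfulness": 0.92},
]
print(scores_by_category(rows))
# The overall mean (~0.59) would hide the pricing bucket sitting at 0.42.
```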
Context window truncation in eval not matching prod
If your eval harness passes a different amount of context to the model than your production pipeline, your eval scores won't reflect production behavior. Make sure your eval configuration matches your prod context budget exactly — same max tokens, same truncation strategy, same ordering of chunks.
Judge model drift
LLM-judged metrics like faithfulness and context relevance depend on the judge model. When the judge model is updated, your metric values will shift even if your pipeline hasn't changed. Pin your judge model version and recalibrate against human labels when you upgrade it.
Optimizing for the metric, not the user
It's possible to score very high on faithfulness by writing extremely conservative answers that say "the documents don't contain enough information to answer this question" to every query. Always pair automated metrics with a human review step on a sample of outputs. The metrics tell you where to look; human review tells you whether what you're seeing is actually a problem.
Tooling decision: build vs use a platform
Building your own eval framework works well if you have one RAG application, a Python-first team, and the time to maintain eval infrastructure. You'll write the judge prompts, implement the metrics, build the CI integration, and handle dataset versioning yourself.
A platform pays off when you have multiple applications with the same eval needs, want CI integration without writing the plumbing, need non-engineers to review results, or want hosted history so you can compare runs over time. eval.ninja covers this: same API, same metrics, callable from any language via a curl-able REST API, with optional Docker self-hosting for data residency requirements.