RAG is now the default architecture for knowledge-intensive LLM applications. But evaluating a RAG system is still where most teams lose the most time. This guide covers the metrics that matter, how to build a golden dataset, and how to wire evaluation into your development and production pipeline.

Why RAG evaluation is different from LLM evaluation

A RAG pipeline chains two systems: a retriever and a generator. This creates an important diagnostic problem — a correct answer can come from bad retrieval, and a wrong answer can come from good retrieval.

If your retriever returns noisy, partially-relevant chunks, a capable generative model might still synthesize a correct answer. If your retriever returns perfect chunks but the generator hallucinates, you'll see a wrong answer despite excellent retrieval. Aggregate end-to-end metrics will tell you your pipeline is broken, but not which half to fix.

This is why you need to evaluate retrieval and generation separately before aggregating. Skipping this step means you'll spend weeks tuning prompts when the real problem is your chunking strategy, or overhauling your retriever when the issue is a prompt that ignores the provided context.


The retrieval half: metrics that actually matter

Retrieval metrics measure how well your system finds the right documents for a given query. They divide into two categories: rank-unaware metrics that just check what's present, and rank-aware metrics that penalize finding the right document in position 10 vs position 1.

Recall@K

When to use: You have ground truth chunk IDs (e.g., your golden dataset includes which source documents each question should be answered from).

What it measures: Of all the relevant documents that could answer the question, what fraction did you retrieve in your top K results?

Formula: Recall@K = |relevant ∩ retrieved_K| / |relevant|

A recall of 0.85 at K=5 means you're finding 85% of the relevant documents when you retrieve 5 chunks. If recall is low, your embedding model, chunking strategy, or retrieval index needs work.

Precision@K

When to use: When you care about retrieval noise as much as coverage. Retrieving 10 chunks when only 2 are relevant means your generator is working with a lot of noise.

What it measures: Of the K documents you retrieved, what fraction were actually relevant?

Formula: Precision@K = |relevant ∩ retrieved_K| / K

Low precision is a signal that your similarity threshold is too permissive, or your query expansion is generating noise. It's also a cost issue — irrelevant chunks consume context window and increase latency.
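Both metrics reduce to simple set operations once you have labeled chunk IDs. A minimal Python sketch (the chunk IDs below are made up for illustration):

def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-K results."""
    top_k = set(retrieved[:k])
    return len(relevant & top_k) / len(relevant) if relevant else 0.0

def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the top-K results that are actually relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k if k else 0.0

# Hypothetical example: two relevant chunks, five retrieved.
relevant = {"doc-refunds", "doc-billing"}
retrieved = ["doc-refunds", "doc-pricing", "doc-billing", "doc-faq", "doc-intro"]
print(recall_at_k(relevant, retrieved, k=5))     # 1.0 -- both relevant chunks were found
print(precision_at_k(relevant, retrieved, k=5))  # 0.4 -- only 2 of 5 retrieved chunks are relevant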

Mean Reciprocal Rank (MRR)

When to use: When you only retrieve one document (K=1 RAG patterns), or when your generator heavily weights the first retrieved chunk.

What it measures: The average of 1/rank for the first relevant document across all queries. If the first relevant document appears at rank 1, the score is 1.0. At rank 2, it's 0.5. At rank 5, it's 0.2.

MRR is particularly relevant when your prompt template places the first retrieved chunk in the most prominent position, which some models pay more attention to.
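A quick sketch of the computation (the example ranks are made up; None marks a query where no relevant chunk was retrieved at all):

def mean_reciprocal_rank(first_relevant_ranks: list) -> float:
    """Average of 1/rank of the first relevant document per query."""
    scores = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(scores) / len(scores) if scores else 0.0

# Three queries: first relevant chunk at rank 1, at rank 3, and never retrieved.
print(mean_reciprocal_rank([1, 3, None]))  # (1.0 + 0.333 + 0.0) / 3 ≈ 0.44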

Normalized Discounted Cumulative Gain (NDCG)

When to use: When relevance isn't binary — some chunks are more relevant than others. For example, a chunk that directly answers the question should score higher than one that provides tangential context.

NDCG accounts for this graded relevance and also penalizes relevant documents appearing lower in the ranking. It's the most complete retrieval metric but requires more annotation effort (you need graded relevance scores, not just binary relevant/not-relevant labels).
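A minimal sketch using one common DCG formulation, assuming graded labels on a 0–2 scale (0 = irrelevant, 2 = directly answers the question); the example numbers are illustrative:

import math

def dcg(relevances: list) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of the 1-based position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances: list, k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of chunks in retrieved order: a tangential chunk is ranked first.
print(ndcg_at_k([1, 2, 0, 2, 0], k=5))  # ≈ 0.83, below 1.0 because the best chunks aren't ranked first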

Context relevance (LLM-judged)

When to use: When you don't have ground truth document IDs. This is the most common situation when starting out — you have production traffic but haven't built a labeled evaluation set yet.

An LLM judge reads each retrieved chunk and the original query, then scores whether that chunk is relevant to answering the question. It's noisier than ground truth metrics but requires zero annotation to get started.

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the refund policy for annual subscriptions?",
    "response": "Annual subscriptions can be refunded within 30 days of purchase; refunds take 5-7 business days.",
    "retrieved_contexts": [
      "Annual subscriptions can be refunded within 30 days of purchase.",
      "Our pricing starts at $99/month for the starter plan.",
      "Refunds are processed within 5-7 business days."
    ],
    "metrics": ["context_relevancy"]
  }'
Output
{
  "metrics": [
    {
      "name": "context_relevancy",
      "score": 0.67,
      "percentage": 67,
      "interpretation": "Retrieved passages are partially aligned with the query (pricing chunk less pertinent).",
      "error": null
    }
  ],
  "summary": { "...": "..." },
  "request": { "...": "..." }
}

A context relevance score below 0.6 is a strong signal that you're retrieving noisy chunks. Above 0.85, your retrieval is generally clean enough to move focus to generation quality.


The generation half: faithfulness, relevance, completeness

Generation metrics measure whether the model uses the retrieved context well. Even with perfect retrieval, a model can hallucinate, ignore relevant passages, or give technically correct but unhelpful answers.

  • Faithfulness: Does every claim in the answer trace back to the retrieved context? Measures hallucination. A faithful answer makes no claims unsupported by the provided documents, even if those claims are factually true.
  • Answer relevance: Does the answer actually address the question asked? A faithful answer can still be irrelevant if it summarizes context without answering the specific question. Measures directness.
  • Context precision: Of all the retrieved context passed to the model, how much of it was actually used in the answer? Low precision means your generator is ignoring relevant chunks — often a sign of a poorly structured prompt or overly long context.
  • Context recall: Was the retrieved context sufficient to construct a complete answer? Low recall means important information wasn't in the retrieved chunks — a retrieval problem, not a generation problem.

These four metrics work together as a diagnostic system. Here's what different combinations tell you:

  • Low faithfulness + high context precision → Your model is hallucinating despite using the context. Try a stricter "only use the provided documents" instruction.
  • High faithfulness + low answer relevance → The model is grounded but evasive. It's summarizing context instead of answering the question directly.
  • Low context precision + high context recall → Your context has what's needed but the model isn't using it. Prompt engineering or reranking issue.
  • Low context recall → Almost always a retrieval problem. The model can't answer from what it was given.
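The Python example below requests all four generation metrics in a single call, mirroring the earlier curl request.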
import os

import requests

# Read the API key from the environment rather than hardcoding it.
EVAL_NINJA_KEY = os.environ["EVAL_NINJA_KEY"]

response = requests.post(
    "https://api.eval.ninja/v1/evaluate",
    headers={"Authorization": f"Bearer {EVAL_NINJA_KEY}"},
    json={
        "user_input": "What are the system requirements for running eval.ninja self-hosted?",
        "response": "eval.ninja requires Docker 20.10 or higher and at least 2GB of RAM.",
        "retrieved_contexts": [
            "System requirements: Docker 20.10+, 2GB RAM minimum, 4GB recommended.",
            "Network requirements: outbound HTTPS to your LLM provider's API.",
            "Storage: 500MB for the Docker image, plus space for your evaluation history."
        ],
        "reference": "The minimum system requirements are Docker 20.10 or higher and 2GB RAM.",
        "metrics": ["faithfulness", "answer_relevancy", "context_precision", "context_recall"],
    },
)

result = response.json()
for m in result["metrics"]:
    print(m["name"], m["score"])

Building a golden dataset for RAG

A golden dataset is your test suite for the pipeline. Every evaluation framework needs one, and the quality of your dataset determines the quality of your evaluations. You can't catch regressions you haven't defined.

Where to source examples

Production traces (best): Take real questions your users have asked and label the correct answer and source documents. These are the most representative — they reflect the actual distribution of queries your system will handle. Start here if you have any production traffic at all.

Synthetic generation: Use an LLM to generate questions from your document corpus. "Given this document, generate 5 questions a user might ask that can be answered from this text." Useful for getting initial coverage but tends to produce easier-than-real questions.

Manual curation: Domain experts write adversarial cases — ambiguous questions, edge cases, questions that require synthesizing multiple documents. Expensive but catches the failures that matter most.

The shape of each example

Each example in your golden dataset should contain:

  • Question: The user query, verbatim.
  • Expected answer: A reference answer used for answer relevance and recall scoring.
  • Expected source IDs: The chunk or document IDs that should be retrieved (for Recall@K, Precision@K).
  • Expected context snippets: The specific passages that should inform the answer (for context recall).
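Putting those fields together, a single dataset entry might look like this: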
{
  "id": "q-0042",
  "question": "Can I export my evaluation history as CSV?",
  "expected_answer": "Yes, eval history can be exported as CSV from the product dashboard.",
  "expected_source_ids": ["doc-export-guide", "doc-api-reference"],
  "expected_contexts": [
    "Evaluation history can be exported as CSV from Settings > Export.",
    "Export formats may include CSV from the dashboard depending on your plan."
  ],
  "metadata": {
    "category": "product-features",
    "difficulty": "easy",
    "source": "production-trace"
  }
}

How many examples are enough

  • 50 examples: Enough to catch major regressions. Use this as your starting CI gate.
  • 200+ examples: Confidence intervals become meaningful; score changes of roughly ±0.03 stand out from run-to-run noise.
  • 1000+ examples: Enough to segment by query category (e.g., "pricing questions" vs "technical support") and catch failures that only show up in specific query types. This is the production baseline you want.

Version your dataset like code

Store your golden dataset in version control alongside your application code. Every time you add examples, tag the dataset version in your eval runs. This gives you an accurate picture of whether score changes are due to pipeline changes or dataset changes — a common source of confusion when teams start seeing metric drift.
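One lightweight way to do this is to hash the dataset file and attach that hash to every run record; a sketch, where the file path and record fields are illustrative rather than part of any existing tooling:

import hashlib
import json
from pathlib import Path

def dataset_version(path: str) -> str:
    """Content hash of the golden dataset file, used as its version identifier."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

# Record the dataset version next to the eval results so runs stay comparable over time.
run_record = {
    "dataset_path": "tests/golden-dataset.json",
    "dataset_version": dataset_version("tests/golden-dataset.json"),
    "pipeline_commit": "abc1234",  # e.g. from `git rev-parse --short HEAD`
}
print(json.dumps(run_record, indent=2))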


The evaluation loop: dev, CI, production

Evaluation is only useful if it's embedded in your workflow. Running a one-off eval script is a good start, but the real value comes from blocking bad changes before they reach production.

Dev: fast feedback on every change

Run a 20–50 example subset of your golden dataset on every prompt change or pipeline modification. This should complete in under 2 minutes. The goal isn't statistical rigor — it's catching obvious regressions fast enough that you don't commit broken changes.

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d @eval-sample.json \
  | jq '.summary.average_score'

CI: full suite on every PR

Run your full golden dataset on every pull request that touches the RAG pipeline. Set threshold gates — if faithfulness drops below 0.80, or context recall drops more than 0.05 from the main branch baseline, fail the build. This is your production quality gate.

# .github/workflows/eval.yml
name: RAG Evaluation
on:
  pull_request:
    paths: ['src/rag/**', 'prompts/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluation suite
        run: |
          curl -X POST https://api.eval.ninja/v1/evaluate \
            -H "Authorization: Bearer ${{ secrets.EVAL_NINJA_KEY }}" \
            -H "Content-Type: application/json" \
            -d @tests/golden-dataset.json \
            -o eval-results.json
      - name: Check thresholds
        run: |
          python scripts/check_eval_thresholds.py \
            --results eval-results.json \
            --min-faithfulness 0.80 \
            --max-regression 0.05
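The threshold script isn't shown above; here is a minimal sketch of what scripts/check_eval_thresholds.py could look like, assuming the results file contains the metrics array shown earlier and that main-branch baseline scores live in a committed tests/baseline.json (both assumptions, not existing tooling):

#!/usr/bin/env python3
"""Fail the build if eval scores fall below fixed thresholds or regress from the baseline."""
import argparse
import json
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--results", required=True)
parser.add_argument("--min-faithfulness", type=float, default=0.80)
parser.add_argument("--max-regression", type=float, default=0.05)
parser.add_argument("--baseline", default="tests/baseline.json")  # main-branch scores, committed
args = parser.parse_args()

with open(args.results) as f:
    results = json.load(f)
scores = {m["name"]: m["score"] for m in results["metrics"]}

failures = []
if scores.get("faithfulness", 1.0) < args.min_faithfulness:
    failures.append(f"faithfulness {scores['faithfulness']:.2f} < {args.min_faithfulness}")

try:
    with open(args.baseline) as f:
        baseline = json.load(f)
    for name, score in scores.items():
        if name in baseline and baseline[name] - score > args.max_regression:
            failures.append(f"{name} regressed {baseline[name]:.2f} -> {score:.2f}")
except FileNotFoundError:
    pass  # no baseline committed yet: only the absolute thresholds apply

if failures:
    print("Eval gate failed:\n  " + "\n  ".join(failures))
    sys.exit(1)
print("Eval gate passed.")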

Production: sample and alert on drift

Log 1–5% of live traffic with its retrieved context and generated answer. Run evaluations asynchronously on these samples and alert if rolling scores drop significantly from your CI baseline. This catches drift caused by data distribution changes or upstream model updates — things that your static golden dataset won't catch.
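A sketch of the sampling side, assuming you already have the query, retrieved contexts, and answer available where responses are served (the hook itself is hypothetical; the endpoint and metrics mirror the earlier examples):

import os
import random
import requests

SAMPLE_RATE = 0.02  # evaluate roughly 2% of live traffic

def maybe_evaluate(user_input: str, contexts: list, answer: str) -> None:
    """Called after a response is generated; skips most requests by random sampling."""
    if random.random() > SAMPLE_RATE:
        return
    # In production you would enqueue this for a background worker rather than call it inline.
    requests.post(
        "https://api.eval.ninja/v1/evaluate",
        headers={"Authorization": f"Bearer {os.environ['EVAL_NINJA_KEY']}"},
        json={
            "user_input": user_input,
            "response": answer,
            "retrieved_contexts": contexts,
            "metrics": ["faithfulness", "context_relevancy"],
        },
        timeout=30,
    )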


Common pitfalls

Aggregate metrics hiding systematic failures

An overall faithfulness of 0.85 can mask a specific query category scoring 0.40. Always segment your eval results by query type, document category, or any dimension that makes sense for your use case. A 200-example dataset with 10 categories gives you 20 examples per category — enough to surface category-level failures.
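A quick way to do the segmentation, assuming each per-example result carries the category metadata from your golden dataset (the field names are illustrative):

from collections import defaultdict

def scores_by_category(results: list) -> dict:
    """Average faithfulness per query category from per-example eval records."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["metadata"]["category"]].append(r["faithfulness"])
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}

# A 0.85 overall average can hide something like:
# {"product-features": 0.93, "pricing": 0.91, "troubleshooting": 0.42}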

Context window truncation in eval not matching prod

If your eval harness passes a different amount of context to the model than your production pipeline, your eval scores won't reflect production behavior. Make sure your eval configuration matches your prod context budget exactly — same max tokens, same truncation strategy, same ordering of chunks.

Judge model drift

LLM-judged metrics like faithfulness and context relevance depend on the judge model. When the judge model is updated, your metric values will shift even if your pipeline hasn't changed. Pin your judge model version and recalibrate against human labels when you upgrade it.

Optimizing for the metric, not the user

It's possible to score very high on faithfulness by writing extremely conservative answers that say "the documents don't contain enough information to answer this question" to every query. Always pair automated metrics with a human review step on a sample of outputs. The metrics tell you where to look; human review tells you whether what you're seeing is actually a problem.


Tooling decision: build vs use a platform

Building your own eval framework works well if you have one RAG application, a Python-first team, and the time to maintain eval infrastructure. You'll write the judge prompts, implement the metrics, build the CI integration, and handle dataset versioning yourself.

A platform pays off when you have multiple applications with the same eval needs, want CI integration without writing the plumbing, need non-engineers to review results, or want hosted run history so you can compare runs over time. eval.ninja covers this: the same API and metrics shown above, callable from any language via its curl-able REST API, with an optional Docker self-host for data residency requirements.


Frequently asked questions

What is RAG evaluation?
RAG evaluation is the process of measuring the quality of a Retrieval Augmented Generation pipeline. It involves separately measuring how well the retrieval step finds relevant documents (retrieval metrics) and how well the generation step produces accurate, grounded answers from those documents (generation metrics). Unlike standard LLM evaluation, you must evaluate both halves independently because failures in each half require different fixes.

Which metrics should I use to evaluate a RAG pipeline?
For generation: faithfulness (groundedness), answer relevance, context precision, and context recall. For retrieval: Recall@K and Precision@K if you have ground truth document IDs; context relevance (LLM-judged) if you don't. If you had to pick two to start, start with faithfulness and context recall — they diagnose the most common failure modes.

How many examples does a golden dataset need?
Start with 50 examples to catch regressions. Use 200+ for meaningful confidence intervals. At 1000+ examples you can segment by query type and catch systematic failures. The exact number depends on how many query categories you serve — aim for at least 20–30 examples per category so category-level failures are detectable.

What is faithfulness in RAG evaluation?
Faithfulness (also called groundedness) measures whether every claim in the generated answer can be traced back to the retrieved context. A faithful answer makes no claims not supported by the provided documents — even if those claims happen to be factually true from the model's training data. It's the primary metric for measuring hallucination in RAG systems.

How do I evaluate retrieval without ground truth labels?
Use context relevance — an LLM-judged metric that evaluates whether each retrieved chunk is relevant to the user's question, with no ground truth required. It's noisier than labeled Recall@K but requires zero annotation effort to deploy. Once you have production traffic, label a sample of traces to build ground truth data and transition to precision/recall metrics over time.

What's the difference between context recall and answer recall?
Context recall measures whether the retrieved context contained all the information needed to answer the question — it's a property of your retrieval step. Answer recall (sometimes called answer correctness) measures whether the generated answer covers all the key points in a reference answer — it's a property of your generation step. Low context recall almost always points to a retrieval fix; low answer recall can be either retrieval or generation.