What are the most important RAG evaluation metrics?

The core RAG metrics are faithfulness, answer relevancy, context precision, context recall, and context relevancy. Together they show whether retrieval found the right evidence and whether generation used that evidence correctly.

Do I need labeled data for RAG evaluation?

You can start without labeled data by using LLM-judged metrics like faithfulness, answer relevancy, and context relevancy. Labeled data becomes important when you want precise retrieval metrics like Recall@K and Precision@K.

RAG Evaluation Metrics: Faithfulness, Relevance, Context Precision, and Recall

What is the difference between faithfulness and answer relevancy?

Faithfulness measures whether the answer is supported by the retrieved context. Answer relevancy measures whether the answer addresses the user's question. An answer can be faithful but irrelevant if it summarizes context without answering the question.

Short answer

The most useful RAG evaluation metrics are faithfulness, answer relevancy, context precision, context recall, and context relevancy. Use them together: a single score cannot tell you whether the retriever failed, the generator hallucinated, or the prompt led the model to ignore useful context.

Why RAG needs multiple metrics

A RAG system has at least two moving parts: retrieval and generation. Retrieval decides which context the model sees. Generation decides whether the model uses that context correctly. If you only score the final answer, you know that something failed, but not which part to fix.

The practical goal is diagnosis. A good metric suite should answer four questions:

Did retrieval find relevant context?
Was the context sufficient to answer the question?
Did the answer stay grounded in that context?
Did the answer actually respond to the user's question?

The core metric set

Faithfulness

Measures whether claims in the generated answer are supported by retrieved context. Low faithfulness means hallucination or unsupported synthesis.

Answer relevancy

Measures whether the answer addresses the user's question. A grounded answer can still be irrelevant if it avoids the specific ask.

Context precision

Measures how much of the retrieved context was useful for answering the question. Low precision means noisy retrieval and wasted context-window space.

Context recall

Measures whether the retrieved context contained enough information to produce a complete answer. Low recall usually means a retrieval problem.

Context relevancy

Measures whether retrieved chunks are relevant to the query. It is especially useful before you have labeled source-document IDs.
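Once labeled source-document IDs do exist, the Recall@K and Precision@K metrics mentioned in the FAQ can be computed directly, without an LLM judge. The following is a minimal sketch, not the eval.ninja implementation: the function names and chunk IDs are hypothetical, and dividing precision by K rather than by the number of chunks actually returned is one convention among several.

```python
from typing import Sequence, Set


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the top-k retrieved chunks whose ID is labeled relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes a retriever that returns fewer than k chunks.
    return hits / k


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the labeled relevant chunks that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not relevant_ids:
        raise ValueError("recall is undefined without labeled relevant IDs")
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


# Hypothetical chunk IDs, ordered by retrieval rank.
retrieved = ["refund-policy", "refund-method", "monthly-renewal"]
relevant = {"refund-policy", "refund-method"}

print(precision_at_k(retrieved, relevant, k=3))  # 2 of the 3 retrieved chunks are relevant
print(recall_at_k(retrieved, relevant, k=3))     # both labeled chunks were retrieved
```

Lowering K to 1 in this example raises precision to 1.0 but drops recall to 0.5, which is the trade-off the two context metrics are meant to expose.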

How to interpret score patterns

Pattern	Likely cause	What to change
Low faithfulness, high context recall	The answer had enough evidence but made unsupported claims.	Tighten the prompt, require citations, or use a stricter judge threshold.
High faithfulness, low answer relevancy	The model stayed grounded but did not answer directly.	Rewrite the answer instruction so it centers on the user's question rather than on summarizing the sources.
Low context precision, high context recall	The retriever found the answer plus too much noise.	Add reranking, improve chunking, or lower K.
Low context recall	The retrieved context did not contain enough evidence.	Fix embeddings, chunking, query expansion, filters, or retrieval index coverage.
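The table above can be turned into a small triage helper. This is a rough sketch under stated assumptions: the `diagnose` function is hypothetical, and the 0.7 ("low") and 0.85 ("high") cutoffs are placeholders you would calibrate on your own data, not values prescribed by any tool.

```python
from typing import Dict, List


def diagnose(scores: Dict[str, float], low: float = 0.7, high: float = 0.85) -> List[str]:
    """Map metric scores to the score patterns in the table above.

    The `low` and `high` cutoffs are assumed placeholders, not recommended values.
    """
    faithfulness = scores["faithfulness"]
    relevancy = scores["answer_relevancy"]
    precision = scores["context_precision"]
    recall = scores["context_recall"]

    findings = []
    if faithfulness < low and recall >= high:
        findings.append("Low faithfulness, high context recall: tighten the prompt or require citations.")
    if faithfulness >= high and relevancy < low:
        findings.append("High faithfulness, low answer relevancy: rewrite the answer instruction.")
    if precision < low and recall >= high:
        findings.append("Low context precision, high context recall: add reranking, improve chunking, or lower K.")
    if recall < low:
        findings.append("Low context recall: fix embeddings, chunking, query expansion, filters, or index coverage.")
    return findings


example = {
    "faithfulness": 0.96,
    "answer_relevancy": 0.91,
    "context_precision": 0.67,
    "context_recall": 0.88,
}
for finding in diagnose(example):
    print(finding)
```

With these assumed cutoffs, the example scores match only the noisy-retrieval row, so the suggested fix is on the retriever rather than the prompt.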

Example API call

One request can score both generation quality and context quality. On the cloud service, call the hosted endpoint. On self-hosted deployments, replace the base URL with your internal eval.ninja endpoint.

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "Can annual subscriptions be refunded?",
    "response": "Annual subscriptions can be refunded within 30 days of purchase.",
    "retrieved_contexts": [
      "Annual subscriptions can be refunded within 30 days of purchase.",
      "Refunds are returned to the original payment method.",
      "Monthly subscriptions renew automatically."
    ],
    "reference": "Annual subscriptions are refundable within 30 days.",
    "metrics": [
      "faithfulness",
      "answer_relevancy",
      "context_precision",
      "context_recall",
      "context_relevancy"
    ]
  }'

Example result

{
  "metrics": [
    { "name": "faithfulness", "score": 0.96 },
    { "name": "answer_relevancy", "score": 0.91 },
    { "name": "context_precision", "score": 0.67 },
    { "name": "context_recall", "score": 0.88 },
    { "name": "context_relevancy", "score": 0.72 }
  ],
  "summary": {
    "average_score": 0.83,
    "primary_failure": "Some retrieved context was unrelated to the refund question."
  }
}
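To work with a result like this in code, it helps to index the scores by metric name instead of relying on the summary alone. The sketch below assumes the response has the shape shown above; the `summarize` helper is hypothetical and is not part of any client library.

```python
import json
from statistics import mean
from typing import Dict, Tuple

# The metrics portion of the example result above.
RESULT_JSON = """
{
  "metrics": [
    { "name": "faithfulness", "score": 0.96 },
    { "name": "answer_relevancy", "score": 0.91 },
    { "name": "context_precision", "score": 0.67 },
    { "name": "context_recall", "score": 0.88 },
    { "name": "context_relevancy", "score": 0.72 }
  ]
}
"""


def summarize(result: dict) -> Tuple[Dict[str, float], float, str]:
    """Return (scores by name, rounded average, name of the weakest metric)."""
    scores = {metric["name"]: metric["score"] for metric in result["metrics"]}
    average = round(mean(list(scores.values())), 2)
    weakest = min(scores, key=scores.get)
    return scores, average, weakest


scores, average, weakest = summarize(json.loads(RESULT_JSON))
print(average)  # 0.83, matching the summary in the example result
print(weakest)  # context_precision
```

The average looks healthy, while the weakest metric points at retrieval noise, which is consistent with the `primary_failure` message in the example.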

Common mistakes

Using one average score as the quality gate

Averages hide failures. A run can look healthy while one customer segment, document category, or query type is broken. Gate on individual metrics and segment results by query class.
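A per-metric, per-segment gate might look like the following sketch. Everything here is illustrative: the `gate_by_segment` function, the `query_class` field, and the threshold values are assumptions, not part of a documented API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical minimum scores; calibrate these on your own baseline runs.
THRESHOLDS = {"faithfulness": 0.9, "context_recall": 0.8}


def gate_by_segment(rows: List[dict], thresholds: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Return (query_class, metric, mean_score) for every segment below a threshold.

    Each row is expected to look like {"query_class": str, "scores": {metric: float}}.
    A segment with no scores for a metric is skipped, so check coverage separately.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric, score in row["scores"].items():
            grouped[row["query_class"]][metric].append(score)

    failures = []
    for query_class in sorted(grouped):
        for metric, minimum in thresholds.items():
            values = grouped[query_class].get(metric)
            if not values:
                continue
            segment_mean = sum(values) / len(values)
            if segment_mean < minimum:
                failures.append((query_class, metric, segment_mean))
    return failures


rows = [
    {"query_class": "billing", "scores": {"faithfulness": 0.95, "context_recall": 0.9}},
    {"query_class": "onboarding", "scores": {"faithfulness": 0.95, "context_recall": 0.4}},
]
print(gate_by_segment(rows, THRESHOLDS))  # [('onboarding', 'context_recall', 0.4)]
```

In this made-up run the overall mean of all four scores is 0.8, which could pass a single-average gate even though retrieval for the onboarding segment is clearly broken.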

Confusing context precision with faithfulness

Context precision is about what you retrieved. Faithfulness is about what the model said. If precision is low but faithfulness is high, the answer may be grounded, but retrieval is inefficient and likely fragile, because the model has to pick the right evidence out of noise.

Evaluating against a different context window than production

If evals use 10 chunks but production uses 5, your scores will not reflect production behavior. Match chunk count, chunk order, truncation, max tokens, and system prompts.
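One lightweight way to catch this kind of drift is to compare the settings both pipelines run with before trusting a score. The sketch below is only an illustration: the `config_drift` helper and the key names (`top_k`, `chunk_order`, and so on) are hypothetical and would need to be mapped to however your own pipelines store their settings.

```python
from typing import Any, Dict, Iterable, Tuple

# Hypothetical setting names covering the items listed above.
PARITY_KEYS = ("top_k", "chunk_order", "truncation", "max_tokens", "system_prompt")


def config_drift(
    eval_cfg: Dict[str, Any],
    prod_cfg: Dict[str, Any],
    keys: Iterable[str] = PARITY_KEYS,
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (eval_value, prod_value)} for every setting that differs."""
    return {
        key: (eval_cfg.get(key), prod_cfg.get(key))
        for key in keys
        if eval_cfg.get(key) != prod_cfg.get(key)
    }


eval_cfg = {"top_k": 10, "chunk_order": "score_desc", "max_tokens": 512}
prod_cfg = {"top_k": 5, "chunk_order": "score_desc", "max_tokens": 512}

print(config_drift(eval_cfg, prod_cfg))  # {'top_k': (10, 5)}
```

Running a check like this at the start of an eval job, and failing the job when the result is non-empty, keeps mismatched scores from being reported in the first place.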