Short answer

The most useful RAG evaluation metrics are faithfulness, answer relevancy, context precision, context recall, and context relevancy. Use them together. A single score cannot tell you whether the retriever failed, the generator hallucinated, or the prompt ignored useful context.

Why RAG needs multiple metrics

A RAG system has at least two moving parts: retrieval and generation. Retrieval decides which context the model sees. Generation decides whether the model uses that context correctly. If you only score the final answer, you know that something failed, but not which part to fix.

The practical goal is diagnosis. A good metric suite should answer four questions:

  • Did retrieval find relevant context?
  • Was the context sufficient to answer the question?
  • Did the answer stay grounded in that context?
  • Did the answer actually respond to the user's question?

The core metric set

Faithfulness
Measures whether claims in the generated answer are supported by the retrieved context. Low faithfulness means hallucination or unsupported synthesis.

Answer relevancy
Measures whether the answer addresses the user's question. A grounded answer can still be irrelevant if it avoids the specific ask.

Context precision
Measures how much of the retrieved context was useful. Low precision means noisy retrieval and a wasted context window.

Context recall
Measures whether the retrieved context contained enough information to produce a complete answer. Low recall usually means a retrieval problem.

Context relevancy
Measures whether retrieved chunks are relevant to the query. It is especially useful before you have labeled source-document IDs.
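
To make the two retrieval metrics concrete, here is a minimal sketch that computes context precision and context recall from hand-labeled data. It assumes you already know which retrieved chunks are relevant and which reference facts the context covers. Evaluation services usually estimate those labels with an LLM judge and may weight by rank, so treat this as a simplified illustration. The function names are hypothetical.

```python
def context_precision(relevant_flags):
    """Fraction of retrieved chunks labeled relevant (simplified, unranked)."""
    if not relevant_flags:
        return 0.0
    return sum(relevant_flags) / len(relevant_flags)


def context_recall(reference_facts, covered_facts):
    """Fraction of reference facts found somewhere in the retrieved context."""
    needed = set(reference_facts)
    if not needed:
        return 0.0
    return len(needed & set(covered_facts)) / len(needed)


# Toy labels for one query (assumed, hand-written for illustration).
flags = [True, True, False]                  # two of three chunks are useful
facts_needed = {"fact_a", "fact_b"}          # facts in the reference answer
facts_found = {"fact_a", "fact_b", "fact_c"}  # facts present in the chunks

precision = context_precision(flags)
recall = context_recall(facts_needed, facts_found)
print(round(precision, 2), round(recall, 2))
```

Here retrieval found everything the reference answer needs, but one third of the context window went to a chunk that did not help.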

How to interpret score patterns

| Pattern | Likely cause | What to change |
| --- | --- | --- |
| Low faithfulness, high context recall | The answer had enough evidence but made unsupported claims. | Tighten the prompt, require citations, or use a stricter judge threshold. |
| High faithfulness, low answer relevancy | The model stayed grounded but did not answer directly. | Rewrite the answer instruction around the user question, not source summary. |
| Low context precision, high context recall | The retriever found the answer plus too much noise. | Add reranking, improve chunking, or lower K. |
| Low context recall | The retrieved context did not contain enough evidence. | Fix embeddings, chunking, query expansion, filters, or retrieval index coverage. |
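
The score patterns above can be turned into a small triage helper. The following is a sketch only: the `diagnose` function, the finding strings, and the `low` and `high` cutoffs are all assumptions chosen for illustration and should be tuned on your own data.

```python
def diagnose(scores, low=0.7, high=0.8):
    """Map metric scores to likely causes. Cutoffs are illustrative assumptions."""

    def is_low(name):
        return name in scores and scores[name] < low

    def is_high(name):
        return name in scores and scores[name] >= high

    findings = []
    if is_low("context_recall"):
        findings.append("retrieval: context lacked enough evidence")
    if is_low("faithfulness") and is_high("context_recall"):
        findings.append("generation: unsupported claims despite sufficient evidence")
    if is_high("faithfulness") and is_low("answer_relevancy"):
        findings.append("prompt: grounded answer that misses the question")
    if is_low("context_precision") and is_high("context_recall"):
        findings.append("retrieval: relevant evidence buried in noise")
    return findings


# Sample scores (made up) where only context precision is weak.
sample = {
    "faithfulness": 0.96,
    "answer_relevancy": 0.91,
    "context_precision": 0.67,
    "context_recall": 0.88,
}
sample_findings = diagnose(sample)
print(sample_findings)
```

Metrics missing from the input are skipped instead of being treated as failures, so the helper still works when a run requests only a subset of metrics.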

Example API call

A single request can score both generation quality and context quality. On the hosted cloud service, call the endpoint shown below. On self-hosted deployments, replace the base URL with your internal eval.ninja endpoint.

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "Can annual subscriptions be refunded?",
    "response": "Annual subscriptions can be refunded within 30 days of purchase.",
    "retrieved_contexts": [
      "Annual subscriptions can be refunded within 30 days of purchase.",
      "Refunds are returned to the original payment method.",
      "Monthly subscriptions renew automatically."
    ],
    "reference": "Annual subscriptions are refundable within 30 days.",
    "metrics": [
      "faithfulness",
      "answer_relevancy",
      "context_precision",
      "context_recall",
      "context_relevancy"
    ]
  }'

Example result

{
  "metrics": [
    { "name": "faithfulness", "score": 0.96 },
    { "name": "answer_relevancy", "score": 0.91 },
    { "name": "context_precision", "score": 0.67 },
    { "name": "context_recall", "score": 0.88 },
    { "name": "context_relevancy", "score": 0.72 }
  ],
  "summary": {
    "average_score": 0.83,
    "primary_failure": "Some retrieved context was unrelated to the refund question."
  }
}
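
One way to consume a result like this is to check each metric against its own threshold and not rely on `average_score` alone. The sketch below assumes the response has the shape shown in the example result. The threshold values are hypothetical and are not defaults of any service.

```python
import json

# Metrics copied from the example result; the shape is assumed to match it.
result_json = """
{
  "metrics": [
    { "name": "faithfulness", "score": 0.96 },
    { "name": "answer_relevancy", "score": 0.91 },
    { "name": "context_precision", "score": 0.67 },
    { "name": "context_recall", "score": 0.88 },
    { "name": "context_relevancy", "score": 0.72 }
  ]
}
"""

# Hypothetical per-metric gates; tune these for your application.
thresholds = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.80,
    "context_relevancy": 0.70,
}

result = json.loads(result_json)
scores = {m["name"]: m["score"] for m in result["metrics"]}

# Keep only the metrics that fall below their own threshold.
failures = {
    name: score
    for name, score in scores.items()
    if score < thresholds.get(name, 0.0)
}
average = sum(scores.values()) / len(scores)

print(f"average={average:.2f}")
print(f"failing metrics: {failures}")
```

With these thresholds the average looks acceptable while `context_precision` still fails its gate, which points at retrieval noise and not at generation.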

Common mistakes

Using one average score as the quality gate

Averages hide failures. A run can look healthy while one customer segment, document category, or query type is broken. Gate on individual metrics and segment results by query class.
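
A small sketch of segment-level gating shows how an acceptable overall average can mask a broken query class. The rows, the `query_class` labels, and the 0.8 threshold are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-query results, each tagged with a query class.
rows = [
    {"query_class": "billing", "faithfulness": 0.95},
    {"query_class": "billing", "faithfulness": 0.93},
    {"query_class": "billing", "faithfulness": 0.97},
    {"query_class": "legal", "faithfulness": 0.55},
]
threshold = 0.8  # assumed gate, not a recommended value

# Group scores by query class.
by_class = defaultdict(list)
for row in rows:
    by_class[row["query_class"]].append(row["faithfulness"])

overall = mean([row["faithfulness"] for row in rows])
failing_classes = sorted(
    name for name, values in by_class.items() if mean(values) < threshold
)

print(f"overall={overall:.2f}, passes={overall >= threshold}")
print(f"failing classes: {failing_classes}")
```

The overall mean clears the gate, yet the `legal` class is well below it. Gating per class surfaces that failure.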

Confusing context precision with faithfulness

Context precision is about what you retrieved. Faithfulness is about what the model said. If precision is low but faithfulness is high, your answer may be grounded, but retrieval is inefficient and likely fragile.

Evaluating against a different context window than production

If evals use 10 chunks but production uses 5, your scores will not reflect production behavior. Match chunk order, truncation, max tokens, and system prompts.
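
A cheap guard is to compare the evaluation and production settings before a run and stop if they differ. The sketch below assumes both configurations are available as plain dictionaries. The key names (`top_k`, `max_context_tokens`, `chunk_order`, `system_prompt_version`) are hypothetical placeholders for whatever your pipeline exposes.

```python
# Hypothetical setting names; replace with the ones your pipeline uses.
PARITY_KEYS = ("top_k", "max_context_tokens", "chunk_order", "system_prompt_version")


def config_drift(eval_cfg, prod_cfg, keys=PARITY_KEYS):
    """Return {key: (eval_value, prod_value)} for every setting that differs."""
    return {
        key: (eval_cfg.get(key), prod_cfg.get(key))
        for key in keys
        if eval_cfg.get(key) != prod_cfg.get(key)
    }


eval_cfg = {"top_k": 10, "max_context_tokens": 4000,
            "chunk_order": "score_desc", "system_prompt_version": "v3"}
prod_cfg = {"top_k": 5, "max_context_tokens": 4000,
            "chunk_order": "score_desc", "system_prompt_version": "v3"}

drift = config_drift(eval_cfg, prod_cfg)
print(drift)
```

An empty result means the evaluated settings match production for the checked keys. Any entry is a reason to distrust the scores until the mismatch is resolved.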

What to read next