What is a golden dataset for LLM evaluation?

A golden dataset is a curated set of representative test cases with expected answers, expected source evidence, and metadata. It is used to measure whether changes to prompts, models, retrieval settings, or application code improve or regress quality.

How many examples should a golden dataset include?

Start with 50 examples for a useful smoke test, move to 200 or more for reliable pull request checks, and use 1000 or more when you need category-level reporting across many query types.

Should golden datasets use synthetic or production examples?

Use production examples when possible because they match real user behavior. Synthetic examples are useful for bootstrapping coverage, but they should be reviewed and supplemented with real traces over time.

How to Build a Golden Dataset for LLM and RAG Evaluation

Short answer

A golden dataset is the evaluation set you trust enough to gate releases. Start with 50 representative examples, label expected answers and source evidence, add metadata for segmentation, and version the dataset like code.

What belongs in a golden dataset

Every example should be specific enough to evaluate both retrieval and generation. For RAG applications, a question and reference answer are not enough. You also need the source evidence that a good retriever should find.

User input: the exact query or task.
Reference answer: the answer a good system should produce.
Expected source IDs: documents or chunks that should be retrieved.
Expected contexts: snippets that contain the answer evidence.
Metadata: category, difficulty, language, customer segment, source, and version.

{
  "id": "refund-042",
  "user_input": "Can I get a refund for an annual plan?",
  "reference": "Annual plans can be refunded within 30 days of purchase.",
  "expected_source_ids": ["billing-refunds", "annual-plan-terms"],
  "expected_contexts": [
    "Annual plans are eligible for a full refund within 30 days of purchase.",
    "Refunds are returned to the original payment method."
  ],
  "metadata": {
    "category": "billing",
    "difficulty": "easy",
    "source": "production_trace",
    "dataset_version": "2025-02-12"
  }
}
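Before an example like the one above enters the dataset, a small structural check can catch missing labels early. The following is a minimal sketch, assuming the field names from the sample record; the function name validate_example and the specific rules are illustrative assumptions, not part of any particular tool.

```python
# Field names follow the sample record above; the rules themselves are assumptions.
REQUIRED_FIELDS = {
    "id": str,
    "user_input": str,
    "reference": str,
    "expected_source_ids": list,
    "expected_contexts": list,
    "metadata": dict,
}


def validate_example(example: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in example:
            problems.append(f"missing field: {field}")
        elif not isinstance(example[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # An empty source list is flagged for review rather than rejected outright:
    # "not enough information" cases may legitimately have no expected sources.
    if example.get("expected_source_ids") == []:
        problems.append("expected_source_ids is empty (confirm this is an unanswerable case)")
    return problems


example = {
    "id": "refund-042",
    "user_input": "Can I get a refund for an annual plan?",
    "reference": "Annual plans can be refunded within 30 days of purchase.",
    "expected_source_ids": ["billing-refunds", "annual-plan-terms"],
    "expected_contexts": [
        "Annual plans are eligible for a full refund within 30 days of purchase."
    ],
    "metadata": {"category": "billing", "source": "production_trace"},
}

print(validate_example(example))  # prints []
```

Running a check like this in code review keeps incomplete records out of the set without requiring a full schema library.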

Where to source examples

Production traces

Production traces are the best source because they reflect real user wording, ambiguity, and distribution. Start by sampling support questions, search logs, failed conversations, and high-value workflows.

Domain expert cases

Ask subject-matter experts to write edge cases: ambiguous queries, policy exceptions, multi-document questions, and questions where the correct answer is "not enough information." These examples catch the failures that generic synthetic data often misses.

Synthetic bootstrapping

Use synthetic examples to get initial breadth, but review them. Synthetic data often over-represents clean questions and under-represents the messy phrasing users actually type.

How many examples is enough

Dataset size	Use it for	Limitations
50 examples	Fast smoke tests and local development.	Good for catching obvious regressions, weak for category-level analysis.
200+ examples	Pull request checks and threshold gates.	Aggregate scores are reasonably stable, but per-category slices are still small and need careful segmentation.
1000+ examples	Production baselines and per-category reporting.	Requires maintenance discipline and label quality checks.
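One way to see why these sizes differ in usefulness is to estimate the sampling noise on a pass rate. The sketch below assumes each example is scored pass/fail and that examples are independent; it uses the standard normal approximation for a binomial proportion at the worst case of a 50% pass rate. The function name is illustrative.

```python
import math


def margin_of_error(n: int, pass_rate: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% half-width for a pass rate measured on n examples."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


for n in (50, 200, 1000):
    # Printed as percentage points: 13.9, 6.9, and 3.1 respectively.
    print(n, round(margin_of_error(n) * 100, 1))
```

Under these assumptions, a 50-example set cannot distinguish a few points of change from noise, and a category with only 20 examples inside a larger set has a margin of about 22 points, which is why per-category reporting needs far more data overall.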

Version the dataset like application code

Store the dataset in version control and attach the dataset version to every eval run. Otherwise, you will not know whether a score changed because the system improved or because the test set changed.
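In addition to a human-readable version label, a content hash gives every eval run an unambiguous record of which examples it used. This is a minimal sketch, assuming examples are JSON-serializable dictionaries with a unique "id" field; the function name dataset_fingerprint is hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(examples: list) -> str:
    """Hash dataset content so any added, removed, or relabeled example changes the ID."""
    # Sort by id and serialize with sorted keys so ordering does not affect the hash.
    ordered = sorted(examples, key=lambda e: e["id"])
    canonical = json.dumps(ordered, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


examples = [
    {"id": "refund-042", "reference": "Annual plans can be refunded within 30 days of purchase."},
    {"id": "refund-043", "reference": "Refunds are returned to the original payment method."},
]
print(dataset_fingerprint(examples))
```

Storing this value alongside each run's scores makes it possible to tell later whether two runs were scored against identical data.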

When changing examples, use the same discipline as code review:

Require a reason for adding, deleting, or relabeling examples.
Keep old examples unless they are invalid or duplicated.
Track categories so new examples do not skew the dataset silently.
Separate smoke-test subsets from full benchmark sets.
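The category-tracking rule above can be partly automated by comparing category shares between two dataset versions. The sketch below assumes each example carries a metadata "category" field as in the sample record; the 5-point tolerance and the function names are arbitrary illustrations.

```python
from collections import Counter


def category_shares(examples: list) -> dict:
    """Fraction of examples in each metadata category."""
    counts = Counter(e["metadata"]["category"] for e in examples)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def category_drift(old: list, new: list, tolerance: float = 0.05) -> dict:
    """Return categories whose share moved by more than the tolerance."""
    old_shares, new_shares = category_shares(old), category_shares(new)
    drift = {}
    for category in sorted(set(old_shares) | set(new_shares)):
        delta = new_shares.get(category, 0.0) - old_shares.get(category, 0.0)
        if abs(delta) > tolerance:
            drift[category] = round(delta, 3)
    return drift


old = [{"metadata": {"category": c}} for c in ["billing"] * 5 + ["account"] * 5]
new = old + [{"metadata": {"category": "billing"}} for _ in range(10)]
print(category_drift(old, new))  # prints {'account': -0.25, 'billing': 0.25}
```

A report like this in the pull request description gives reviewers a concrete reason to ask why a category grew or shrank.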

Run it in eval.ninja

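The golden dataset supplies the user input and reference; the retrieved context and response come from running each example through your system. Once a record includes the user input, retrieved context, response, and reference fields, it can be sent to eval.ninja from CI or from a scheduled benchmark job.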

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d @golden-dataset-sample.json

Common mistakes

Only labeling expected answers

For RAG, expected source evidence is just as important as the answer. Without source labels, you can evaluate final answer quality but cannot reliably diagnose retrieval regressions.
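With source labels in place, retrieval can be scored separately from generation. The following is a minimal sketch of source recall, assuming the retriever returns document or chunk IDs comparable to the expected_source_ids labels; the function name and the example retrieved IDs are illustrative.

```python
from typing import Optional


def source_recall(expected_ids: list, retrieved_ids: list) -> Optional[float]:
    """Fraction of labeled source IDs that the retriever actually returned."""
    expected = set(expected_ids)
    if not expected:
        # No labeled sources (for example, an unanswerable question): recall is undefined.
        return None
    return len(expected & set(retrieved_ids)) / len(expected)


recall = source_recall(
    ["billing-refunds", "annual-plan-terms"],
    ["billing-refunds", "pricing-faq"],  # hypothetical retriever output
)
print(recall)  # prints 0.5
```

If answer quality drops while source recall holds steady, the regression is more likely in generation than in retrieval.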

Letting easy synthetic questions dominate

If most examples come from synthetic generation, your system may score well while failing real user questions. Record each example's origin in a metadata field such as "source" and report scores by source type.
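Reporting by source type can be as simple as grouping per-example scores. The sketch below assumes each result row has a "source" label and a numeric "score"; both the row shape and the function name are assumptions for illustration.

```python
from collections import defaultdict


def mean_score_by_source(results: list) -> dict:
    """Average score per source type, e.g. production_trace vs synthetic."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row["source"]].append(row["score"])
    return {source: sum(scores) / len(scores) for source, scores in buckets.items()}


results = [
    {"source": "production_trace", "score": 1.0},
    {"source": "production_trace", "score": 0.0},
    {"source": "synthetic", "score": 1.0},
    {"source": "synthetic", "score": 1.0},
]
print(mean_score_by_source(results))  # prints {'production_trace': 0.5, 'synthetic': 1.0}
```

A gap like the one in this toy data, where synthetic examples pass and production traces do not, is the signal an overall average would hide.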

Changing thresholds without preserving baselines

Thresholds should move when the product expectation changes, not when a team wants a failing build to pass. Keep baseline run history and require review for threshold changes.
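Keeping the baseline next to the thresholds lets a gate report both absolute failures and regressions from the last accepted run. This is a rough sketch, assuming scores are stored as metric-name-to-number dictionaries; the metric name "faithfulness", the 0.02 allowed drop, and the function name gate are all hypothetical.

```python
def gate(current: dict, baseline: dict, thresholds: dict, max_drop: float = 0.02) -> list:
    """Return human-readable failures; an empty list means the build may pass."""
    failures = []
    for metric, minimum in thresholds.items():
        score = current[metric]
        if score < minimum:
            failures.append(f"{metric}: {score:.2f} is below threshold {minimum:.2f}")
        elif metric in baseline and baseline[metric] - score > max_drop:
            # Above the threshold, but noticeably worse than the last accepted run.
            drop = baseline[metric] - score
            failures.append(f"{metric}: dropped {drop:.2f} from baseline")
    return failures


thresholds = {"faithfulness": 0.85}
print(gate({"faithfulness": 0.90}, {"faithfulness": 0.91}, thresholds))  # prints []
print(gate({"faithfulness": 0.86}, {"faithfulness": 0.95}, thresholds))
```

Because the thresholds and baseline are plain data, they can live in version control and go through the same review as dataset changes.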