API Documentation - eval.ninja

Authentication

OAuth 2.0 client credentials token

POST /v1/auth/login

Generate a bearer token for the protected evaluation endpoints.

Fields: Organization ID (client_id), API key (client_secret), Scope (scope, optional)

Request body JSON

{
  "grant_type": "client_credentials",
  "client_id": "your-organization-id",
  "client_secret": "your-api-key",
  "scope": "eval:run"
}
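The token request above can be sketched in Python using only the standard library. This is a minimal illustration, not the service's official client: the base URL `https://api.example.com` is a placeholder, the helper names are hypothetical, and the `access_token` response field is an assumption borrowed from standard OAuth 2.0 responses, since this page does not show the login response.

```python
import json
import urllib.request

# Placeholder host (assumption); replace with your actual API target.
BASE_URL = "https://api.example.com"


def build_token_request(org_id, api_key, scope="eval:run", base_url=BASE_URL):
    """Build (but do not send) the POST /v1/auth/login request."""
    body = {
        "grant_type": "client_credentials",
        "client_id": org_id,        # Organization ID
        "client_secret": api_key,   # API key
    }
    if scope:  # scope is optional, so omit it when empty
        body["scope"] = scope
    return urllib.request.Request(
        url=f"{base_url}/v1/auth/login",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the bearer token.

    Assumes the response JSON carries the token under "access_token";
    adjust the key if the real response differs.
    """
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["access_token"]
```

Separating request construction from sending makes the payload easy to inspect before any network call is made.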

Evaluation

Evaluate RAG samples

POST /v1/evaluate

Score one RAG output for faithfulness, answer relevancy, and retrieval quality.

Authentication required. Obtain a bearer token from /v1/auth/login first and send it with this request.

Fields: Question (user_input), Answer (response), Retrieved contexts (retrieved_contexts, one string per context)

Metrics

List the metrics to compute in the metrics array of the request. Some metrics need fields that the sample request leaves empty (e.g. reference).

Request body JSON

{
  "user_input": "How can machine learning models handle class imbalance in fraud detection?",
  "response": "Machine learning models can handle class imbalance in fraud detection through several techniques: SMOTE for synthetic data generation, cost-sensitive learning to penalize misclassification of minority classes, ensemble methods like Random Forest with balanced sampling, and threshold tuning to optimize precision-recall trade-offs. Feature engineering and anomaly detection approaches are also effective for identifying rare fraudulent patterns.",
  "retrieved_contexts": [
    "Class imbalance is a common challenge in fraud detection where fraudulent transactions represent less than 1% of all transactions.",
    "SMOTE (Synthetic Minority Oversampling Technique) generates synthetic examples of minority classes to balance datasets.",
    "Cost-sensitive learning assigns different misclassification costs to different classes, making models more sensitive to minority class errors."
  ],
  "reference": null,
  "metrics": [
    "faithfulness",
    "answer_relevancy",
    "context_precision"
  ]
}
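An evaluation request like the one above could be assembled as follows. This is a sketch under assumptions: the base URL is a placeholder, `build_evaluate_request` is a hypothetical helper, and the token is assumed to travel in a standard `Authorization: Bearer ...` header, which this page implies but does not spell out.

```python
import json
import urllib.request


def build_evaluate_request(token, user_input, response, retrieved_contexts,
                           metrics, reference=None,
                           base_url="https://api.example.com"):
    """Build (but do not send) the POST /v1/evaluate request.

    base_url is a placeholder; the Bearer header format is an assumption.
    """
    body = {
        "user_input": user_input,                        # the question
        "response": response,                            # the generated answer
        "retrieved_contexts": list(retrieved_contexts),  # one string per context
        "reference": reference,                          # None is sent as JSON null
        "metrics": list(metrics),
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/evaluate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Usage: build the request, then send it with urllib.request.urlopen(req).
req = build_evaluate_request(
    token="your-token",
    user_input="How can machine learning models handle class imbalance?",
    response="Techniques include SMOTE and cost-sensitive learning.",
    retrieved_contexts=["SMOTE generates synthetic minority examples."],
    metrics=["faithfulness", "answer_relevancy", "context_precision"],
)
```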

Example response 200

{
  "metrics": [
    {
      "name": "faithfulness",
      "score": 0.95,
      "percentage": 95,
      "interpretation": "The answer is well grounded in the retrieved context.",
      "error": null
    },
    {
      "name": "answer_relevancy",
      "score": 0.88,
      "percentage": 88,
      "interpretation": "The answer addresses the question directly.",
      "error": null
    },
    {
      "name": "context_precision",
      "score": 0.92,
      "percentage": 92,
      "interpretation": "Retrieved passages are mostly pertinent to the query.",
      "error": null
    }
  ],
  "summary": {
    "average_score": 0.92,
    "average_percentage": 91.7,
    "successful_metrics": 3,
    "failed_metrics": 0,
    "generation_quality": 0.92,
    "retrieval_quality": 0.92,
    "interpretation": "Strong overall performance across selected metrics.",
    "org_id": "your-org-id",
    "user_id": "your-user-id"
  },
  "request": {
    "user_input": "How can machine learning models handle class imbalance in fraud detection?",
    "response": "Machine learning models can handle class imbalance...",
    "reference": null,
    "retrieved_contexts": [
      "Class imbalance is a common challenge..."
    ],
    "config": null,
    "metrics": [
      "faithfulness",
      "answer_relevancy",
      "context_precision"
    ]
  }
}
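A response shaped like the example above can be reduced to per-metric scores and a list of failures. The helper below is a hypothetical sketch: it assumes a metric succeeded when its `error` field is null, and it recomputes a plain mean of the successful scores, which appears consistent with the `average_score` in the example but is not confirmed as the service's own formula.

```python
def summarize_evaluation(result):
    """Split a /v1/evaluate response into scores, failures, and a mean score.

    Returns (scores, failures, average) where:
      scores   maps metric name -> score for metrics with error == null
      failures maps metric name -> error value for metrics that failed
      average  is the mean of the successful scores, or None if there are none
    """
    scores = {}
    failures = {}
    for metric in result["metrics"]:
        if metric.get("error") is None:
            scores[metric["name"]] = metric["score"]
        else:
            failures[metric["name"]] = metric["error"]
    average = sum(scores.values()) / len(scores) if scores else None
    return scores, failures, average


# Values copied from the example response above.
example = {
    "metrics": [
        {"name": "faithfulness", "score": 0.95, "error": None},
        {"name": "answer_relevancy", "score": 0.88, "error": None},
        {"name": "context_precision", "score": 0.92, "error": None},
    ]
}
scores, failures, average = summarize_evaluation(example)
```

For the example values, the mean rounds to 0.92 at two decimals and to 91.7 as a percentage, matching the `summary` block shown above.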

Metrics

List available metrics

GET /v1/metrics

Inspect the metrics supported by the configured API target.

Example response 200

{
  "metrics": [
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "answer_correctness",
    "context_relevancy",
    "answer_similarity",
    "harmfulness",
    "coherence",
    "context_entity_recall",
    "rouge_score",
    "bleu_score",
    "exact_match",
    "string_match",
    "semantic_similarity",
    "aspect_critic"
  ]
}
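One possible use of this list is to check a planned `metrics` array before calling /v1/evaluate. The sketch below works on an already-parsed response; `split_supported` is a hypothetical helper, and whether the service rejects or ignores unknown metric names is not stated on this page.

```python
def split_supported(requested, metrics_response):
    """Partition requested metric names by whether /v1/metrics lists them.

    metrics_response is the parsed JSON body of GET /v1/metrics.
    Returns (supported, unsupported), each preserving the requested order.
    """
    available = set(metrics_response["metrics"])
    supported = [name for name in requested if name in available]
    unsupported = [name for name in requested if name not in available]
    return supported, unsupported


# A shortened copy of the example response above.
listing = {"metrics": ["faithfulness", "answer_relevancy", "context_precision"]}
ok, unknown = split_supported(["faithfulness", "not_a_metric"], listing)
```

Here `ok` holds the names that can be sent as-is, and `unknown` holds names worth correcting before the request is made.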

Ragas Cloud Evaluation API
