Production RAG evaluation API

Eval Ninja Cloud Evaluation API

RAG evaluation API with authentication for cloud deployment

Base URL

Choose the API target

The same live requests can target Eval Ninja Cloud or a self-hosted endpoint.

Current target: https://api.eval.ninja
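
Switching between the cloud and a self-hosted target can be sketched as a small helper that resolves the base URL once and joins endpoint paths onto it. This is a minimal sketch: the environment variable name `EVAL_NINJA_BASE_URL` and both function names are hypothetical and not part of the documented API.

```python
import os

# Cloud target shown on this page.
CLOUD_BASE_URL = "https://api.eval.ninja"


def resolve_base_url(env=None):
    """Return the base URL, preferring a self-hosted override when one is set.

    EVAL_NINJA_BASE_URL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("EVAL_NINJA_BASE_URL", CLOUD_BASE_URL).rstrip("/")


def endpoint(base_url, path):
    """Join a base URL and an endpoint path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")
```

For example, `endpoint(resolve_base_url(), "/v1/metrics")` yields the metrics URL for whichever target is configured.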

POST /v1/auth/login

Get an access token

Run this first. Evaluation and metrics requests authenticate with the bearer token returned by this call.

Request JSON
{
  "grant_type": "client_credentials",
  "client_id": "your-organization-id",
  "client_secret": "your-api-key",
  "scope": "eval:run"
}
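
The login call above could be issued from Python roughly as follows, using only the standard library. The payload fields come from the request JSON above. The response field name `access_token` is an assumption, since this page does not show the login response, and the function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.eval.ninja"


def build_login_request(client_id, client_secret, scope="eval:run"):
    """Build (but do not send) the POST /v1/auth/login request."""
    payload = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/auth/login",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_access_token(client_id, client_secret):
    """Send the login request and return the token string."""
    request = build_login_request(client_id, client_secret)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    # Assumption: the token is returned under "access_token".
    return body["access_token"]
```

Keeping request construction separate from the network call makes the payload easy to inspect before any credentials are sent.
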
POST /v1/evaluate

Run an evaluation

Submit one RAG sample and score generation quality, retrieval quality, or both.

Access token required. Get a token with POST /v1/auth/login before running this request.
Metrics

Some metrics require fields that this sample does not include (reference is null here), so they are disabled for it.

Request JSON
{
  "user_input": "How can machine learning models handle class imbalance in fraud detection?",
  "response": "Machine learning models can handle class imbalance in fraud detection through several techniques: SMOTE for synthetic data generation, cost-sensitive learning to penalize misclassification of minority classes, ensemble methods like Random Forest with balanced sampling, and threshold tuning to optimize precision-recall trade-offs. Feature engineering and anomaly detection approaches are also effective for identifying rare fraudulent patterns.",
  "retrieved_contexts": [
    "Class imbalance is a common challenge in fraud detection where fraudulent transactions represent less than 1% of all transactions.",
    "SMOTE (Synthetic Minority Oversampling Technique) generates synthetic examples of minority classes to balance datasets.",
    "Cost-sensitive learning assigns different misclassification costs to different classes, making models more sensitive to minority class errors."
  ],
  "reference": null,
  "metrics": [
    "faithfulness",
    "answer_relevancy",
    "context_precision"
  ]
}
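
A minimal sketch of building this evaluation request in Python is shown below. The body fields mirror the request JSON above. Sending the token as an `Authorization: Bearer ...` header is an assumption based on the page calling it a bearer token, and the function name is hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.eval.ninja"


def build_evaluate_request(token, user_input, response, retrieved_contexts,
                           metrics, reference=None):
    """Build (but do not send) the POST /v1/evaluate request for one RAG sample."""
    payload = {
        "user_input": user_input,
        "response": response,
        "retrieved_contexts": list(retrieved_contexts),
        "reference": reference,  # serialized as JSON null when omitted
        "metrics": list(metrics),
    }
    return urllib.request.Request(
        BASE_URL + "/v1/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: standard bearer-token header.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

The returned request can be passed to `urllib.request.urlopen` and the response body decoded with `json.load`.
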
Response shape 200

Summary: 91.7% average, 3 metrics scored, 0 failed.
faithfulness: 95%
answer_relevancy: 88%
context_precision: 92%
{
  "summary": {
    "average_score": 0.92,
    "successful_metrics": 3,
    "failed_metrics": 0,
    "interpretation": "Strong overall performance across selected metrics."
  },
  "metrics": [
    {
      "name": "faithfulness",
      "score": 0.95
    },
    {
      "name": "answer_relevancy",
      "score": 0.88
    },
    {
      "name": "context_precision",
      "score": 0.92
    }
  ]
}
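
A response of this shape can be post-processed client-side, for instance to recompute the unrounded average and flag weak metrics. The sketch below assumes only the `metrics` array shown above. The 0.9 threshold and the function name are hypothetical choices for illustration.

```python
from statistics import mean


def summarize(result, threshold=0.9):
    """Recompute the average score and list metrics scoring below a threshold."""
    scores = {m["name"]: m["score"] for m in result["metrics"]}
    return {
        "average": mean(list(scores.values())),
        "below_threshold": sorted(
            name for name, score in scores.items() if score < threshold
        ),
    }


sample = {
    "metrics": [
        {"name": "faithfulness", "score": 0.95},
        {"name": "answer_relevancy", "score": 0.88},
        {"name": "context_precision", "score": 0.92},
    ]
}
report = summarize(sample)
```

With the sample scores, the recomputed average rounds to 0.917, which matches the 91.7% shown in the summary, and `answer_relevancy` is the only metric below 0.9.
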
GET /v1/metrics

List available metrics

Inspect the metrics supported by the configured API target.

Example response 200
{
  "metrics": [
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "answer_correctness",
    "context_relevancy",
    "answer_similarity",
    "harmfulness",
    "coherence",
    "context_entity_recall",
    "rouge_score",
    "bleu_score",
    "exact_match",
    "string_match",
    "semantic_similarity",
    "aspect_critic"
  ]
}
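
One use for this list is to check a planned `metrics` array before calling POST /v1/evaluate. The following sketch assumes the response has already been decoded into a dict. The function name is hypothetical.

```python
def unsupported_metrics(requested, metrics_response):
    """Return the requested metric names missing from the GET /v1/metrics response."""
    supported = set(metrics_response["metrics"])
    return [name for name in requested if name not in supported]


# Abbreviated copy of the example response above.
available = {"metrics": ["faithfulness", "answer_relevancy", "context_precision"]}
missing = unsupported_metrics(["faithfulness", "not_a_metric"], available)
```

Here `missing` contains only `"not_a_metric"`, so the request can be corrected before it is sent.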