LLM evaluation at the speed of a single API call
eval.ninja runs faithfulness, relevance, and judge-based scoring with no application-layer overhead. Your latency is the model's latency. Self-host with Docker, or use our managed cloud.
Runs anywhere you run containers
Eval latency is model latency
No queues, no workers, no async pipelines. Your HTTP call hits the judge model and returns. Most eval platforms add an orchestration layer between your code and the model. We don't.
No markup. No middleman.
Self-host with your own LLM provider keys. eval.ninja never sees them. You pay your provider at their posted rate — no token markup, no bundled API fees.
REST, any language
One endpoint, no SDK required. Call from bash, Python, Go, Node, Rust — any HTTP client works.
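As a minimal sketch of what that call can look like from the shell. The endpoint path, header, environment variable, and field names below are illustrative assumptions, not the documented API; check the docs for the real request shape.

```bash
# Illustrative only: endpoint path, env var, and field names are assumptions.
curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "metrics": ["faithfulness", "answer_relevance"],
    "question": "What is the refund window?",
    "answer": "Refunds are accepted within 30 days of purchase.",
    "context": ["Our policy allows returns within 30 days."]
  }'
```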
RAG metrics out of the box
Faithfulness, answer relevance, context precision, context recall. Plus custom LLM-as-a-judge rubrics for any open-ended task.
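A custom rubric could ride along in the same request shape. The `custom_judge` metric name and `rubric` field here are hypothetical placeholders, not documented parameters:

```bash
# Hypothetical custom-rubric request; "custom_judge" and "rubric" are assumed names.
curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "metrics": ["custom_judge"],
    "rubric": "Score 1-5: does the answer cite at least one fact from the context?",
    "question": "What is the refund window?",
    "answer": "Refunds are accepted within 30 days, per the returns page.",
    "context": ["Our policy allows returns within 30 days."]
  }'
```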
History and trends
Every run stored and queryable. Compare prompt versions, detect quality drift, gate deployments on score thresholds.
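Querying stored runs might look something like this; the `/v1/runs` endpoint and its parameters are assumptions for illustration:

```bash
# Hypothetical history endpoint; path and query parameters are assumed.
curl -s "https://api.eval.ninja/v1/runs?metric=faithfulness&since=2025-06-01" \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY"
```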
How It Works
Three steps from zero to your first eval
Deploy
Pull the Docker image and run it anywhere — or skip the install entirely and use the managed cloud.
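A sketch of what that could look like, with an assumed image name, port, and provider-key variable (none verified against the docs). Since self-hosting uses your own provider keys, the key is passed in at startup:

```bash
# Image name, port, and env var below are illustrative assumptions.
docker run -d -p 8080:8080 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  evalninja/server:latest
```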
Call the API
One HTTP call per evaluation. Send the question, the answer, and the retrieved context. Get back metric scores with reasoning.
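Continuing the sketch from above, with the question/answer/context payload saved to `request.json`; the response shape in the comment is an assumption about what scores-with-reasoning could look like, not documented output:

```bash
# request.json holds the question/answer/context payload sketched earlier.
curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json | jq .
# Illustrative output only; the real shape may differ:
# {
#   "faithfulness":     { "score": 0.92, "reasoning": "All claims appear in the context." },
#   "answer_relevance": { "score": 0.88, "reasoning": "Directly answers the question." }
# }
```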
Gate & Iterate
Wire scores into CI to catch regressions on every PR. Track trends over time and ship prompt changes with confidence.
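A minimal CI gate under the same assumed response shape; the 0.8 threshold and the `.faithfulness.score` field path are examples, not defaults:

```bash
# Fail the pipeline if faithfulness drops below 0.8 (threshold is an example).
score=$(curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json | jq -r '.faithfulness.score')

# bash cannot compare floats natively, so delegate the check to awk.
if ! awk -v s="$score" 'BEGIN { if (s + 0 >= 0.8) exit 0; exit 1 }'; then
  echo "faithfulness $score is below 0.8; failing build"
  exit 1
fi
```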
Your network, your keys, your cost
For teams that need eval data and judge calls to stay inside their own infrastructure, or that want to pay their LLM provider directly without markup.
Start calling the API in under 5 minutes
No infrastructure, no provider keys, no setup. Sign up, copy your API key, and start evaluating. The judge model is included in your credits.
Start free. Pay for what you use.
Self-hosting requires a subscription. Managed cloud uses credits — one credit per evaluation call.
Free
Try it out
- ✓ 100 credits included
- ✓ Basic evaluation metrics
- ✓ 1 concurrent evaluation
- ✓ 30-day data retention
- ✓ Community support
Starter
Solo developers
- ✓ 200 credits per month
- ✓ All evaluation metrics
- ✓ 2 concurrent evaluations
- ✓ 60-day data retention
- ✓ Email support
Growth
Best value for teams
- ✓ 1,500 credits per month
- ✓ All evaluation metrics
- ✓ 5 concurrent evaluations
- ✓ 90-day data retention
- ✓ Priority email support
- ✓ Advanced analytics dashboard
- ✓ Full API access
Scale
High-volume teams
- ✓ 3,800 credits per month
- ✓ All evaluation metrics
- ✓ 20 concurrent evaluations
- ✓ 365-day data retention
- ✓ Dedicated support channel
- ✓ Custom integrations
- ✓ Team collaboration tools
How eval.ninja compares
Versus typical SaaS-only eval platforms
| | eval.ninja | Typical SaaS Eval Tools |
|---|---|---|
| Run inside your network | ✓ Docker, anywhere | ✗ SaaS only |
| Bring your own keys | ✓ No markup on tokens | ✗ Bundled, marked up |
| Application-layer overhead | ✓ None | ⚠ Queues, workers |
| Serverless deployment | ✓ Lambda, Cloud Run | ✗ Vendor-hosted only |
| Language-agnostic API | ✓ REST, any language | ⚠ Often Python-first |
| Pricing model | ✓ Free + credits from $2.99 | ⚠ Seat-based, $50+/mo |
Ready to run your first eval?
Stop guessing if your LLM app works.
Self-host with Docker. Or use the managed cloud. Same API either way.
Get 100 Free Credits