LLM evaluation at the speed of a single API call
eval.ninja runs faithfulness, relevance, and judge-based scoring with no application-layer overhead. Your latency is the model's latency. Self-host with Docker, or use our managed cloud.
Runs anywhere you run containers
Eval latency is model latency
No queues, no workers, no async pipelines. Your HTTP call hits the judge model and returns. Most eval platforms add an orchestration layer between your code and the model. We don't.
No markup. No middleman.
Self-host with your own LLM provider keys. eval.ninja never sees them. You pay your provider at their posted rate — no token markup, no bundled API fees.
REST, any language
One endpoint, no SDK required. Call from bash, Python, Go, Node, Rust — any HTTP client works.
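As a minimal sketch of what that call can look like from the shell. The endpoint path, header, environment variable, and field names below are illustrative assumptions, not the documented API; check the docs for the real request shape.

```bash
# Illustrative only: endpoint path, env var, and field names are assumptions.
curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "metrics": ["faithfulness", "answer_relevance"],
    "question": "What is the refund window?",
    "answer": "Refunds are accepted within 30 days of purchase.",
    "context": ["Our policy allows returns within 30 days."]
  }'
```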
RAG metrics out of the box
Faithfulness, answer relevance, context precision, context recall. Plus custom LLM-as-a-judge rubrics for any open-ended task.
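A custom rubric could ride along in the same request shape. The `custom_judge` metric name and `rubric` field here are hypothetical placeholders, not documented parameters:

```bash
# Hypothetical custom-rubric request; "custom_judge" and "rubric" are assumed names.
curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "metrics": ["custom_judge"],
    "rubric": "Score 1-5: does the answer cite at least one fact from the context?",
    "question": "What is the refund window?",
    "answer": "Refunds are accepted within 30 days, per the returns page.",
    "context": ["Our policy allows returns within 30 days."]
  }'
```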
History and trends
Every run stored and queryable. Compare prompt versions, detect quality drift, gate deployments on score thresholds.
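Querying stored runs might look something like this; the `/v1/runs` endpoint and its parameters are assumptions for illustration:

```bash
# Hypothetical history endpoint; path and query parameters are assumed.
curl -s "https://api.eval.ninja/v1/runs?metric=faithfulness&since=2025-06-01" \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY"
```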
How It Works
Three steps from zero to your first eval
Deploy
Pull the Docker image and run it anywhere — or skip the install entirely and use the managed cloud.
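A sketch of what that could look like, with an assumed image name, port, and provider-key variable (none verified against the docs). Since self-hosting uses your own provider keys, the key is passed in at startup:

```bash
# Image name, port, and env var below are illustrative assumptions.
docker run -d -p 8080:8080 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  evalninja/server:latest
```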
Call the API
One HTTP call per evaluation. Send the question, the answer, and the retrieved context. Get back metric scores with reasoning.
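Continuing the sketch from above, with the question/answer/context payload saved to `request.json`; the response shape in the comment is an assumption about what scores-with-reasoning could look like, not documented output:

```bash
# request.json holds the question/answer/context payload sketched earlier.
curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json | jq .
# Illustrative output only; the real shape may differ:
# {
#   "faithfulness":     { "score": 0.92, "reasoning": "All claims appear in the context." },
#   "answer_relevance": { "score": 0.88, "reasoning": "Directly answers the question." }
# }
```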
Gate & Iterate
Wire scores into CI to catch regressions on every PR. Track trends over time and ship prompt changes with confidence.
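A minimal CI gate under the same assumed response shape; the 0.8 threshold and the `.faithfulness.score` field path are examples, not defaults:

```bash
# Fail the pipeline if faithfulness drops below 0.8 (threshold is an example).
score=$(curl -s https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json | jq -r '.faithfulness.score')

# bash cannot compare floats natively, so delegate the check to awk.
if ! awk -v s="$score" 'BEGIN { if (s + 0 >= 0.8) exit 0; exit 1 }'; then
  echo "faithfulness $score is below 0.8; failing build"
  exit 1
fi
```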
Your network, your keys, your cost
For teams that need eval data and judge calls to stay inside their own infrastructure, or that want to pay their LLM provider directly without markup.
Start calling the API in under 5 minutes
No infrastructure, no provider keys, no setup. Sign up, copy your API key, and start evaluating. The judge model is included in your credits.
Start free. Pay for what you use.
Self-hosting requires a subscription. Managed cloud uses credits — one credit per evaluation call.
Free
Try it out
- ✓ 100 credits included
- ✓ Basic evaluation metrics
- ✓ 1 concurrent evaluation
- ✓ 30-day data retention
- ✓ Community support
Starter
Solo developers
- ✓ 200 credits per month
- ✓ All evaluation metrics
- ✓ 2 concurrent evaluations
- ✓ 60-day data retention
- ✓ Email support
Growth
Best value for teams
- ✓ 1,500 credits per month
- ✓ All evaluation metrics
- ✓ 5 concurrent evaluations
- ✓ 90-day data retention
- ✓ Priority email support
- ✓ Advanced analytics dashboard
- ✓ Full API access
Scale
High-volume teams
- ✓ 3,800 credits per month
- ✓ All evaluation metrics
- ✓ 20 concurrent evaluations
- ✓ 365-day data retention
- ✓ Dedicated support channel
- ✓ Custom integrations
- ✓ Team collaboration tools
How eval.ninja compares
Versus typical SaaS-only eval platforms
| | eval.ninja | Typical SaaS Eval Tools |
|---|---|---|
| Run inside your network | ✓ Docker, anywhere | ✗ SaaS only |
| Bring your own keys | ✓ No markup on tokens | ✗ Bundled, marked up |
| Application-layer overhead | ✓ None | ⚠ Queues, workers |
| Serverless deployment | ✓ Lambda, Cloud Run | ✗ Vendor-hosted only |
| Language-agnostic API | ✓ REST, any language | ⚠ Often Python-first |
| Pricing model | ✓ Free + credits from $2.99 | ⚠ Seat-based, $50+/mo |
Ready to run your first eval?
Stop guessing if your LLM app works.
Self-host with Docker. Or use the managed cloud. Same API either way.
Get 100 Free Credits