API · Self-Host · Cloud

LLM evaluation at the speed of a single API call

eval.ninja runs faithfulness, relevance, and judge-based scoring with no application-layer overhead. Your latency is the model's latency. Self-host with Docker, or use our managed cloud.

$ docker pull evalninja/eval:latest
eval.ninja — running evaluation suite...
Faithfulness 0.94
Answer Relevance 0.87
Context Precision 0.91
Context Recall 0.62
3/4 metrics passing · completed in 4.2s

Runs anywhere you run containers

AWS Lambda · ECS / Fargate · Google Cloud Run · Azure Container Apps · Fly.io · Kubernetes · Bare metal
No middleware

Eval latency is
model latency

No queues, no workers, no async pipelines. Your HTTP call hits the judge model and returns. Every other eval platform adds an orchestration layer between your code and the model. We don't.

Others
your app
→ queue
→ worker pool
→ judge model
→ response queue
→ your app
eval.ninja
your app
→ judge model
→ your app
OPENAI_API_KEY=sk-••••••••••••
# stays in your environment, not ours
ANTHROPIC_API_KEY=sk-ant-••••••
# or any provider you choose
# eval.ninja never sees your keys
# you pay your provider directly
Your keys, your cost

No markup.
No middleman.

Self-host with your own LLM provider keys. eval.ninja never sees them. You pay your provider at their posted rate — no token markup, no bundled API fees.

REST, any language

One endpoint, no SDK required. Call from bash, Python, Go, Node, Rust — any HTTP client works.
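As one sketch of what that looks like from Python using only the standard library — the endpoint URL and request fields mirror the curl example on this page, while the `EVAL_NINJA_TOKEN` variable name and the sample inputs are illustrative assumptions:

```python
import json
import os
import urllib.request

# Endpoint and field names follow the curl example on this page.
API_URL = "https://api.eval.ninja/v1/evaluate"

def build_request(user_input, response, contexts, metrics):
    """Assemble one evaluation call as a standard urllib Request."""
    body = json.dumps({
        "user_input": user_input,
        "response": response,
        "retrieved_contexts": contexts,
        "metrics": metrics,
    }).encode("utf-8")
    headers = {
        # EVAL_NINJA_TOKEN is a hypothetical env var name for your API token.
        "Authorization": f"Bearer {os.environ.get('EVAL_NINJA_TOKEN', '')}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")

req = build_request(
    "What is the capital of France?",
    "Paris is the capital of France.",
    ["France's capital city is Paris."],
    ["faithfulness"],
)
# scores = json.loads(urllib.request.urlopen(req).read())  # one call, one result
```

Any language with an HTTP client can build the same request; there is nothing Python-specific in the payload.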

RAG metrics out of the box

Faithfulness, answer relevance, context precision, context recall. Plus custom LLM-as-a-judge rubrics for any open-ended task.

History and trends

Every run stored and queryable. Compare prompt versions, detect quality drift, gate deployments on score thresholds.
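A minimal sketch of the drift check this enables, assuming you have pulled per-sample scores for two stored runs — the run shape, metric name, and tolerance here are illustrative assumptions, not the eval.ninja API:

```python
# Hypothetical drift check between a baseline run and a candidate run.

def mean_score(run, metric):
    """Average one metric across a run's per-sample results."""
    scores = [sample[metric] for sample in run]
    return sum(scores) / len(scores)

def drifted(baseline, candidate, metric="faithfulness", tolerance=0.05):
    """True if the candidate run regressed by more than `tolerance`."""
    return mean_score(baseline, metric) - mean_score(candidate, metric) > tolerance

baseline = [{"faithfulness": 0.94}, {"faithfulness": 0.90}]   # mean 0.92
candidate = [{"faithfulness": 0.81}, {"faithfulness": 0.85}]  # mean 0.83
print(drifted(baseline, candidate))  # 0.09 drop vs 0.05 tolerance → True
```

The same comparison works for gating a prompt change: run both versions, compare means, block the rollout on regression.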

How It Works

Three steps from zero to your first eval

01

Deploy

Pull the Docker image and run it anywhere — or skip the install entirely and use the managed cloud.

# Self-host
$ docker run -p 8080:8080 \
evalninja/eval:latest
# Or use cloud
# app.eval.ninja — no install
02

Call the API

One HTTP call per evaluation. Send the question, the answer, and the retrieved context. Get back metric scores with reasoning.

$ curl -X POST \
https://api.eval.ninja/v1/evaluate \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"user_input":"...",
"response":"...",
"retrieved_contexts":["..."],
"metrics":["faithfulness"]}'
03

Gate & Iterate

Wire scores into CI to catch regressions on every PR. Track trends over time and ship prompt changes with confidence.

# In CI
{"faithfulness": 0.94,
"reasoning": "answer is grounded in context..."}
# ✓ Above 0.85 threshold
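The CI gate above can be a few lines of Python — this sketch assumes the JSON result shape shown in the snippet (a `faithfulness` score plus a `reasoning` string); the 0.85 threshold is the example value from this page:

```python
import json
import sys

THRESHOLD = 0.85  # example threshold from the snippet above

def gate(result_json, threshold=THRESHOLD):
    """Return True when the evaluation result clears the threshold."""
    result = json.loads(result_json)
    return result["faithfulness"] >= threshold

payload = '{"faithfulness": 0.94, "reasoning": "answer is grounded in context..."}'
print(gate(payload))  # 0.94 >= 0.85 → True
# In CI, exit nonzero to fail the PR check:
# sys.exit(0 if gate(payload) else 1)
```

Because the result is plain JSON, the same gate works in any CI system, whichever language your pipeline scripts use.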
Two deployment options  ·  same API  ·  same metrics
self-host

Your network,
your keys, your cost

For teams that need eval data and judge calls to stay inside their own infrastructure, or that want to pay their LLM provider directly without markup.

Data isolation  —  HIPAA, SOC 2, air-gapped networks
No token markup  —  pay your provider at their posted rate
Runs anywhere  —  Lambda, ECS, Cloud Run, Kubernetes, bare metal
$ docker run -p 8080:8080 evalninja/eval:latest
Setup guide →
managed cloud

Start calling the API
in under 5 minutes

No infrastructure, no provider keys, no setup. Sign up, copy the API key, start evaluating. Judge model is included in your credits.

Zero setup  —  no infra to provision or maintain
Judge model included  —  no provider keys needed
Never used for training  —  your eval data stays yours
100 free credits included  ·  no credit card required
Create free account →
CLI tool coming soon — works against either deployment.
Pricing

Start free. Pay for what you use.

Self-hosting requires a subscription. Managed cloud uses credits — one credit per evaluation call.

Free

Try it out

$0 /mo
  • 100 credits included
  • Basic evaluation metrics
  • 1 concurrent evaluation
  • 30 days data retention
  • Community support
Start Free

Starter

Solo developers

$2.99 /mo
  • 200 credits per month
  • All evaluation metrics
  • 2 concurrent evaluations
  • 60 days data retention
  • Email support
Choose Starter
Most Popular

Growth

Best value for teams

$9.99 /mo
  • 1,500 credits per month
  • All evaluation metrics
  • 5 concurrent evaluations
  • 90 days data retention
  • Priority email support
  • Advanced analytics dashboard
  • Full API access
Choose Growth

Scale

High-volume teams

$30 /mo
  • 3,800 credits per month
  • All evaluation metrics
  • 20 concurrent evaluations
  • 365 days data retention
  • Dedicated support channel
  • Custom integrations
  • Team collaboration tools
Choose Scale
Self-hosting. Pull the Docker image and bring your own LLM provider keys. Requires a subscription — contact us for pricing.

How eval.ninja compares

Versus typical SaaS-only eval platforms

                             eval.ninja                   Typical SaaS Eval Tools
Run inside your network      Docker, anywhere             SaaS only
Bring your own keys          No markup on tokens          Bundled, marked up
Application-layer overhead   None                         Queues, workers
Serverless deployment        Lambda, Cloud Run            Vendor-hosted only
Language-agnostic API        REST, any language           Often Python-first
Pricing model                Free + credits from $2.99    Seat-based, $50+/mo

Ready to run your first eval?

Frequently Asked Questions

Stop guessing whether your LLM app works.

Self-host with Docker. Or use the managed cloud. Same API either way.

Get 100 Free Credits
No credit card required.