Self-hosting eval.ninja means your evaluation data never leaves your network. This guide covers every supported deployment target — from a one-line Docker command to a production Kubernetes deployment — with working configuration files for each.
Why self-host
There are three reasons teams choose self-hosted over the managed cloud:
Data residency. Your evaluation data contains your questions, answers, and retrieved context: essentially a representative sample of what your users are asking. Under many compliance regimes (HIPAA, SOC 2 Type II, GDPR for EU-resident data), keeping that data inside infrastructure you control is either required outright or the simplest way to satisfy auditors, and self-hosting is the cleanest path to doing so.
BYOK without trusting external infra. eval.ninja uses BYOK: your LLM provider keys go directly from your environment to your provider. When you self-host, the keys never touch our infrastructure at any point. With our cloud platform, we handle the key securely, but some security policies prohibit any third party from holding API credentials.
Cost predictability at scale. At high evaluation volumes, the cloud platform's per-eval pricing can exceed what you'd pay to run a small VM. A t3.medium on AWS (~$30/month) running the self-hosted image handles thousands of evaluations per day. You pay your LLM provider directly, with no markup.
Architecture overview
The self-hosted eval.ninja image contains:
- API server — the same REST API as cloud, exposed on port 8080
- Metric logic — faithfulness, answer relevance, context precision/recall, context relevance, LLM-as-a-judge
- SQLite database (default) — stores evaluation history and run metadata
The image does not contain a judge model. All LLM calls go to your provider via the keys you configure. This is by design: the eval logic is ours; the model costs are yours, at your provider's rate with no markup.
eval.ninja → your LLM provider (OpenAI, Anthropic, Bedrock, Ollama, ...)

Evaluation data flows only across that link; nothing passes through infrastructure we operate.
Quickstart: one command
docker run -p 8080:8080 \
-e LLM_PROVIDER=openai \
-e OPENAI_API_KEY=sk-... \
  evalninja/eval.ninja:latest

Confirm it's running:
curl http://localhost:8080/v1/health
# {"status":"ok","version":"1.0.0"} Run your first evaluation:
curl -X POST http://localhost:8080/v1/evaluate \
-H "Content-Type: application/json" \
-d '{
"user_input": "What is the return policy?",
"response": "Returns are accepted within 30 days.",
"retrieved_contexts": ["Items may be returned within 30 days of purchase for a full refund."],
"metrics": ["faithfulness"]
}'
# Response shape: { "metrics": [{ "name", "score", "percentage", "interpretation", "error" }], "summary", "request" }
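In CI you usually want a pass/fail exit code rather than raw JSON. One way to gate on the score with jq; the 0.8 threshold is an arbitrary example, not a recommended default:

curl -s -X POST http://localhost:8080/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{"user_input":"What is the return policy?","response":"Returns are accepted within 30 days.","retrieved_contexts":["Items may be returned within 30 days of purchase for a full refund."],"metrics":["faithfulness"]}' \
  | jq -e '.metrics[0].score >= 0.8'
# jq -e exits nonzero when the comparison is false, failing the CI step

Deployment targets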
Docker Compose
For local development or a single-server deployment with persistent history:
# docker-compose.yml
services:
eval-ninja:
image: evalninja/eval.ninja:latest
ports:
- "8080:8080"
environment:
- LLM_PROVIDER=openai
- OPENAI_API_KEY=${OPENAI_API_KEY}
- DATABASE_URL=sqlite:///data/eval.db
volumes:
- eval-data:/app/data
restart: unless-stopped
volumes:
  eval-data:

Bring it up:
docker compose up -d
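If you'd rather start on Postgres than SQLite even on a single node, here is a sketch; the service names and credentials are illustrative, and the DATABASE_URL format matches the configuration reference later in this guide:

# docker-compose.yml (Postgres variant)
services:
  eval-ninja:
    image: evalninja/eval.ninja:latest
    ports:
      - "8080:8080"
    environment:
      - LLM_PROVIDER=openai
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=postgresql://eval:evalpass@db:5432/evaldb
    depends_on:
      - db
    restart: unless-stopped
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=eval
      - POSTGRES_PASSWORD=evalpass
      - POSTGRES_DB=evaldb
    volumes:
      - pg-data:/var/lib/postgresql/data
volumes:
  pg-data:

Kubernetes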
# eval-ninja-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: eval-ninja
spec:
replicas: 2
selector:
matchLabels:
app: eval-ninja
template:
metadata:
labels:
app: eval-ninja
spec:
containers:
- name: eval-ninja
image: evalninja/eval.ninja:latest
ports:
- containerPort: 8080
env:
- name: LLM_PROVIDER
value: openai
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: eval-ninja-secrets
key: openai-api-key
- name: DATABASE_URL
value: postgresql://user:pass@postgres-service/evaldb
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "2Gi"
cpu: "1000m"
---
apiVersion: v1
kind: Service
metadata:
name: eval-ninja-service
spec:
selector:
app: eval-ninja
ports:
- port: 80
targetPort: 8080
  type: ClusterIP
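The Deployment reads the API key from a Secret that must exist before the pods start; one way to create it, with the name and key matching the manifest above:

kubectl create secret generic eval-ninja-secrets \
  --from-literal=openai-api-key=sk-...

The Postgres DSN is inlined in the manifest for brevity; in practice it belongs in the same Secret.

AWS ECS / Fargate
A Fargate task definition that reads the API key from AWS Secrets Manager: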
{
"family": "eval-ninja",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"containerDefinitions": [{
"name": "eval-ninja",
"image": "evalninja/eval.ninja:latest",
"portMappings": [{"containerPort": 8080}],
"environment": [
{"name": "LLM_PROVIDER", "value": "openai"}
],
"secrets": [
{
"name": "OPENAI_API_KEY",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456:secret:eval-ninja/openai-key"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/eval-ninja",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}]
}
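Register the task definition and run it as a service; the cluster name, subnet, and security-group IDs below are placeholders for your own, and the execution role (the default ecsTaskExecutionRole above) needs permission to read the Secrets Manager entry and write to the log group:

aws ecs register-task-definition --cli-input-json file://eval-ninja-task.json
aws ecs create-service --cluster your-cluster --service-name eval-ninja \
  --task-definition eval-ninja --desired-count 1 --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}'

Google Cloud Run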
gcloud run deploy eval-ninja \
--image evalninja/eval.ninja:latest \
--platform managed \
--region us-central1 \
--port 8080 \
--memory 1Gi \
--set-env-vars LLM_PROVIDER=openai \
--set-secrets OPENAI_API_KEY=eval-ninja-openai-key:latest \
--allow-unauthenticated

The --allow-unauthenticated flag makes the endpoint publicly reachable; omit it and grant the Cloud Run invoker role to specific callers if the service should stay private.

Fly.io
# fly.toml
app = "eval-ninja-yourname"
primary_region = "iad"
[build]
image = "evalninja/eval.ninja:latest"
[http_service]
internal_port = 8080
force_https = true
[env]
LLM_PROVIDER = "openai"
[[mounts]]
source = "eval_data"
destination = "/app/data" fly secrets set OPENAI_API_KEY=sk-...
fly deploy

Plain VM with systemd
# /etc/systemd/system/eval-ninja.service
[Unit]
Description=eval.ninja evaluation server
After=docker.service
Requires=docker.service
[Service]
Restart=always
ExecStartPre=-/usr/bin/docker stop eval-ninja
ExecStartPre=-/usr/bin/docker rm eval-ninja
ExecStart=/usr/bin/docker run --rm \
--name eval-ninja \
-p 8080:8080 \
-v /var/eval-ninja/data:/app/data \
-e LLM_PROVIDER=openai \
-e OPENAI_API_KEY=sk-... \
evalninja/eval.ninja:latest
ExecStop=/usr/bin/docker stop eval-ninja
[Install]
WantedBy=multi-user.target

Enable and start it:
sudo systemctl enable eval-ninja
sudo systemctl start eval-ninja
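As written, the unit file embeds the API key where anyone who can read /etc/systemd/system can see it. A common alternative is a root-only environment file; a sketch, with illustrative paths:

# /etc/eval-ninja.env  (owned by root, chmod 600)
OPENAI_API_KEY=sk-...

# In [Service], load the file and let systemd expand the variable:
EnvironmentFile=/etc/eval-ninja.env
ExecStart=/usr/bin/docker run --rm \
  --name eval-ninja \
  -p 8080:8080 \
  -v /var/eval-ninja/data:/app/data \
  -e LLM_PROVIDER=openai \
  -e OPENAI_API_KEY=${OPENAI_API_KEY} \
  evalninja/eval.ninja:latest

Configuration reference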
BYOK setup per provider
# OpenAI
-e LLM_PROVIDER=openai -e OPENAI_API_KEY=sk-...
# Anthropic
-e LLM_PROVIDER=anthropic -e ANTHROPIC_API_KEY=sk-ant-...
# AWS Bedrock (uses instance role or explicit keys)
-e LLM_PROVIDER=bedrock \
-e AWS_ACCESS_KEY_ID=... \
-e AWS_SECRET_ACCESS_KEY=... \
-e AWS_REGION=us-east-1 \
-e JUDGE_MODEL=anthropic.claude-3-haiku-20240307-v1:0
# Ollama (local, no external calls)
-e LLM_PROVIDER=ollama \
-e OLLAMA_BASE_URL=http://ollama:11434 \
-e JUDGE_MODEL=llama3.1:70b

SQLite vs Postgres
SQLite (default) is sufficient for single-node deployments handling up to ~10k eval runs/day. Mount a persistent volume so the file survives container restarts. Backup is a file copy.
Postgres is required for multi-node deployments (Kubernetes with multiple replicas, horizontal scaling). Set DATABASE_URL=postgresql://user:pass@host:5432/evaldb. Schema migrations run automatically on startup.
Serverless deployment
eval.ninja is a natural fit for serverless because evaluations are stateless — each request is independent. Serverless is especially effective for CI/CD use cases where you have bursty, unpredictable workloads.
Cold start consideration: The Docker image is ~500MB. Cold starts on Lambda or Cloud Run are typically 3–8 seconds. This is acceptable for CI pipelines (where latency doesn't affect users) but not for real-time production judging. Use a minimum instance count of 1 if you need sub-second first-request latency.
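One wrinkle: Lambda only pulls container images from Amazon ECR, so the image must live in a repository in your account. A sketch of the push, reusing the illustrative 123456 account ID from the ECS example:

aws ecr create-repository --repository-name eval-ninja-lambda
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456.dkr.ecr.us-east-1.amazonaws.com
docker pull evalninja/eval.ninja-lambda:latest
docker tag evalninja/eval.ninja-lambda:latest 123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest
docker push 123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest

With the image in your registry, create the function: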
aws lambda create-function \
--function-name eval-ninja \
--package-type Image \
--code ImageUri=123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest \
--role arn:aws:iam::123456:role/eval-ninja-lambda \
--memory-size 1024 \
--timeout 60 \
--environment Variables='{LLM_PROVIDER=openai,OPENAI_API_KEY=sk-...}'

When serverless makes sense:
- CI/CD pipelines that only evaluate on PR (spiky, not continuous)
- Development environments where the instance can scale to zero
- Multi-team setups where each team wants their own isolated endpoint without maintaining servers
When to use a persistent server instead:
- Production sampling (1–5% of traffic) where cold starts would add unacceptable tail latency
- High-volume batch evaluations where cold start overhead amortizes poorly
- When you need persistent SQLite history (serverless requires external Postgres for persistent storage)
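On Cloud Run, the minimum instance count mentioned above is one flag on an existing service:
gcloud run services update eval-ninja --region us-central1 --min-instances 1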
Upgrade and backup
# Docker: pull the new image, then recreate the container
# (docker restart alone reuses the old image)
docker pull evalninja/eval.ninja:latest
docker stop eval-ninja && docker rm eval-ninja
# rerun your original docker run command; it will start from the newly pulled image
# Docker Compose
docker compose pull
docker compose up -d
# Kubernetes: update image tag and rollout
kubectl set image deployment/eval-ninja eval-ninja=evalninja/eval.ninja:1.2.0
kubectl rollout status deployment/eval-ninja

Backup is your SQLite file (if using the default) or your Postgres database. The SQLite file lives at the path you configured with DATABASE_URL, or at the default /app/data/eval.db inside the container. To back it up:
# Backup SQLite from running container
docker exec eval-ninja sqlite3 /app/data/eval.db ".backup '/tmp/backup.db'"
docker cp eval-ninja:/tmp/backup.db ./eval-backup-$(date +%Y%m%d).db
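If you're on Postgres, the equivalent is a logical dump; the connection string is whatever you set in DATABASE_URL:
# Backup Postgres
pg_dump postgresql://user:pass@host:5432/evaldb > eval-backup-$(date +%Y%m%d).sql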