Self-hosting eval.ninja means your evaluation data never leaves your network. This guide covers every supported deployment target — from a one-line Docker command to a production Kubernetes deployment — with working configuration files for each.

Why self-host

There are three reasons teams choose self-hosted over the managed cloud:

Data residency. Your evaluation data contains your questions, answers, and retrieved context — essentially a representative sample of what your users are asking. Many compliance frameworks (HIPAA, SOC 2 Type II, GDPR for EU-resident data) require this data to stay within your infrastructure. Self-hosting is the cleanest path to satisfying those requirements.

BYOK without trusting external infra. eval.ninja uses BYOK — your LLM provider keys go directly from your environment to your provider. With the self-hosted image, the keys never touch our infrastructure at any point. With our cloud platform, we handle the key securely, but some security policies prohibit any third party from holding API credentials.

Cost predictability at scale. At high evaluation volumes, the cloud platform's per-eval pricing can exceed what you'd pay to run a small VM. A t3.medium on AWS (~$30/month) running the self-hosted image handles thousands of evaluations per day. You pay your LLM provider directly, with no markup.


Architecture overview

The self-hosted eval.ninja image contains:

  • API server — the same REST API as cloud, exposed on port 8080
  • Metric logic — faithfulness, answer relevance, context precision/recall, context relevance, LLM-as-a-judge
  • SQLite database (default) — stores evaluation history, run metadata

The image does not contain a judge model. All LLM calls go to your provider via the keys you configure. This is by design: the eval logic is ours; the model costs are yours, at your provider's rate with no markup.

Key boundary

eval.ninja → your LLM provider (OpenAI, Anthropic, Bedrock, Ollama...)  |  Data never flows outside this boundary.


Quickstart: one command

docker run -p 8080:8080 \
  -e LLM_PROVIDER=openai \
  -e OPENAI_API_KEY=sk-... \
  evalninja/eval.ninja:latest

Confirm it's running:

curl http://localhost:8080/v1/health
# {"status":"ok","version":"1.0.0"}

Run your first evaluation:

curl -X POST http://localhost:8080/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Returns are accepted within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase for a full refund."],
    "metrics": ["faithfulness"]
  }'
# Response shape: { "metrics": [{ "name", "score", "percentage", "interpretation", "error" }], "summary", "request" }
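
In CI you usually want to turn that score into a pass/fail signal. A minimal sketch with jq (assuming jq is installed; the 0.8 threshold is an illustrative choice, not a recommendation. jq -e sets the exit status from the comparison, which is what fails a CI step):

# Fail the step if faithfulness drops below 0.8
curl -s -X POST http://localhost:8080/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Returns are accepted within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase for a full refund."],
    "metrics": ["faithfulness"]
  }' | jq -e '.metrics[0].score >= 0.8'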

Deployment targets

Docker Compose

For local development or a single-server deployment with persistent history:

# docker-compose.yml
version: '3.8'
services:
  eval-ninja:
    image: evalninja/eval.ninja:latest
    ports:
      - "8080:8080"
    environment:
      - LLM_PROVIDER=openai
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=sqlite:///data/eval.db
    volumes:
      - eval-data:/app/data
    restart: unless-stopped

volumes:
  eval-data:

docker compose up -d

Kubernetes

# eval-ninja-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eval-ninja
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eval-ninja
  template:
    metadata:
      labels:
        app: eval-ninja
    spec:
      containers:
        - name: eval-ninja
          image: evalninja/eval.ninja:latest
          ports:
            - containerPort: 8080
          env:
            - name: LLM_PROVIDER
              value: openai
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: eval-ninja-secrets
                  key: openai-api-key
            - name: DATABASE_URL
              value: postgresql://user:pass@postgres-service/evaldb
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
---
apiVersion: v1
kind: Service
metadata:
  name: eval-ninja-service
spec:
  selector:
    app: eval-ninja
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
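
The Deployment reads the OpenAI key from a Secret named eval-ninja-secrets, so create it before applying the manifests (the literal key name must match the secretKeyRef above). Note that replicas: 2 is also why DATABASE_URL points at Postgres rather than SQLite; see SQLite vs Postgres below.

kubectl create secret generic eval-ninja-secrets \
  --from-literal=openai-api-key=sk-...

kubectl apply -f eval-ninja-deployment.yaml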

AWS ECS / Fargate

{
  "family": "eval-ninja",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [{
    "name": "eval-ninja",
    "image": "evalninja/eval.ninja:latest",
    "portMappings": [{"containerPort": 8080}],
    "environment": [
      {"name": "LLM_PROVIDER", "value": "openai"}
    ],
    "secrets": [
      {
        "name": "OPENAI_API_KEY",
        "valueFrom": "arn:aws:secretsmanager:us-east-1:123456:secret:eval-ninja/openai-key"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/eval-ninja",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]
}
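
To launch this, register the task definition and create a Fargate service. The cluster, subnet, and security group values below are placeholders for your own, and the executionRoleArn added to the task definition must be a role allowed to read the Secrets Manager entry and write to CloudWatch Logs:

aws ecs register-task-definition --cli-input-json file://eval-ninja-task.json

aws ecs create-service \
  --cluster your-cluster \
  --service-name eval-ninja \
  --task-definition eval-ninja \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}"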

Google Cloud Run

gcloud run deploy eval-ninja \
  --image evalninja/eval.ninja:latest \
  --platform managed \
  --region us-central1 \
  --port 8080 \
  --memory 1Gi \
  --set-env-vars LLM_PROVIDER=openai \
  --set-secrets OPENAI_API_KEY=eval-ninja-openai-key:latest \
  --allow-unauthenticated
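
The --set-secrets flag expects the secret to already exist in Secret Manager, so create it first and grant the service's runtime service account access (the secret name must match the flag above; YOUR_RUNTIME_SA is a placeholder):

echo -n "sk-..." | gcloud secrets create eval-ninja-openai-key --data-file=-
gcloud secrets add-iam-policy-binding eval-ninja-openai-key \
  --member serviceAccount:YOUR_RUNTIME_SA \
  --role roles/secretmanager.secretAccessor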

Fly.io

# fly.toml
app = "eval-ninja-yourname"
primary_region = "iad"

[build]
  image = "evalninja/eval.ninja:latest"

[http_service]
  internal_port = 8080
  force_https = true

[env]
  LLM_PROVIDER = "openai"

[[mounts]]
  source = "eval_data"
  destination = "/app/data"

fly secrets set OPENAI_API_KEY=sk-...
fly deploy

Plain VM with systemd

# /etc/systemd/system/eval-ninja.service
[Unit]
Description=eval.ninja evaluation server
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker stop eval-ninja
ExecStartPre=-/usr/bin/docker rm eval-ninja
ExecStart=/usr/bin/docker run --rm \
  --name eval-ninja \
  -p 8080:8080 \
  -v /var/eval-ninja/data:/app/data \
  -e LLM_PROVIDER=openai \
  -e OPENAI_API_KEY=sk-... \
  evalninja/eval.ninja:latest
ExecStop=/usr/bin/docker stop eval-ninja

[Install]
WantedBy=multi-user.target

sudo systemctl enable eval-ninja
sudo systemctl start eval-ninja
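
Verify the unit and tail the container logs through systemd:

sudo systemctl status eval-ninja
journalctl -u eval-ninja -f
curl http://localhost:8080/v1/health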

Configuration reference

Variable                     Default                 Description
LLM_PROVIDER                 openai                  LLM provider: openai, anthropic, bedrock, ollama
OPENAI_API_KEY               —                       Required if LLM_PROVIDER=openai
ANTHROPIC_API_KEY            —                       Required if LLM_PROVIDER=anthropic
JUDGE_MODEL                  gpt-4o-mini             Model to use for LLM-judged metrics
DATABASE_URL                 sqlite:///data/eval.db  SQLite path or Postgres connection string
PORT                         8080                    HTTP port to listen on
LOG_LEVEL                    info                    debug, info, warn, error
LOG_FORMAT                   json                    json for structured logs, text for development
OTEL_EXPORTER_OTLP_ENDPOINT  —                       OpenTelemetry collector endpoint for tracing
MAX_CONCURRENT_EVALS         10                      Max parallel LLM calls per request
RATE_LIMIT_RPM               60                      Requests per minute to your LLM provider

BYOK setup per provider

# OpenAI
-e LLM_PROVIDER=openai -e OPENAI_API_KEY=sk-...

# Anthropic
-e LLM_PROVIDER=anthropic -e ANTHROPIC_API_KEY=sk-ant-...

# AWS Bedrock (uses instance role or explicit keys)
-e LLM_PROVIDER=bedrock \
-e AWS_ACCESS_KEY_ID=... \
-e AWS_SECRET_ACCESS_KEY=... \
-e AWS_REGION=us-east-1 \
-e JUDGE_MODEL=anthropic.claude-3-haiku-20240307-v1:0

# Ollama (local, no external calls)
-e LLM_PROVIDER=ollama \
-e OLLAMA_BASE_URL=http://ollama:11434 \
-e JUDGE_MODEL=llama3.1:70b
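
Ollama only serves models it has already pulled, so fetch the judge model on the Ollama host first. A 70B model needs substantial memory; a smaller tag such as llama3.1:8b is a reasonable starting point on constrained hardware:

ollama pull llama3.1:70b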

SQLite vs Postgres

SQLite (default) is sufficient for single-node deployments handling up to ~10k eval runs/day. Mount a persistent volume so the file survives container restarts. Backup is a file copy.

Postgres is required for multi-node deployments (Kubernetes with multiple replicas, horizontal scaling). Set DATABASE_URL=postgresql://user:pass@host:5432/evaldb. Schema migrations run automatically on startup.


Serverless deployment

eval.ninja is a natural fit for serverless because evaluations are stateless — each request is independent. Serverless is especially effective for CI/CD use cases where you have bursty, unpredictable workloads.

Cold start consideration: The Docker image is ~500MB. Cold starts on Lambda or Cloud Run are typically 3–8 seconds. This is acceptable for CI pipelines (where latency doesn't affect users) but not for real-time production judging. Use a minimum instance count of 1 if you need sub-second first-request latency.
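
On Cloud Run, that warm-instance floor is a single flag on the service deployed earlier (minimum instances are billed even while idle):

gcloud run services update eval-ninja --min-instances 1 --region us-central1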

# Lambda container images must live in your own ECR registry, so mirror
# the image first (the account ID and region are placeholders):
aws ecr create-repository --repository-name eval-ninja-lambda
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456.dkr.ecr.us-east-1.amazonaws.com
docker pull evalninja/eval.ninja-lambda:latest
docker tag evalninja/eval.ninja-lambda:latest 123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest
docker push 123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest

aws lambda create-function \
  --function-name eval-ninja \
  --package-type Image \
  --code ImageUri=123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest \
  --role arn:aws:iam::123456:role/eval-ninja-lambda \
  --memory-size 1024 \
  --timeout 60 \
  --environment Variables='{LLM_PROVIDER=openai,OPENAI_API_KEY=sk-...}'
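
The function also needs an HTTP entry point before curl can reach it; a Lambda function URL is the lightest option (auth-type NONE would make the endpoint public, so prefer AWS_IAM or an API Gateway in front):

aws lambda create-function-url-config \
  --function-name eval-ninja \
  --auth-type AWS_IAM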

When serverless makes sense:

  • CI/CD pipelines that only evaluate on PR (spiky, not continuous)
  • Development environments where the instance can scale to zero
  • Multi-team setups where each team wants their own isolated endpoint without maintaining servers

When to use a persistent server instead:

  • Production sampling (1–5% of traffic) where cold starts would add unacceptable tail latency
  • High-volume batch evaluations where cold start overhead amortizes poorly
  • When you need persistent SQLite history (serverless requires external Postgres for persistent storage)

Upgrade and backup

# Docker: pull the new image, then recreate the container
# (docker restart alone would reuse the old image)
docker pull evalninja/eval.ninja:latest
docker stop eval-ninja && docker rm eval-ninja
docker run -d --name eval-ninja ...   # same flags as your original docker run

# Docker Compose
docker compose pull
docker compose up -d

# Kubernetes: update image tag and rollout
kubectl set image deployment/eval-ninja eval-ninja=evalninja/eval.ninja:1.2.0
kubectl rollout status deployment/eval-ninja

Backup is your SQLite file (if using the default) or your Postgres database. The SQLite file lives at the path you configured with DATABASE_URL, or at the default /app/data/eval.db inside the container. To back up:

# Backup SQLite from running container
docker exec eval-ninja sqlite3 /app/data/eval.db ".backup '/tmp/backup.db'"
docker cp eval-ninja:/tmp/backup.db ./eval-backup-$(date +%Y%m%d).db
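
To make this a nightly job, a host crontab entry is enough (02:00 daily; note that % signs must be escaped in crontab, and /var/backups is an arbitrary destination):

0 2 * * * docker exec eval-ninja sqlite3 /app/data/eval.db ".backup '/tmp/backup.db'" && docker cp eval-ninja:/tmp/backup.db /var/backups/eval-$(date +\%Y\%m\%d).db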

Parity with managed cloud

                     Self-hosted                    Managed cloud
REST API             ✓ Identical                    ✓ Identical
All metrics          ✓ Full suite                   ✓ Full suite
Eval history         ✓ SQLite / Postgres            ✓ Managed DB
Judge models         ⚠ BYOK required                ✓ Managed, no key needed
Dashboard UI         ⚠ API only (UI roadmap)        ✓ Full dashboard
Team collaboration   ⚠ Bring your own auth          ✓ Built-in
Data residency       ✓ Your network only            ⚠ Processed on our infra
Cost                 Subscription + compute + LLM   Credit-based

Frequently asked questions

Does self-hosting require a paid plan?
Self-hosting requires a subscription. Contact us at founders@eval.ninja for pricing. You also pay for your own compute (a t3.small is sufficient for most teams) and your LLM provider API calls. There is no per-eval fee for self-hosted deployments beyond the subscription.

What are the system requirements?
Minimum: Docker 20.10+, 2GB RAM, 1 vCPU, outbound HTTPS to your LLM provider. Recommended for production: 4GB RAM, 2 vCPU, persistent volume for SQLite (or external Postgres for multi-node). The Docker image is approximately 500MB.

Do I still need an LLM API key?
Yes, for any metrics that involve an LLM judge (faithfulness, answer relevance, context relevance, LLM-as-a-judge). eval.ninja uses BYOK — your key, your provider, no markup. If you need fully local evaluation with no external API calls, configure eval.ninja to use Ollama with a local model.

How do I upgrade?
Pull the new image and recreate the container. Schema migrations run automatically on startup. Your evaluation history is preserved in the SQLite file or Postgres database.

How do I get support?
Open an issue on GitHub or email founders@eval.ninja. We respond within one business day.