Self-hosting eval.ninja means your evaluation data never leaves your network. This guide covers every supported deployment target — from a one-line Docker command to a production Kubernetes deployment — with working configuration files for each.

Why self-host

There are three reasons teams choose self-hosted over the managed cloud:

Data residency. Your evaluation data contains your questions, answers, and retrieved context — essentially a representative sample of what your users are asking. Many compliance frameworks (HIPAA, SOC 2 Type II, GDPR for EU-resident data) require this data to stay within your infrastructure. Self-hosting is the cleanest path to satisfying those requirements.

BYOK without trusting external infra. eval.ninja uses BYOK — your LLM provider keys go directly from your environment to your provider. With the self-hosted image, the keys never touch our infrastructure at any point. With our cloud platform, we handle the key securely, but some security policies prohibit any third party from holding API credentials.

Cost predictability at scale. At high evaluation volumes, the cloud platform's per-eval pricing can exceed what you'd pay to run a small VM. A t3.medium on AWS (~$30/month) running the self-hosted image handles thousands of evaluations per day. You pay your LLM provider directly, with no markup.


Architecture overview

The self-hosted eval.ninja image contains:

  • API server — the same REST API as cloud, exposed on port 8080
  • Metric logic — faithfulness, answer relevance, context precision/recall, context relevance, LLM-as-a-judge
  • SQLite database (default) — stores evaluation history, run metadata

The image does not contain a judge model. All LLM calls go to your provider via the keys you configure. This is by design: the eval logic is ours; the model costs are yours, at your provider's rate with no markup.

Key boundary

eval.ninja → your LLM provider (OpenAI, Anthropic, Bedrock, Ollama...)  |  Data never flows outside this boundary.


Quickstart: one command

docker run -p 8080:8080 \
  -e LLM_PROVIDER=openai \
  -e OPENAI_API_KEY=sk-... \
  evalninja/eval.ninja:latest

Confirm it's running:

curl http://localhost:8080/v1/health
# {"status":"ok","version":"1.0.0"}

Run your first evaluation:

curl -X POST http://localhost:8080/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Returns are accepted within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase for a full refund."],
    "metrics": ["faithfulness"]
  }'
# Response shape: { "metrics": [{ "name", "score", "percentage", "interpretation", "error" }], "summary", "request" }
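
In CI you usually want to turn that score into a pass/fail signal. A minimal sketch with jq (assuming jq is installed; the 0.8 threshold is an illustrative choice, not a recommendation. jq -e sets the exit status from the comparison, which is what fails a CI step):

# Fail the step if faithfulness drops below 0.8
curl -s -X POST http://localhost:8080/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Returns are accepted within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase for a full refund."],
    "metrics": ["faithfulness"]
  }' | jq -e '.metrics[0].score >= 0.8'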

Deployment targets

Docker Compose

For local development or a single-server deployment with persistent history:

# docker-compose.yml
version: '3.8'
services:
  eval-ninja:
    image: evalninja/eval.ninja:latest
    ports:
      - "8080:8080"
    environment:
      - LLM_PROVIDER=openai
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=sqlite:///data/eval.db
    volumes:
      - eval-data:/app/data
    restart: unless-stopped

volumes:
  eval-data:

docker compose up -d

Kubernetes

# eval-ninja-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eval-ninja
spec:
  replicas: 2
  selector:
    matchLabels:
      app: eval-ninja
  template:
    metadata:
      labels:
        app: eval-ninja
    spec:
      containers:
        - name: eval-ninja
          image: evalninja/eval.ninja:latest
          ports:
            - containerPort: 8080
          env:
            - name: LLM_PROVIDER
              value: openai
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: eval-ninja-secrets
                  key: openai-api-key
            - name: DATABASE_URL
              value: postgresql://user:pass@postgres-service/evaldb
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
---
apiVersion: v1
kind: Service
metadata:
  name: eval-ninja-service
spec:
  selector:
    app: eval-ninja
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
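
The Deployment reads the OpenAI key from a Secret named eval-ninja-secrets, so create it before applying the manifests (the literal key name must match the secretKeyRef above). Note that replicas: 2 is also why DATABASE_URL points at Postgres rather than SQLite; see SQLite vs Postgres below.

kubectl create secret generic eval-ninja-secrets \
  --from-literal=openai-api-key=sk-...

kubectl apply -f eval-ninja-deployment.yaml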

AWS ECS / Fargate

{
  "family": "eval-ninja",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [{
    "name": "eval-ninja",
    "image": "evalninja/eval.ninja:latest",
    "portMappings": [{"containerPort": 8080}],
    "environment": [
      {"name": "LLM_PROVIDER", "value": "openai"}
    ],
    "secrets": [
      {
        "name": "OPENAI_API_KEY",
        "valueFrom": "arn:aws:secretsmanager:us-east-1:123456:secret:eval-ninja/openai-key"
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/eval-ninja",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]
}
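
To launch this, register the task definition and create a Fargate service. The cluster, subnet, and security group values below are placeholders for your own, and the executionRoleArn added to the task definition must be a role allowed to read the Secrets Manager entry and write to CloudWatch Logs:

aws ecs register-task-definition --cli-input-json file://eval-ninja-task.json

aws ecs create-service \
  --cluster your-cluster \
  --service-name eval-ninja \
  --task-definition eval-ninja \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}"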

Google Cloud Run

gcloud run deploy eval-ninja \
  --image evalninja/eval.ninja:latest \
  --platform managed \
  --region us-central1 \
  --port 8080 \
  --memory 1Gi \
  --set-env-vars LLM_PROVIDER=openai \
  --set-secrets OPENAI_API_KEY=eval-ninja-openai-key:latest \
  --allow-unauthenticated
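
The --set-secrets flag expects the secret to already exist in Secret Manager, so create it first and grant the service's runtime service account access (the secret name must match the flag above; YOUR_RUNTIME_SA is a placeholder):

echo -n "sk-..." | gcloud secrets create eval-ninja-openai-key --data-file=-
gcloud secrets add-iam-policy-binding eval-ninja-openai-key \
  --member serviceAccount:YOUR_RUNTIME_SA \
  --role roles/secretmanager.secretAccessor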

Fly.io

# fly.toml
app = "eval-ninja-yourname"
primary_region = "iad"

[build]
  image = "evalninja/eval.ninja:latest"

[http_service]
  internal_port = 8080
  force_https = true

[env]
  LLM_PROVIDER = "openai"

[[mounts]]
  source = "eval_data"
  destination = "/app/data"

fly secrets set OPENAI_API_KEY=sk-...
fly deploy

Plain VM with systemd

# /etc/systemd/system/eval-ninja.service
[Unit]
Description=eval.ninja evaluation server
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker stop eval-ninja
ExecStartPre=-/usr/bin/docker rm eval-ninja
ExecStart=/usr/bin/docker run --rm \
  --name eval-ninja \
  -p 8080:8080 \
  -v /var/eval-ninja/data:/app/data \
  -e LLM_PROVIDER=openai \
  -e OPENAI_API_KEY=sk-... \
  evalninja/eval.ninja:latest
ExecStop=/usr/bin/docker stop eval-ninja

[Install]
WantedBy=multi-user.target

sudo systemctl enable eval-ninja
sudo systemctl start eval-ninja
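
Verify the unit and tail the container logs through systemd:

sudo systemctl status eval-ninja
journalctl -u eval-ninja -f
curl http://localhost:8080/v1/health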

Configuration reference

Variable                     Default                 Description
LLM_PROVIDER                 openai                  LLM provider: openai, anthropic, bedrock, ollama
OPENAI_API_KEY               —                       Required if LLM_PROVIDER=openai
ANTHROPIC_API_KEY            —                       Required if LLM_PROVIDER=anthropic
JUDGE_MODEL                  gpt-4o-mini             Model to use for LLM-judged metrics
DATABASE_URL                 sqlite:///data/eval.db  SQLite path or Postgres connection string
PORT                         8080                    HTTP port to listen on
LOG_LEVEL                    info                    debug, info, warn, error
LOG_FORMAT                   json                    json for structured logs, text for development
OTEL_EXPORTER_OTLP_ENDPOINT  —                       OpenTelemetry collector endpoint for tracing
MAX_CONCURRENT_EVALS         10                      Max parallel LLM calls per request
RATE_LIMIT_RPM               60                      Requests per minute to your LLM provider

BYOK setup per provider

# OpenAI
-e LLM_PROVIDER=openai -e OPENAI_API_KEY=sk-...

# Anthropic
-e LLM_PROVIDER=anthropic -e ANTHROPIC_API_KEY=sk-ant-...

# AWS Bedrock (uses instance role or explicit keys)
-e LLM_PROVIDER=bedrock \
-e AWS_ACCESS_KEY_ID=... \
-e AWS_SECRET_ACCESS_KEY=... \
-e AWS_REGION=us-east-1 \
-e JUDGE_MODEL=anthropic.claude-3-haiku-20240307-v1:0

# Ollama (local, no external calls)
-e LLM_PROVIDER=ollama \
-e OLLAMA_BASE_URL=http://ollama:11434 \
-e JUDGE_MODEL=llama3.1:70b
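
Ollama only serves models it has already pulled, so fetch the judge model on the Ollama host first. A 70B model needs substantial memory; a smaller tag such as llama3.1:8b is a reasonable starting point on constrained hardware:

ollama pull llama3.1:70b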

SQLite vs Postgres

SQLite (default) is sufficient for single-node deployments handling up to ~10k eval runs/day. Mount a persistent volume so the file survives container restarts. Backup is a file copy.

Postgres is required for multi-node deployments (Kubernetes with multiple replicas, horizontal scaling). Set DATABASE_URL=postgresql://user:pass@host:5432/evaldb. Schema migrations run automatically on startup.


Serverless deployment

eval.ninja is a natural fit for serverless because evaluations are stateless — each request is independent. Serverless is especially effective for CI/CD use cases where you have bursty, unpredictable workloads.

Cold start consideration: The Docker image is ~500MB. Cold starts on Lambda or Cloud Run are typically 3–8 seconds. This is acceptable for CI pipelines (where latency doesn't affect users) but not for real-time production judging. Use a minimum instance count of 1 if you need sub-second first-request latency.
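
On Cloud Run, that warm-instance floor is a single flag on the service deployed earlier (minimum instances are billed even while idle):

gcloud run services update eval-ninja --min-instances 1 --region us-central1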

# Lambda container images must live in your own ECR registry, so mirror
# the image first (the account ID and region are placeholders):
aws ecr create-repository --repository-name eval-ninja-lambda
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456.dkr.ecr.us-east-1.amazonaws.com
docker pull evalninja/eval.ninja-lambda:latest
docker tag evalninja/eval.ninja-lambda:latest 123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest
docker push 123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest

aws lambda create-function \
  --function-name eval-ninja \
  --package-type Image \
  --code ImageUri=123456.dkr.ecr.us-east-1.amazonaws.com/eval-ninja-lambda:latest \
  --role arn:aws:iam::123456:role/eval-ninja-lambda \
  --memory-size 1024 \
  --timeout 60 \
  --environment Variables='{LLM_PROVIDER=openai,OPENAI_API_KEY=sk-...}'
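
The function also needs an HTTP entry point before curl can reach it; a Lambda function URL is the lightest option (auth-type NONE would make the endpoint public, so prefer AWS_IAM or an API Gateway in front):

aws lambda create-function-url-config \
  --function-name eval-ninja \
  --auth-type AWS_IAM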

When serverless makes sense:

  • CI/CD pipelines that only evaluate on PR (spiky, not continuous)
  • Development environments where the instance can scale to zero
  • Multi-team setups where each team wants their own isolated endpoint without maintaining servers

When to use a persistent server instead:

  • Production sampling (1–5% of traffic) where cold starts would add unacceptable tail latency
  • High-volume batch evaluations where cold start overhead amortizes poorly
  • When you need persistent SQLite history (serverless requires external Postgres for persistent storage)

Upgrade and backup

# Docker: pull the new image, then recreate the container
# (docker restart alone would reuse the old image)
docker pull evalninja/eval.ninja:latest
docker stop eval-ninja && docker rm eval-ninja
docker run -d --name eval-ninja ...   # same flags as your original docker run

# Docker Compose
docker compose pull
docker compose up -d

# Kubernetes: update image tag and rollout
kubectl set image deployment/eval-ninja eval-ninja=evalninja/eval.ninja:1.2.0
kubectl rollout status deployment/eval-ninja

Backup is your SQLite file (if using the default) or your Postgres database. The SQLite file lives at the path you configured with DATABASE_URL, or at the default /app/data/eval.db inside the container. To back up:

# Backup SQLite from running container
docker exec eval-ninja sqlite3 /app/data/eval.db ".backup '/tmp/backup.db'"
docker cp eval-ninja:/tmp/backup.db ./eval-backup-$(date +%Y%m%d).db
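
To make this a nightly job, a host crontab entry is enough (02:00 daily; note that % signs must be escaped in crontab, and /var/backups is an arbitrary destination):

0 2 * * * docker exec eval-ninja sqlite3 /app/data/eval.db ".backup '/tmp/backup.db'" && docker cp eval-ninja:/tmp/backup.db /var/backups/eval-$(date +\%Y\%m\%d).db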

Parity with managed cloud

                     Self-hosted                    Managed cloud
REST API             ✓ Identical                    ✓ Identical
All metrics          ✓ Full suite                   ✓ Full suite
Eval history         ✓ SQLite / Postgres            ✓ Managed DB
Judge models         ⚠ BYOK required                ✓ Managed, no key needed
Dashboard UI         ⚠ API only (UI roadmap)        ✓ Full dashboard
Team collaboration   ⚠ Bring your own auth          ✓ Built-in
Data residency       ✓ Your network only            ⚠ Processed on our infra
Cost                 Subscription + compute + LLM   Credit-based

Frequently asked questions

Does self-hosting require a paid plan?
Self-hosting requires a subscription. Contact us at founders@eval.ninja for pricing. You also pay for your own compute (a t3.small is sufficient for most teams) and your LLM provider API calls. There is no per-eval fee for self-hosted deployments beyond the subscription.

What are the system requirements?
Minimum: Docker 20.10+, 2GB RAM, 1 vCPU, outbound HTTPS to your LLM provider. Recommended for production: 4GB RAM, 2 vCPU, persistent volume for SQLite (or external Postgres for multi-node). The Docker image is approximately 500MB.

Do I still need an LLM API key?
Yes, for any metrics that involve an LLM judge (faithfulness, answer relevance, context relevance, LLM-as-a-judge). eval.ninja uses BYOK — your key, your provider, no markup. If you need fully local evaluation with no external API calls, configure eval.ninja to use Ollama with a local model.

How do I upgrade?
Pull the new image and recreate the container. Schema migrations run automatically on startup. Your evaluation history is preserved in the SQLite file or Postgres database.

How do I get support?
Open an issue on GitHub or email founders@eval.ninja. We respond within one business day.