Promptfoo is a CLI-first open source eval tool with strong red teaming. eval.ninja is an API-first eval platform with hosted scoring, BYOK, and Docker self-host. Use Promptfoo if your workflow is local YAML configs and you don't need hosted history. Use eval.ninja if you want evals callable from any language, optional managed hosting, and near-zero overhead added on top of the judge model's own latency.
Side-by-side comparison
Where Promptfoo wins
Mature red teaming module
Promptfoo's red teaming module is the most complete in the open source ecosystem. It includes attack strategies for jailbreaks, indirect prompt injection, harmful content generation, PII exfiltration, and more. eval.ninja does not have this today. If adversarial testing is part of your eval workflow, Promptfoo or a dedicated red teaming tool is the right choice for that layer.
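To make that concrete, promptfoo exposes red teaming through dedicated CLI subcommands. The commands below reflect recent promptfoo releases and are a sketch rather than a canonical recipe; verify them against promptfoo redteam --help on your installed version:

# Scaffold a red team config (target, plugins, attack strategies)
npx promptfoo@latest redteam init

# Generate adversarial test cases and run them against the target
npx promptfoo@latest redteam run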
Larger OSS community
Promptfoo has a large active community, extensive documentation, and many real-world examples. When you're stuck, there's likely a GitHub issue or Discord discussion that addresses it. eval.ninja is a younger product with less community surface area.
Zero cost for purely local use
Promptfoo is MIT-licensed and runs entirely locally. If you're on a tight budget and only need local eval runs with no hosted history, Promptfoo's local-first model costs nothing beyond the provider keys you already supply.
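The whole local loop fits in three commands; provider keys come from your environment and nothing is uploaded anywhere:

# Scaffold a promptfooconfig.yaml in the current directory
npx promptfoo@latest init

# Run the evals in that config; provider keys (e.g. OPENAI_API_KEY) are read from the environment
npx promptfoo@latest eval

# Browse results in a local web viewer backed by local storage
npx promptfoo@latest view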
YAML config ergonomics for static test suites
If your test cases are stable, Promptfoo's YAML configuration is clean and version-control-friendly. Define providers, prompts, and assertions in one file that your whole team can read and modify without writing code.
Where eval.ninja wins
API-callable from any language
eval.ninja is a REST API. You can call it from Python, Go, Rust, Node, Ruby, Bash, no-code tools, or anything that can make an HTTP request. Promptfoo requires Node.js and its CLI. In polyglot teams or non-Node pipelines, this is a real friction difference. From Rust, for example, an evaluation call is just an ordinary HTTP POST (the snippet below assumes the reqwest client):
let response = client
    .post("https://api.eval.ninja/v1/evaluate")
    .header("Authorization", format!("Bearer {}", key))
    .json(&payload)
    .send()
    .await?;

No application-layer overhead
eval.ninja's evaluation latency is essentially the judge model's latency: no wrapper overhead, no serialization penalty, and only single-digit milliseconds added by the API server itself. This matters when you're judging production traffic asynchronously and want tight SLAs on the eval pipeline.
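One way to sanity-check that claim in your own network is to time a call with curl and compare it against the judge model's standalone latency. The minimal payload below is an assumption for illustration; include whatever fields your chosen metrics actually require:

# Total wall-clock time for one evaluation request, response body discarded
curl -s -o /dev/null -w "eval round trip: %{time_total}s\n" \
  -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_input": "What is the return policy?",
       "response": "Items may be returned within 30 days.",
       "metrics": ["answer_relevancy"]}'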
Hosted option with managed judge models
The cloud platform includes managed judge models — you call the API and get a score. No BYOK required for the cloud option. Promptfoo's cloud platform is in beta with unclear pricing; the OSS version always requires you to supply provider keys.
Full RAG metric suite
eval.ninja implements the full RAGAS metric suite: faithfulness, answer relevance, context precision, context recall, context relevance, and more. Promptfoo's built-in assertions are more general-purpose (factuality, toxicity, similarity) and require more custom prompt engineering to replicate RAG-specific metrics.
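Requesting the whole suite is just a longer metrics array. The snake_case identifiers below follow RAGAS naming conventions and are my assumption here; confirm the exact strings against the eval.ninja metric reference:

# Ask for the RAG suite in a single call; pretty-print whatever comes back
curl -s -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Items may be returned within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase."],
    "reference": "The return policy allows returns within 30 days of purchase.",
    "metrics": ["faithfulness", "answer_relevancy", "context_precision",
                "context_recall", "context_relevance"]
  }' | jq .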
Docker self-host for compliance
One docker run command and you have a self-hosted eval.ninja API in your own network. Promptfoo runs locally but doesn't expose an HTTP API — it's a CLI, not a server. For teams that need a shared eval endpoint across services or CI systems, eval.ninja's self-hosted model is a better fit.
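A sketch of that setup; the image name, tag, port, and environment variable below are illustrative placeholders rather than documented values, so take the real ones from the eval.ninja self-hosting docs:

# Placeholder image and port; judge providers are configured via environment variables
docker run -d \
  --name eval-ninja \
  -p 8080:8080 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  eval-ninja/eval-ninja:latest

# Services and CI jobs then call http://localhost:8080/v1/evaluate
# with the same payloads as the hosted API.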
When to choose which
- Need red teaming today? → Promptfoo (eval.ninja doesn't have this yet)
- Need a hosted dashboard for non-engineers to review results? → eval.ninja cloud
- Polyglot codebase, no Python/Node? → eval.ninja (REST API, any language)
- Prefer git-tracked YAML configs for static test suites? → Promptfoo
- Need full RAG metrics (faithfulness, context recall) out of the box? → eval.ninja
- OSS license required? → Promptfoo (MIT licensed)
- Need compliance-safe self-hosting with an HTTP API? → eval.ninja self-host
- Need both red teaming and quality metrics? → Use both. Many teams run Promptfoo for security tests and eval.ninja for quality metrics; a sketch of a combined CI step follows this list.
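A rough shape of that split as a single CI step; the config path and the payload file are placeholders, not prescribed names:

#!/usr/bin/env bash
set -euo pipefail

# Security layer: run the Promptfoo assertion / red team suite locally in CI
npx promptfoo@latest eval -c promptfooconfig.yaml

# Quality layer: score a representative sample with eval.ninja's judge metrics
curl -sf -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d @eval_sample.json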
Migrating a Promptfoo config to eval.ninja
Here's a real Promptfoo config and the equivalent eval.ninja API call. Some things map cleanly; some don't.
providers:
  - openai:gpt-4o

prompts:
  - "Answer this question using only the provided context.\n\nContext: {{context}}\n\nQuestion: {{question}}"

tests:
  - vars:
      question: "What is the return policy?"
      context: "Items may be returned within 30 days of purchase."
    assert:
      - type: llm-rubric
        value: "The answer should mention '30 days' and be grounded in the context"
      - type: contains
        value: "30 days"

The equivalent eval.ninja call:
curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Items may be returned within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase."],
    "reference": "The return policy allows returns within 30 days of purchase.",
    "metrics": ["faithfulness", "answer_relevancy"]
  }'

What maps cleanly: LLM-rubric assertions are closest to metrics like aspect_critic or grounding checks such as faithfulness (rubrics are expressed through metric definitions, not ad hoc JSON fields). Deterministic “does it mention X?” checks resemble string-similarity metrics such as exact_match or rouge_score. On managed cloud, judge models come from your organization configuration (JWT claims); self-hosted deployments configure providers via environment variables.
What doesn't map cleanly: Promptfoo's contains, regex, and javascript assertion types are deterministic checks that eval.ninja doesn't replicate (you'd handle these in your own test code). Promptfoo's YAML provider definitions don't have a direct analog — in eval.ninja, provider config is an environment variable, not a per-test config.
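For those deterministic checks, the simplest replacement is a plain assertion on the model output in your own pipeline, before or alongside the eval.ninja call. A shell sketch of replicating the contains: "30 days" assertion:

# Model output captured however your pipeline produces it
RESPONSE="Items may be returned within 30 days."

# Equivalent of Promptfoo's contains assertion: fail the job if the string is missing
if ! grep -q "30 days" <<< "$RESPONSE"; then
  echo "FAIL: response does not mention '30 days'" >&2
  exit 1
fi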