TL;DR

Promptfoo is a CLI-first open source eval tool with strong red teaming. eval.ninja is an API-first eval platform with hosted scoring, BYOK, and Docker self-host. Use Promptfoo if your workflow is local YAML configs and you don't need hosted history. Use eval.ninja if you want evals callable from any language, optional managed hosting, and zero application-layer latency.

Side-by-side comparison

|                    | eval.ninja                              | Promptfoo                       |
|--------------------|-----------------------------------------|---------------------------------|
| Setup              | docker run, or sign up for cloud        | npm install -g promptfoo        |
| Primary interface  | REST API                                | CLI + YAML config               |
| Language support   | Any (curl from anywhere)                | Node-centric (JS/TS SDK)        |
| Hosted option      | Yes (managed cloud)                     | No (Promptfoo Cloud in beta)    |
| Self-host          | Yes (one Docker command)                | Yes (runs locally)              |
| BYOK               | Yes (all providers)                     | Yes (all providers)             |
| CI/CD integration  | REST API, works with any CI system      | CLI, works in CI, Node required |
| Red teaming        | Not today                               | Yes (mature module)             |
| RAG metrics        | Full RAGAS suite                        | Limited                         |
| Eval history UI    | Yes (cloud now; local on roadmap)       | Local viewer only               |
| LLM-as-a-judge     | Yes (built-in, configurable)            | Yes (llm-rubric assertion)      |
| Pricing            | Free tier + usage-based; self-host free | Free OSS; Promptfoo Cloud TBD   |
| Open source        | No (Docker image free, source closed)   | Yes (MIT license)               |

Where Promptfoo wins

Mature red teaming module

Promptfoo's red teaming module is the most complete in the open source ecosystem. It includes attack strategies for jailbreaks, indirect prompt injection, harmful content generation, PII exfiltration, and more. eval.ninja does not have this today. If adversarial testing is part of your eval workflow, Promptfoo or a dedicated red teaming tool is the right choice for that layer.

Larger OSS community

Promptfoo has a large active community, extensive documentation, and many real-world examples. When you're stuck, there's likely a GitHub issue or Discord discussion that addresses it. eval.ninja is a younger product with less community surface area.

Zero cost for purely local use

Promptfoo is MIT-licensed and runs entirely locally. If you already have a Node.js environment, are on a tight budget, and only need local eval runs with no hosted history, Promptfoo's local-first model costs nothing beyond your own provider API calls.

YAML config ergonomics for static test suites

If your test cases are stable, Promptfoo's YAML configuration is clean and version-control-friendly. Define providers, prompts, and assertions in one file that your whole team can read and modify without writing code.


Where eval.ninja wins

API-callable from any language

eval.ninja is a REST API. You can call it from Python, Go, Rust, Node, Ruby, Bash, no-code tools, or anything that can make an HTTP request. Promptfoo requires Node.js and its CLI. In polyglot teams or non-Node pipelines, this is a real friction difference. For example, calling the eval endpoint from Rust with the reqwest crate:

// Async call via reqwest; `client`, `key`, and `payload` are assumed to be
// constructed by the caller (a reqwest::Client, an API key string, and a
// serde-serializable request body).
let response = client
    .post("https://api.eval.ninja/v1/evaluate")
    .header("Authorization", format!("Bearer {}", key))
    .json(&payload) // serialize the request body as JSON
    .send()
    .await?;

No application-layer overhead

eval.ninja's evaluation latency is essentially the judge model's latency: no wrapper overhead, no serialization penalty, and the API server itself adds only single-digit milliseconds. This matters when you're judging production traffic asynchronously and want tight SLAs on the eval pipeline itself.

Hosted option with managed judge models

The cloud platform includes managed judge models — you call the API and get a score. No BYOK required for the cloud option. Promptfoo's cloud platform is in beta with unclear pricing; the OSS version always requires you to supply provider keys.

Full RAG metric suite

eval.ninja implements the full RAGAS metric suite: faithfulness, answer relevance, context precision, context recall, context relevance, and more. Promptfoo's built-in assertions are more general-purpose (factuality, toxicity, similarity) and require more custom prompt engineering to replicate RAG-specific metrics.
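As a sketch, a request for several RAGAS metrics at once might look like the following. The article's own example shows only `faithfulness` and `answer_relevancy`; the `context_precision` and `context_recall` names here are assumptions based on RAGAS naming conventions, not documented API values. The snippet validates the payload locally before any network call:

```shell
# Hypothetical multi-metric request body; metric names beyond faithfulness
# and answer_relevancy are assumptions, not documented values.
PAYLOAD='{
  "user_input": "What is the return policy?",
  "response": "Items may be returned within 30 days.",
  "retrieved_contexts": ["Items may be returned within 30 days of purchase."],
  "metrics": ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
}'

# Sanity-check the JSON locally (python3 assumed available) before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload OK"

# Then send it with the same call shape as the curl example below:
# curl -X POST https://api.eval.ninja/v1/evaluate \
#   -H "Authorization: Bearer $EVAL_NINJA_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```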

Docker self-host for compliance

One docker run command and you have a self-hosted eval.ninja API in your own network. Promptfoo runs locally but doesn't expose an HTTP API — it's a CLI, not a server. For teams that need a shared eval endpoint across services or CI systems, eval.ninja's self-hosted model is a better fit.
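A minimal self-host bootstrap could look like this; the image name, port, and environment variable are illustrative assumptions, not documented values, so check the actual install docs before running:

```shell
# Illustrative only: image name, port, and env var are assumptions.
docker run -d \
  -p 8080:8080 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  evalninja/server:latest
# Services and CI jobs inside your network can then share
# http://<host>:8080/v1/evaluate as a common eval endpoint.
```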


When to choose which

  • Need red teaming today? → Promptfoo (eval.ninja doesn't have this yet)
  • Need a hosted dashboard for non-engineers to review results? → eval.ninja cloud
  • Polyglot codebase, no Python/Node? → eval.ninja (REST API, any language)
  • Prefer git-tracked YAML configs for static test suites? → Promptfoo
  • Need full RAG metrics (faithfulness, context recall) out of the box? → eval.ninja
  • OSS license required? → Promptfoo (MIT licensed)
  • Need compliance-safe self-hosting with an HTTP API? → eval.ninja self-host
  • Need both red teaming and quality metrics? → Use both. Many teams run Promptfoo for security tests and eval.ninja for quality metrics.
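Whichever tools you pick, wiring their scores into CI usually comes down to gating on a threshold. A minimal sketch, assuming a flat JSON object of metric scores (the response shape is an assumption modeled on the request fields, not documented output):

```shell
# Hypothetical eval response; the JSON shape is an assumption.
RESULT='{"faithfulness": 0.92, "answer_relevancy": 0.88}'

# Pull one score out of the response.
SCORE=$(echo "$RESULT" | python3 -c 'import sys, json; print(json.load(sys.stdin)["faithfulness"])')

# Fail the pipeline if faithfulness drops below 0.8.
python3 - "$SCORE" <<'EOF'
import sys
score = float(sys.argv[1])
print("gate:", "pass" if score >= 0.8 else "fail")
sys.exit(0 if score >= 0.8 else 1)
EOF
```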

Migrating a Promptfoo config to eval.ninja

Here's a real Promptfoo config and the equivalent eval.ninja API call. Some things map cleanly; some don't.

providers:
  - openai:gpt-4o

prompts:
  - "Answer this question using only the provided context: {{question}}"

tests:
  - vars:
      question: "What is the return policy?"
      context: "Items may be returned within 30 days of purchase."
    assert:
      - type: llm-rubric
        value: "The answer should mention '30 days' and be grounded in the context"
      - type: contains
        value: "30 days"

The equivalent eval.ninja call:

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Items may be returned within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase."],
    "reference": "The return policy allows returns within 30 days of purchase.",
    "metrics": ["faithfulness", "answer_relevancy"]
  }'

What maps cleanly: LLM-rubric assertions are closest to metrics like aspect_critic or grounding checks such as faithfulness (rubrics are expressed through metric definitions, not ad hoc JSON fields). Deterministic “does it mention X?” checks resemble string-similarity metrics such as exact_match or rouge_score. On managed cloud, judge models come from your organization configuration (JWT claims); self-hosted deployments configure providers via environment variables.

What doesn't map cleanly: Promptfoo's contains, regex, and javascript assertion types are deterministic checks that eval.ninja doesn't replicate (you'd handle these in your own test code). Promptfoo's YAML provider definitions don't have a direct analog — in eval.ninja, provider config is an environment variable, not a per-test config.
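Those deterministic checks are easy to keep in your own test script. A minimal sketch of `contains`- and `regex`-style checks against a model response (the response text here is a stand-in for whatever your application returned):

```shell
# Stand-in for the model output under test.
RESPONSE="Items may be returned within 30 days."

# Equivalent of Promptfoo's `contains` assertion.
echo "$RESPONSE" | grep -q "30 days" && echo "contains: pass"

# Equivalent of a `regex` assertion.
echo "$RESPONSE" | grep -Eq "[0-9]+ days" && echo "regex: pass"
```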


Frequently asked questions

Is eval.ninja open source?

No — eval.ninja is not open source. The self-hosted Docker image is freely available to pull and run, but the source code is not published. Promptfoo is MIT-licensed open source. If an OSS license is a hard requirement for your organization, Promptfoo is the right choice for that reason.

Does eval.ninja support red teaming?

Not today. eval.ninja focuses on RAG quality metrics and LLM-as-a-judge for production quality evaluation. Promptfoo has a mature red teaming module and is the better choice for adversarial testing. Red teaming is on our roadmap.

Can I run Promptfoo and eval.ninja together?

Yes, and this is a common pattern. Promptfoo for security and adversarial tests; eval.ninja for quality metrics (faithfulness, context recall, answer relevance). They run independently and integrate into the same CI pipeline without conflict. Each tool is better at different things.

What does self-hosting eval.ninja cost?

The Docker image has no per-eval fee. You pay for compute (a t3.small is sufficient for most teams, ~$15/month) and your LLM provider API calls at their standard rates. There is no eval.ninja markup on self-hosted LLM calls.

Can Promptfoo evaluate RAG pipelines?

Promptfoo supports general LLM evaluation with custom rubrics, which you can write to evaluate RAG properties. But it doesn't have pre-built implementations of faithfulness, context recall, or context precision — you'd write these as llm-rubric assertions yourself. eval.ninja implements these metrics based on the RAGAS framework, so you get consistent, calibrated scores without writing judge prompts from scratch.