Promptfoo is a CLI-first open source eval tool with strong red teaming. eval.ninja is an API-first eval platform with hosted scoring, BYOK, and Docker self-host. Use Promptfoo if your workflow is local YAML configs and you don't need hosted history. Use eval.ninja if you want evals callable from any language, optional managed hosting, and near-zero overhead added on top of the judge model's own latency.
Side-by-side comparison
Where Promptfoo wins
Mature red teaming module
Promptfoo's red teaming module is the most complete in the open source ecosystem. It includes attack strategies for jailbreaks, indirect prompt injection, harmful content generation, PII exfiltration, and more. eval.ninja does not have this today. If adversarial testing is part of your eval workflow, Promptfoo or a dedicated red teaming tool is the right choice for that layer.
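To make that concrete, promptfoo exposes red teaming through dedicated CLI subcommands. The commands below reflect recent promptfoo releases and are a sketch rather than a canonical recipe; verify them against promptfoo redteam --help on your installed version:

# Scaffold a red team config (target, plugins, attack strategies)
npx promptfoo@latest redteam init

# Generate adversarial test cases and run them against the target
npx promptfoo@latest redteam run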
Larger OSS community
Promptfoo has a large active community, extensive documentation, and many real-world examples. When you're stuck, there's likely a GitHub issue or Discord discussion that addresses it. eval.ninja is a younger product with less community surface area.
Zero cost for purely local use
Promptfoo is MIT-licensed and runs entirely locally. If you're on a tight budget and only need local eval runs with no hosted history, Promptfoo's local-first model costs nothing beyond the provider keys you already supply.
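The whole local loop fits in three commands; provider keys come from your environment and nothing is uploaded anywhere:

# Scaffold a promptfooconfig.yaml in the current directory
npx promptfoo@latest init

# Run the evals in that config; provider keys (e.g. OPENAI_API_KEY) are read from the environment
npx promptfoo@latest eval

# Browse results in a local web viewer backed by local storage
npx promptfoo@latest view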
YAML config ergonomics for static test suites
If your test cases are stable, Promptfoo's YAML configuration is clean and version-control-friendly. Define providers, prompts, and assertions in one file that your whole team can read and modify without writing code.
Where eval.ninja wins
API-callable from any language
eval.ninja is a REST API. You can call it from Python, Go, Rust, Node, Ruby, Bash, no-code tools, or anything that can make an HTTP request. Promptfoo requires Node.js and its CLI. In polyglot teams or non-Node pipelines, this is a real friction difference. From Rust, for example, an evaluation call is just an ordinary HTTP POST (the snippet below assumes the reqwest client):
let response = client
    .post("https://api.eval.ninja/v1/evaluate")
    .header("Authorization", format!("Bearer {}", key))
    .json(&payload)
    .send()
    .await?;

No application-layer overhead
eval.ninja's evaluation latency is essentially the judge model's latency: no wrapper overhead, no serialization penalty, and only single-digit milliseconds added by the API server itself. This matters when you're judging production traffic asynchronously and want tight SLAs on the eval pipeline.
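One way to sanity-check that claim in your own network is to time a call with curl and compare it against the judge model's standalone latency. The minimal payload below is an assumption for illustration; include whatever fields your chosen metrics actually require:

# Total wall-clock time for one evaluation request, response body discarded
curl -s -o /dev/null -w "eval round trip: %{time_total}s\n" \
  -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_input": "What is the return policy?",
       "response": "Items may be returned within 30 days.",
       "metrics": ["answer_relevancy"]}'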
Hosted option with managed judge models
The cloud platform includes managed judge models — you call the API and get a score. No BYOK required for the cloud option. Promptfoo's cloud platform is in beta with unclear pricing; the OSS version always requires you to supply provider keys.
Full RAG metric suite
eval.ninja implements the full RAGAS metric suite: faithfulness, answer relevance, context precision, context recall, context relevance, and more. Promptfoo's built-in assertions are more general-purpose (factuality, toxicity, similarity) and require more custom prompt engineering to replicate RAG-specific metrics.
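Requesting the whole suite is just a longer metrics array. The snake_case identifiers below follow RAGAS naming conventions and are my assumption here; confirm the exact strings against the eval.ninja metric reference:

# Ask for the RAG suite in a single call; pretty-print whatever comes back
curl -s -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Items may be returned within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase."],
    "reference": "The return policy allows returns within 30 days of purchase.",
    "metrics": ["faithfulness", "answer_relevancy", "context_precision",
                "context_recall", "context_relevance"]
  }' | jq .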
Docker self-host for compliance
One docker run command and you have a self-hosted eval.ninja API in your own network. Promptfoo runs locally but doesn't expose an HTTP API — it's a CLI, not a server. For teams that need a shared eval endpoint across services or CI systems, eval.ninja's self-hosted model is a better fit.
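A sketch of that setup; the image name, tag, port, and environment variable below are illustrative placeholders rather than documented values, so take the real ones from the eval.ninja self-hosting docs:

# Placeholder image and port; judge providers are configured via environment variables
docker run -d \
  --name eval-ninja \
  -p 8080:8080 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  eval-ninja/eval-ninja:latest

# Services and CI jobs then call http://localhost:8080/v1/evaluate
# with the same payloads as the hosted API.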
When to choose which
- Need red teaming today? → Promptfoo (eval.ninja doesn't have this yet)
- Need a hosted dashboard for non-engineers to review results? → eval.ninja cloud
- Polyglot codebase, no Python/Node? → eval.ninja (REST API, any language)
- Prefer git-tracked YAML configs for static test suites? → Promptfoo
- Need full RAG metrics (faithfulness, context recall) out of the box? → eval.ninja
- OSS license required? → Promptfoo (MIT licensed)
- Need compliance-safe self-hosting with an HTTP API? → eval.ninja self-host
- Need both red teaming and quality metrics? → Use both. Many teams run Promptfoo for security tests and eval.ninja for quality metrics; a sketch of a combined CI step follows this list.
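A rough shape of that split as a single CI step; the config path and the payload file are placeholders, not prescribed names:

#!/usr/bin/env bash
set -euo pipefail

# Security layer: run the Promptfoo assertion / red team suite locally in CI
npx promptfoo@latest eval -c promptfooconfig.yaml

# Quality layer: score a representative sample with eval.ninja's judge metrics
curl -sf -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d @eval_sample.json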
Migrating a Promptfoo config to eval.ninja
Here's a real Promptfoo config and the equivalent eval.ninja API call. Some things map cleanly; some don't.
providers:
  - openai:gpt-4o

prompts:
  - "Answer this question using only the provided context.\n\nContext: {{context}}\n\nQuestion: {{question}}"

tests:
  - vars:
      question: "What is the return policy?"
      context: "Items may be returned within 30 days of purchase."
    assert:
      - type: llm-rubric
        value: "The answer should mention '30 days' and be grounded in the context"
      - type: contains
        value: "30 days"

The equivalent eval.ninja call:
curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the return policy?",
    "response": "Items may be returned within 30 days.",
    "retrieved_contexts": ["Items may be returned within 30 days of purchase."],
    "reference": "The return policy allows returns within 30 days of purchase.",
    "metrics": ["faithfulness", "answer_relevancy"]
  }'

What maps cleanly: LLM-rubric assertions are closest to metrics like aspect_critic or grounding checks such as faithfulness (rubrics are expressed through metric definitions, not ad hoc JSON fields). Deterministic “does it mention X?” checks resemble string-similarity metrics such as exact_match or rouge_score. On managed cloud, judge models come from your organization configuration (JWT claims); self-hosted deployments configure providers via environment variables.
What doesn't map cleanly: Promptfoo's contains, regex, and javascript assertion types are deterministic checks that eval.ninja doesn't replicate (you'd handle these in your own test code). Promptfoo's YAML provider definitions don't have a direct analog — in eval.ninja, provider config is an environment variable, not a per-test config.
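For those deterministic checks, the simplest replacement is a plain assertion on the model output in your own pipeline, before or alongside the eval.ninja call. A shell sketch of replicating the contains: "30 days" assertion:

# Model output captured however your pipeline produces it
RESPONSE="Items may be returned within 30 days."

# Equivalent of Promptfoo's contains assertion: fail the job if the string is missing
if ! grep -q "30 days" <<< "$RESPONSE"; then
  echo "FAIL: response does not mention '30 days'" >&2
  exit 1
fi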