What is runtime LLM evaluation?

Runtime LLM evaluation scores outputs produced in production, such as chatbot answers, agent actions, or generated sales messages. It complements design-time evals by checking the actual output created for a real user or workflow.

Should runtime evals block user responses?

Only for high-risk actions. Most teams sample production outputs asynchronously so users are not waiting for judge results. Blocking checks are best for actions like sending external messages, issuing refunds, or making policy-sensitive recommendations.

Which runtime LLM outputs should I evaluate?

Start with high-volume or high-risk outputs: support chatbot answers, RAG responses, sales outreach, lead qualification summaries, agent tool decisions, and any generated message sent outside your product.

Runtime LLM Evaluation: How to Score Production Outputs

Short answer

Design-time evals test changes before release. Runtime evals score what your app generated in production. Most runtime evals can run in the background; a few should block risky outputs before they are sent or executed.

Why runtime evals matter

A prompt can pass a test suite and still fail in production. The user may ask a new question, retrieval may return weaker context, a CRM record may be stale, or the model may produce an answer that is plausible but unsupported.

Runtime evaluation checks the generated message itself: the chatbot reply, sales email, support summary, lead qualification note, agent action, or RAG answer that your app is about to use.

Use runtime evals where bad output matters

Support chatbots: score faithfulness, policy compliance, escalation quality, and whether the answer used the retrieved help-center context.
RAG search answers: monitor groundedness and context recall across live queries instead of only curated test questions.
Lead generation: verify that outbound messages are personalized from CRM evidence and do not invent account facts.
AI agents: judge whether a proposed tool call or workflow action is supported by the user request and business rules.
Sales and success summaries: check whether generated notes accurately reflect call transcripts, account data, or support history.

Async sampling for monitoring

Most runtime evals should not sit in the user-facing request path. Generate the answer, return it to the user, and score a sample of traffic in the background (5% in the example below). Store the score with the conversation or trace ID so failures can be debugged later.

import asyncio
import random

import httpx

EVAL_NINJA_KEY = "..."

# Hold references to in-flight eval tasks. The event loop keeps only weak
# references, so an unreferenced task can be garbage-collected before it finishes.
background_tasks = set()

async def score_chatbot_reply(question, answer, contexts, conversation_id):
    async with httpx.AsyncClient(timeout=15) as client:
        response = await client.post(
            "https://api.eval.ninja/v1/evaluate",
            headers={"Authorization": f"Bearer {EVAL_NINJA_KEY}"},
            json={
                "user_input": question,
                "response": answer,
                "retrieved_contexts": contexts,
                "metrics": ["faithfulness", "answer_relevancy"],
                "metadata": {"conversation_id": conversation_id, "surface": "support_chat"},
            },
        )
        response.raise_for_status()
        return response.json()

async def handle_message(question, conversation_id):
    # retrieve_context and generate_answer are your app's own functions.
    contexts = await retrieve_context(question)
    answer = await generate_answer(question, contexts)

    # Runtime eval: sample 5% of live traffic without delaying the user response.
    if random.random() < 0.05:
        task = asyncio.create_task(score_chatbot_reply(question, answer, contexts, conversation_id))
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)

    return answer
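
If you want every turn of a sampled conversation to be scored, the random draw above can be swapped for a deterministic one. The sketch below is one possible approach, not part of any eval API: it hashes the conversation ID into a bucket between 0 and 1 and samples the conversation when the bucket falls under the rate. The function name should_sample and the 5% default are illustrative assumptions.

```python
import hashlib


def should_sample(conversation_id: str, rate: float = 0.05) -> bool:
    """Return the same sampling decision every time for a given conversation ID.

    Hypothetical helper: hashes the ID into a bucket in [0, 1) and samples
    the conversation when the bucket is below the requested rate.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the ID, a sampled conversation is scored on every turn, which makes the stored scores easier to read as a complete trace.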

Sampling works well when the eval is used for monitoring. It can show score trends, catch drift, support prompt rollback decisions, and find new examples for your golden dataset.
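
As a rough illustration of drift detection on sampled scores, the sketch below keeps a rolling window of recent scores and flags when the window mean falls below a baseline by more than a tolerance. The class name ScoreMonitor, the window size, and the tolerance are assumptions chosen for the example; a real setup would more likely live in your metrics or observability stack.

```python
from collections import deque
from statistics import mean
from typing import Optional


class ScoreMonitor:
    """Hypothetical rolling-window monitor for one eval metric."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 200):
        self.baseline = baseline      # expected score, e.g. from a recent good week
        self.tolerance = tolerance    # allowed drop before flagging drift
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically

    def record(self, score: float) -> None:
        self.scores.append(score)

    def rolling_mean(self) -> Optional[float]:
        return mean(self.scores) if self.scores else None

    def has_drifted(self, min_samples: int = 30) -> bool:
        # Do not alert on a handful of samples.
        if len(self.scores) < min_samples:
            return False
        return mean(self.scores) < self.baseline - self.tolerance
```

Calling record with each sampled faithfulness score and checking has_drifted on a schedule is enough to turn background scores into a rollback signal.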

Blocking checks for high-risk outputs

Some outputs should be evaluated before they are sent or executed. A generated email to a prospect, a policy-sensitive support response, a refund decision, or an agent tool call can cause real damage if it is wrong.

const result = await fetch("https://api.eval.ninja/v1/evaluate", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.EVAL_NINJA_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    user_input: "Draft a renewal email for Acme Corp.",
    response: generatedEmail,
    retrieved_contexts: [crmAccountNotes, productUsageSummary],
    metrics: ["faithfulness"],
    rubric: "Score whether every specific claim in the email is supported by the CRM and usage context."
  })
});

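// Fail closed: if the eval call fails or returns no faithfulness score,
// route the email to human review instead of sending it unchecked.
const body = result.ok ? await result.json() : null;
const faithfulness = body?.metrics?.find((metric) => metric.name === "faithfulness");

if (!faithfulness || faithfulness.score < 0.85) {
  await sendToHumanReview({ generatedEmail, evalResult: body });
} else {
  await queueEmail(generatedEmail);
}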

Use blocking checks narrowly. Each one adds a judge-model call to the request path, so reserve them for outputs where the cost of a bad message is higher than the cost of waiting or routing to human review.
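
One way to bound the waiting cost is to give the blocking check a time budget and fall back to human review when the budget is exceeded. The sketch below is a minimal version under stated assumptions: gate_output is a hypothetical helper, score_fn stands in for whatever coroutine calls your evaluator and returns a score, and the 0.85 threshold and 3-second timeout are illustrative values.

```python
import asyncio
from typing import Awaitable, Callable


async def gate_output(
    score_fn: Callable[[], Awaitable[float]],
    threshold: float = 0.85,
    timeout_s: float = 3.0,
) -> str:
    """Return "send" or "review" for a high-risk output (hypothetical helper)."""
    try:
        # Bound how long the caller waits on the judge.
        score = await asyncio.wait_for(score_fn(), timeout=timeout_s)
    except Exception:
        # Timeout or evaluator error: fail closed rather than send unchecked.
        return "review"
    return "send" if score >= threshold else "review"
```

Failing closed is a design choice that fits outputs like refunds or external emails. For lower-risk surfaces, the async sampling pattern is usually the better fit.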

What to score at runtime

Faithfulness: every factual claim is supported by retrieved context, CRM data, transcript text, or another trusted source.
Answer relevance: the output responds to the user's actual request instead of drifting into generic help text.
Policy compliance: the output follows your product, legal, brand, or support escalation rules.
Completeness: the output includes required details such as next steps, caveats, source links, or qualification fields.
Action safety: the proposed action is justified by the user request and available evidence.
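
The criteria above can be turned into per-surface pass marks. The sketch below assumes scores arrive as a dictionary of metric name to a value between 0 and 1; the surface names, metric names beyond those used in the earlier examples, and every threshold value are placeholders to tune against your own data.

```python
from typing import Dict, List

# Illustrative minimum scores per surface; all values are assumptions.
THRESHOLDS: Dict[str, Dict[str, float]] = {
    "support_chat": {"faithfulness": 0.85, "answer_relevancy": 0.80},
    "sales_email": {"faithfulness": 0.90},
}


def failing_metrics(surface: str, scores: Dict[str, float]) -> List[str]:
    """Return the required metrics that are missing or below their minimum."""
    required = THRESHOLDS[surface]
    # A missing score is treated as 0.0, so it counts as a failure.
    return sorted(
        name for name, minimum in required.items()
        if scores.get(name, 0.0) < minimum
    )
```

An empty list means the output passed every required metric for that surface.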

How this connects to design-time evals

Runtime evals should feed your design-time process. When a production output fails, add that example to your golden dataset, reproduce it in CI, and decide whether the fix belongs in retrieval, prompt design, model choice, or business rules.

The loop is simple. CI catches known regressions before deploy. Runtime sampling finds new failures after deploy. Those failures become future test cases.
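
A minimal sketch of the last step, assuming the golden dataset is a JSONL file and scores arrive as a dictionary of metric name to value: a failing production output is appended as a candidate test case for later human review. The function name, field names, and threshold are illustrative assumptions, not a prescribed format.

```python
import json
from pathlib import Path
from typing import Dict, List


def add_failure_to_golden_set(
    path: str,
    question: str,
    answer: str,
    contexts: List[str],
    scores: Dict[str, float],
    threshold: float = 0.85,
) -> bool:
    """Append the example to a JSONL file if any score is below threshold.

    Returns True when a record was written, False otherwise.
    """
    failed = {name: value for name, value in scores.items() if value < threshold}
    if not failed:
        return False
    record = {
        "user_input": question,
        "response": answer,
        "retrieved_contexts": contexts,
        "failed_metrics": failed,
        "needs_review": True,  # a human should confirm the expected answer
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

Keeping the needs_review flag separates raw production failures from vetted test cases, so CI only runs examples someone has confirmed.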