May 14, 2026 · 8 min read

Evals as a permanent system, not a launch checklist

Why eval suites that ship at launch and then atrophy are the single biggest cause of AI quality decay. The pattern we build instead.

evals
agents
production

The eval-at-launch trap

Most teams treat evals like a launch checklist: write 20 test cases, run them, ship the model, move on. Six months later the model has silently degraded, the team doesn't know, and a customer escalation surfaces what should have been caught weeks earlier.

The right mental model: evals are a permanent system. Closer to your monitoring stack than to your test suite.

What permanent means in practice

Eval suites grow weekly. Every production failure we catch becomes a permanent case in the suite. The suite runs against every prompt change, every model change, and weekly against production output whether or not anything changed.
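Here is a minimal Python sketch of what "every failure becomes a permanent case" can look like. The `EvalCase` and `EvalSuite` names, the fields, and the `regression-<incident id>` naming scheme are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected_behavior: str  # what a passing output must do
    source: str             # "production", "adversarial", or "regression"
    added_on: date


@dataclass
class EvalSuite:
    cases: dict[str, EvalCase] = field(default_factory=dict)

    def add_production_failure(
        self, incident_id: str, prompt: str, expected_behavior: str, today: date
    ) -> EvalCase:
        """Turn a caught production failure into a permanent regression case."""
        case_id = f"regression-{incident_id}"  # hypothetical naming scheme
        if case_id in self.cases:
            # Idempotent: filing the same incident twice keeps the original case.
            return self.cases[case_id]
        case = EvalCase(case_id, prompt, expected_behavior, "regression", today)
        self.cases[case_id] = case
        return case
```

Keying the case on the incident ID keeps a link back to the failure that motivated it, which helps when someone later asks why a case exists.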

Failing the eval blocks the merge. Drift over time triggers a model review. The owner of the suite (yes, a named owner) reports on the same review cadence as the rest of the engineering org.
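The two triggers can be sketched as one small gate function. This is an illustration under assumptions: any failing case blocks the merge, and "drift" means the latest weekly pass rate falling more than a tolerance below the best earlier week. The 5-point tolerance and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    block_merge: bool
    trigger_model_review: bool
    reason: str


def evaluate_gate(
    results: dict[str, bool],
    weekly_pass_rates: list[float],
    drift_tolerance: float = 0.05,  # assumed: 5 points of pass rate
) -> GateDecision:
    """Decide whether to block a merge and whether drift warrants a model review."""
    failed = sorted(case_id for case_id, passed in results.items() if not passed)

    # Drift: compare the latest weekly production run with the best earlier week.
    drifted = False
    if len(weekly_pass_rates) >= 2:
        baseline = max(weekly_pass_rates[:-1])
        drifted = (baseline - weekly_pass_rates[-1]) > drift_tolerance

    reasons = []
    if failed:
        reasons.append("failing cases: " + ", ".join(failed))
    if drifted:
        reasons.append("weekly pass rate dropped beyond tolerance")
    return GateDecision(bool(failed), drifted, "; ".join(reasons) or "ok")
```

In CI, a non-zero exit code when `block_merge` is true is usually enough to stop the merge. The drift flag is better sent to whoever owns the suite than to the author of the pull request.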

The 20-case minimum, and the 200-case ceiling

20 hand-picked cases beat 200 synthetic ones. The 20 cover happy path, common edge cases, refusal patterns, and the patterns you can't afford to break. Below 20 you're not really testing. Above ~200 you're spending more cycles on eval maintenance than on the agent itself.

The healthy distribution we see in client projects: 60% real anonymized production cases, 30% adversarial / edge cases we wrote, 10% regression cases pulled from past bugs.
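A quick health check for those numbers, sketched in Python. The 20 and 200 bounds and the 60/30/10 mix come from the text above. The 10-point tolerance and the `source` labels are assumptions made for illustration.

```python
from collections import Counter

TARGET_MIX = {"production": 0.60, "adversarial": 0.30, "regression": 0.10}
MIN_CASES, MAX_CASES = 20, 200


def suite_health(sources: list[str], tolerance: float = 0.10) -> list[str]:
    """Return warnings about suite size and mix; an empty list means healthy."""
    warnings = []
    total = len(sources)
    if total < MIN_CASES:
        warnings.append(f"only {total} cases, fewer than the {MIN_CASES}-case minimum")
    elif total > MAX_CASES:
        warnings.append(f"{total} cases, above the ~{MAX_CASES}-case ceiling")

    counts = Counter(sources)
    for source, target in TARGET_MIX.items():
        share = counts[source] / total if total else 0.0
        # Assumed tolerance: flag a category only when it is well off target.
        if abs(share - target) > tolerance:
            warnings.append(f"{source} share is {share:.0%}, target is {target:.0%}")
    return warnings
```

Running this on every suite change makes the ratio a visible number, so drift in the mix gets noticed the same way drift in quality does.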

LLM-as-judge — where it works, where it breaks

LLM-as-judge is the only practical way to score evals at scale. For most criteria (faithfulness, refusal correctness, tone match) it works well enough.

It breaks on three things:

Subtle factual accuracy. When the ground truth requires domain expertise the judge model doesn't have, scores get noisy.
Subjective quality. "Is this tactful enough?" — humans disagree, and the judge model often agrees with whichever side phrased the prompt.
Multi-turn coherence. Judge models score individual turns better than whole conversations.

Fix: human review on the close calls. Score 80%+ of cases automatically with LLM-as-judge and route the bottom decile, the cases the judge is least sure about, to a human. That's where tuning judgment lives.
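A sketch of that routing step. It reads "bottom decile" as the 10% of cases whose judge score sits closest to the pass threshold, which is one interpretation of "close calls" and an assumption on our part. The 0-to-1 score scale, the 0.5 threshold, and the function name are also hypothetical.

```python
import math


def route_for_human_review(
    judge_scores: dict[str, float],
    pass_threshold: float = 0.5,  # assumed: judge scores fall in [0, 1]
    human_percent: int = 10,      # the "bottom decile"
) -> tuple[list[str], list[str]]:
    """Split case IDs into (auto_scored, needs_human) by closeness to the threshold."""
    if not judge_scores:
        return [], []
    n_human = max(1, math.ceil(len(judge_scores) * human_percent / 100))
    # Smallest margin first; ties are broken by case ID so the split is deterministic.
    ranked = sorted(
        judge_scores,
        key=lambda case_id: (abs(judge_scores[case_id] - pass_threshold), case_id),
    )
    return ranked[n_human:], ranked[:n_human]
```

If your judge returns a confidence value alongside its score, ranking by that instead is a reasonable swap. The point is that the human queue is bounded and filled with the cases where the judge is least decisive.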

Cost telemetry inside the eval suite

Every eval run logs token cost. You learn how much each test case costs and which ones are pulling the average up. Over time you spot "this prompt change improved quality but tripled cost" before the bill arrives.
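A minimal sketch of that telemetry, assuming each case run records its input and output token counts. The per-token prices, the "more than twice the mean" rule for expensive cases, and the 2x alert ratio are placeholders, not real rates or recommended thresholds.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
USD_PER_M_INPUT = 3.00
USD_PER_M_OUTPUT = 15.00


@dataclass(frozen=True)
class CaseRun:
    case_id: str
    passed: bool
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * USD_PER_M_INPUT
                + self.output_tokens * USD_PER_M_OUTPUT) / 1_000_000


def summarize(runs: list[CaseRun]) -> dict:
    """Pass rate, total cost, and the cases costing more than twice the mean."""
    if not runs:
        raise ValueError("summarize() needs at least one case run")
    total = sum(r.cost_usd for r in runs)
    mean = total / len(runs)
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "total_cost_usd": total,
        "expensive_cases": [r.case_id for r in runs if r.cost_usd > 2 * mean],
    }


def compare(before: list[CaseRun], after: list[CaseRun],
            cost_alert_ratio: float = 2.0) -> dict:
    """Put the quality change and the cost change of a prompt change side by side."""
    b, a = summarize(before), summarize(after)
    ratio = a["total_cost_usd"] / b["total_cost_usd"]
    return {
        "quality_delta": a["pass_rate"] - b["pass_rate"],
        "cost_ratio": ratio,
        "cost_alert": ratio >= cost_alert_ratio,
    }
```

A report with a positive `quality_delta` and a `cost_ratio` near 3 is the "improved quality but tripled cost" case, visible in the pull request instead of on the invoice.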

We build this into every eval suite we ship now. It started as a defensive measure and turned into a useful product-economics signal.

Who owns it after we leave

This is the question we ask in every handoff. If the answer is "we'll figure it out", we know the eval suite will atrophy. If there's a named human with the eval review on their calendar, it usually survives.

We've started writing the named owner into the SOW. Not as a deliverable from us — as a commitment from the client. The clients who push back are the ones whose evals will die. We bring this up before signing, not after.