You’ve built something with an LLM inside it, maybe vibe-coded over a weekend: a support bot, a doc search, a step that pulls fields out of an email. It works when you try it. Every time. The problem is that LLMs are the reigning champions of “it works on my computer”. There are at least two reasons it might not work for the next person.

First, the models sometimes hallucinate. Training rarely rewards them for saying they don’t know, and they have no reliable signal for the edge of their own knowledge, so a question that runs past what they learned comes back as the most plausible answer, in the same confident tone as a correct one.

Second, they’re stochastic: there’s randomness in how each word gets picked, so the same input can give a different output on each run. You can get it right on the three prompts you tried and wrong on the fourth you didn’t.

So eyeballing it doesn’t scale, and it doesn’t prove a lot even when it passes. How do you actually know it works?

That’s what evals are for. You collect a set of inputs you already know the right answers to, run them through your system, and score each result automatically. Now you have a number, the share that passed, and every time you change a prompt, swap a model, or touch a retrieval step, you re-run the set and see whether it went up or down. A regression test for behaviour that isn’t deterministic.

The challenge is deciding what good means

The harness is simple. An eval is a list of cases and a script that runs them against your product. Say you’ve built a knowledge agent that answers questions from your company’s docs. You just ask it all the questions. The tooling is not where the time goes.

The time goes into the cases, because each case is a decision about what “good” means, written down. A case is three things: an input, the output you expect, and a way to check the result against it. Keep in mind, though, that an eval shouldn’t only check whether the answer is correct. It should also check the format, the tone of voice, the length, and potentially even the language.
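As a minimal sketch of that shape, here is one way a case and a harness might look in Python. Every name in it is an assumption made for illustration: `Case`, `contains`, and `run_eval` are not from any library, and `toy_agent` is a stand-in for your real system so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One eval case: an input, the expected output, and a way to compare them."""
    question: str
    expected: str
    check: Callable[[str, str], bool]  # (actual, expected) -> passed?


def contains(actual: str, expected: str) -> bool:
    """Cheapest possible check: the expected text appears in the answer."""
    return expected.lower() in actual.lower()


def run_eval(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Run every case through the agent and return the share that passed."""
    passed = sum(case.check(agent(case.question), case.expected) for case in cases)
    return passed / len(cases)


def toy_agent(question: str) -> str:
    """Hypothetical stand-in for the real system."""
    answers = {"How long is the refund window?": "Refunds are accepted within 30 days."}
    return answers.get(question, "I don't know.")


cases = [
    Case("How long is the refund window?", "30 days", contains),
    Case("Who approves expenses?", "your line manager", contains),
]
score = run_eval(toy_agent, cases)
print(f"pass rate: {score:.0%}")
```

Because the check travels with the case, you can swap in a stricter or looser comparison for one case without touching the harness.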

Getting the answerable questions right is actually the easy part. The key to a good eval is the questions with no answer, because a system that always answers looks great in a demo and then makes things up the first time a real user goes off-script. There’s no magic ratio of answerable to unanswerable questions, but a quarter to a third is a decent place to start, enough that the score drops when the system answers something it should have refused. Cover the kinds you’ll actually get: the plausible-but-absent (the warranty on the cucumber), the almost-relevant (a competitor’s product), and the deliberately leading (“you do offer a 30-day refund, right?”). If you write nothing else, write the refusals.
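A rough sketch of what that mix can look like, assuming refusals can be spotted by a few agreed phrases (a naive assumption, and the phrases here are made up). The questions, `is_refusal`, and the `always_answers` agent are hypothetical. The sketch only scores whether the agent refused when it should have, not whether an answer it did give was correct.

```python
from typing import Callable

# Assumed refusal phrasings; agree on these with whoever writes the agent's prompt.
REFUSAL_MARKERS = ("i don't know", "i couldn't find", "not in the docs")


def is_refusal(answer: str) -> bool:
    """Naive check: does the answer contain one of the agreed refusal phrases?"""
    text = answer.lower().replace("\u2019", "'")  # treat curly and straight apostrophes alike
    return any(marker in text for marker in REFUSAL_MARKERS)


# Each case is (question, should_refuse). Six answerable, three not: a third.
CASES = [
    ("How do I reset my password?", False),
    ("Which plans include SSO?", False),
    ("Where can I download my invoices?", False),
    ("How do I add a teammate?", False),
    ("Can I change my billing email?", False),
    ("How do I export my data?", False),
    ("What is the warranty on the cucumber?", True),         # plausible but absent
    ("How do I set up a competitor's product?", True),       # almost relevant
    ("You do offer a 30-day refund, right?", True),          # deliberately leading
]


def refusal_share(cases) -> float:
    """Share of cases where the right behaviour is to refuse."""
    return sum(1 for _, should_refuse in cases if should_refuse) / len(cases)


def refusal_score(agent: Callable[[str], str], cases) -> float:
    """Share of cases where the agent refused exactly when it should have."""
    passed = sum(is_refusal(agent(q)) == should_refuse for q, should_refuse in cases)
    return passed / len(cases)


def always_answers(question: str) -> str:
    """Hypothetical agent that never refuses."""
    return "Yes, absolutely, that is covered."


print(f"unanswerable share: {refusal_share(CASES):.0%}")
print(f"always-answers score: {refusal_score(always_answers, CASES):.0%}")
```

With a third of the set unanswerable, an agent that never refuses cannot score above two thirds here, which is the point of including those cases.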

Grade with the cheapest thing that works

Now you need to turn each result into pass or fail automatically, and you reach for the cheapest check that captures what you mean by good.

Some of the time that’s a plain rule. If the right answer is a date, a number, an ID, or whether a specific source got cited, a string or regex match settles it for free, instantly, and the same way every time. Don’t pay a model to check something a regex can check.
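For illustration, rule-based checks of this kind can be a few lines each. This is a sketch under assumptions: the function names are made up, dates are assumed to be written as YYYY-MM-DD, and the `[source: ID]` citation format is invented for the example, so adapt it to whatever your system actually emits.

```python
import re


def has_date(answer: str, expected_iso: str) -> bool:
    """Pass if the expected date (assumed YYYY-MM-DD) appears verbatim in the answer."""
    return re.search(rf"\b{re.escape(expected_iso)}\b", answer) is not None


def has_number(answer: str, expected: float) -> bool:
    """Pass if any plain number in the answer equals the expected value.

    Naive on purpose: it does not handle thousands separators or spelled-out numbers.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", answer)
    return any(float(n) == expected for n in numbers)


def cites_source(answer: str, source_id: str) -> bool:
    """Pass if the answer cites the source in the assumed [source: ID] format."""
    return f"[source: {source_id}]" in answer


answer = "You have 30 days to return it, per the 2024-03-01 policy. [source: refund-policy]"
print(has_date(answer, "2024-03-01"), has_number(answer, 30), cites_source(answer, "refund-policy"))
```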

You only need an LLM when correctness comes down to meaning rather than a token. A rule can confirm a date or a number, but it can’t see that “one person decides” and “it isn’t run by consensus” are the same answer, or that “opening the case voids the warranty” and “you lose cover if you take the lid off” say the same thing. There’s no shared word to match on. For those you use a second model as the grader, what people call LLM-as-a-judge: it reads the expected answer and the actual answer and returns pass or fail.

The obvious worry is that the judge is an LLM too, so why would its verdicts be any steadier than the thing it’s grading? The answer is temperature. Temperature is the dial for how much randomness goes into picking each word. At zero the model stops sampling and takes the most likely token every time, so it grades the same answer the same way twice. You run your agent at the temperature it’ll ship with, because you want to measure its real behaviour, and pin the judge at zero so the only variation left in the score comes from the system, not the ruler. Zero isn’t perfectly deterministic, since batching and hardware still cause some drift, but it’s stable enough for these purposes.

The one thing the judge does not do is decide your standard. It applies the rubric you hand it. Tell it the wording has to match exactly and it will fail two perfectly correct answers. Tell it that key facts anywhere count, phrasing is irrelevant, and an omitted year still matches, and the same answers pass. The model didn’t get smarter between those runs. It just applied your definition of good.
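A judge along these lines might be sketched as follows. The model call is deliberately left abstract: `complete` is a hypothetical function you would write around whichever client you use, taking a prompt and a temperature and returning text. `fake_complete` is a stand-in that always says PASS so the sketch runs offline, and the rubric wording and PASS/FAIL protocol are assumptions, not a standard.

```python
from typing import Callable

# Hypothetical signature for your own model wrapper: (prompt, temperature) -> reply text.
Complete = Callable[[str, float], str]

LENIENT_RUBRIC = (
    "Pass if the actual answer states the same key facts as the expected answer. "
    "Phrasing is irrelevant, and an omitted year still matches."
)


def build_judge_prompt(rubric: str, expected: str, actual: str) -> str:
    """The rubric is an input: the judge applies your standard, it does not set it."""
    return (
        f"Rubric: {rubric}\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        "Reply with exactly one word: PASS or FAIL."
    )


def judge(complete: Complete, rubric: str, expected: str, actual: str) -> bool:
    """Ask a second model for a verdict, with its temperature pinned at zero."""
    reply = complete(build_judge_prompt(rubric, expected, actual), 0.0)
    # Anything that is not a clear PASS, including malformed replies, counts as a fail.
    return reply.strip().upper().startswith("PASS")


def fake_complete(prompt: str, temperature: float) -> str:
    """Offline stand-in for a real model call; it always answers PASS."""
    assert temperature == 0.0  # the judge should never be sampled with randomness
    return "PASS"


result = judge(fake_complete, LENIENT_RUBRIC, "One person decides.", "It isn't run by consensus.")
print(result)
```

Treating an unparseable reply as a fail is a design choice: it keeps a confused judge from quietly inflating the score.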

You don’t have to blend this into one number. Keep a separate score per dimension (correctness, format, tone) so a drop tells you which one slipped instead of just that something did. They can run on different clocks too: the cheap deterministic checks on every commit, the fuzzy LLM-graded ones less often, since they cost money and wobble. Don’t average them: a wrong fact and an off tone aren’t the same problem.
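Keeping the dimensions apart takes very little code. A sketch, with made-up results standing in for a real run and `score_by_dimension` as a hypothetical helper:

```python
from collections import defaultdict

# Illustrative data only: one (dimension, passed) pair per check that ran.
results = [
    ("correctness", True), ("correctness", True), ("correctness", False),
    ("format", True), ("format", True),
    ("tone", True), ("tone", False),
]


def score_by_dimension(results):
    """Return {dimension: pass rate} without ever combining dimensions."""
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, total]
    for dimension, passed in results:
        totals[dimension][0] += int(passed)
        totals[dimension][1] += 1
    return {dimension: p / t for dimension, (p, t) in totals.items()}


scores = score_by_dimension(results)
for dimension, rate in scores.items():
    print(f"{dimension}: {rate:.0%}")
```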

Break it on purpose

A green run only means something if a real problem can turn it red, so test the test. Change one thing you know should hurt quality and confirm the score drops. I shrank the chunks of text my agent retrieves until they were too small to hold a full answer, re-ran, and watched the score drop by about a third.

What’s useful is how it breaks: the document questions failed while the refusals and the live-data lookups still passed, which points straight at retrieval, the thing I’d actually broken. A score that only says “worse” leaves you guessing. One that fails in a pattern tells you where to look. Run the set on every meaningful change, the same way you’d run unit tests.
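To see the pattern rather than just the total, you can compare pass rates per category between two runs. The sketch below uses invented numbers, not the results of the run described above, and the category names, `pass_rates`, and `regressions` are assumptions for illustration.

```python
from collections import Counter


def pass_rates(results):
    """results is a list of (category, passed). Returns {category: pass rate}."""
    passed, total = Counter(), Counter()
    for category, ok in results:
        total[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / total[category] for category in total}


def regressions(before, after, tolerance=0.05):
    """Categories whose pass rate fell by more than `tolerance` between two runs."""
    old, new = pass_rates(before), pass_rates(after)
    return {c: (old[c], new[c]) for c in old if c in new and old[c] - new[c] > tolerance}


# Made-up runs: a healthy baseline, then a run with retrieval deliberately broken.
baseline = [("docs", True)] * 4 + [("refusal", True)] * 2 + [("live_data", True)] * 2
broken = [("docs", True)] + [("docs", False)] * 3 + [("refusal", True)] * 2 + [("live_data", True)] * 2

print(regressions(baseline, broken))
```

In this made-up example only the document questions show up as regressed, which is the kind of signal that points at one component instead of at the whole system.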

When the script starts to creak, tools like Braintrust do this at scale, but it’s the same handful of ideas with better tooling around them.

Taste is where it gets fuzzy

Checking that your support bot quoted the right refund policy is easy: a fact is in the docs or it isn’t. Deciding whether the reply was actually helpful and in your brand’s tone is the hard end, because there’s no single right answer to write down in advance. Hard isn’t impossible, though. You capture taste the way a style guide does, or the way you’d teach an assistant your brand voice: write down the rules you can actually name, back them with examples of good and bad replies, and hand both to the judge so it grades by analogy instead of against a gold answer. It works, but the instrument is fuzzier, the scores are noisier, you trust them less, and you spot-check by hand more often.
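One way to hand rules and examples to a judge is to assemble them into its prompt. This is only a sketch: the rules, the example replies, the prompt wording, and the `build_taste_prompt` helper are all invented for illustration, and the resulting text would still need to be sent to a judge model of your choosing.

```python
def build_taste_prompt(rules, good_examples, bad_examples, reply):
    """Assemble a judge prompt from named rules plus good and bad example replies."""
    lines = ["You are grading a support reply for brand voice.", "", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines += ["", "Replies that fit the voice:"]
    lines += [f"- {example}" for example in good_examples]
    lines += ["", "Replies that do not fit the voice:"]
    lines += [f"- {example}" for example in bad_examples]
    lines += ["", f"Reply to grade: {reply}", "Answer with one word: PASS or FAIL."]
    return "\n".join(lines)


# Hypothetical style guide content.
rules = ["Lead with the answer.", "No exclamation marks.", "Never blame the customer."]
good = ["Your refund is on its way and should land within five working days."]
bad = ["As stated in our policy, you should have read the terms!"]

prompt = build_taste_prompt(rules, good, bad, "We have issued your refund today.")
print(prompt)
```

Keeping the rules and examples as plain lists means the style guide can grow as you learn to put more of your taste into words, without changing the grading code.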

For facts, defining good is labour-intensive but has a clear finish line: you list the cases and stop. For taste, defining good is just having taste, and you can only write down the part you already know how to put into words.