LLM Evals: Testing AI Outputs Systematically
You build a prompt with specific instructions. You need to test whether it works. Evals provide systematic testing: run your prompt against a set of inputs, then check whether the outputs match expected behavior.
LLM outputs are non-deterministic. The same prompt with the same input can produce different responses across runs. Evals measure reliability across multiple attempts.
TL;DR
Write a prompt with system instructions. Create test cases with inputs and expected outputs. Grade responses using code (fastest), humans (most capable), or LLM-as-judge (scalable). Run each test multiple times to measure pass rates. Ship when reliability exceeds your threshold.
Three Grading Methods
Code-Based Grading
Execute code to verify outputs. Best when possible—fast, reliable, deterministic:
```python
from anthropic import Anthropic

client = Anthropic()

# Your prompt
CONTACT_EXTRACTION_PROMPT = """Extract contact information from the message.
Return JSON with 'email' and 'phone' fields."""

def extract_contact_info(message):
    # Calls the LLM with CONTACT_EXTRACTION_PROMPT as the system prompt
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=500,
        system=CONTACT_EXTRACTION_PROMPT,
        messages=[{"role": "user", "content": message}]
    )
    return parse_json(response.content)

# Test the prompt with various inputs
def test_data_extraction():
    input_text = "Contact John at john@example.com or 555-0123"
    result = extract_contact_info(input_text)
    # Verify structure
    assert "email" in result
    assert "phone" in result
    # Verify format
    assert "@" in result["email"]
    assert result["phone"].replace("-", "").isdigit()
    # Verify content
    assert result["email"] == "john@example.com"
    assert "555" in result["phone"]
```

Works well for:
- Code generation (does it compile? pass tests?)
- Data extraction (valid JSON? correct schema?)
- SQL generation (query runs? returns expected rows?)
- Format compliance (correct structure? required fields present?)
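The `parse_json` helper used above is left undefined. A minimal sketch might look like the following; the fence-stripping behavior is an assumption (models sometimes wrap JSON output in a Markdown code block), not part of the original:

```python
import json
import re

def parse_json(text):
    """Parse a model response that should contain a JSON object.

    Strips an optional surrounding Markdown code fence before parsing,
    since models sometimes wrap JSON output in one.
    """
    match = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```

A stricter version could validate the schema here too, so malformed outputs fail fast instead of surfacing as confusing assertion errors later.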
Human Grading
A person reviews the output and assigns a score. Most capable method—handles nuance, context, subjective quality—but slow and expensive:
```python
test_cases = [
    {
        "input": "Explain photosynthesis to a 10-year-old",
        "output": llm_response,
        "criteria": {
            "accuracy": "Scientifically correct?",
            "age_appropriate": "Uses simple language?",
            "clarity": "Easy to understand?"
        }
    }
]

# A human reviewer scores each criterion 1-5;
# aggregate scores determine pass/fail.
```

Use when:
- Subjective quality matters (tone, empathy, creativity)
- Establishing ground truth for initial test sets
- Validating LLM-as-judge performance
- High-stakes outputs where errors are costly
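One simple way to turn per-criterion human scores into a pass/fail decision is to threshold the mean. This is a sketch, not a prescribed scheme; the 4.0 threshold is an assumption you should tune to your own quality bar:

```python
def aggregate_human_scores(scores, threshold=4.0):
    """Average per-criterion 1-5 scores from a human reviewer and
    pass the test case if the mean meets the threshold."""
    mean = sum(scores.values()) / len(scores)
    return {"mean": mean, "passed": mean >= threshold}

# Example: one reviewer's scores for the photosynthesis test case
verdict = aggregate_human_scores(
    {"accuracy": 5, "age_appropriate": 4, "clarity": 4}
)
```

A mean can hide a single disqualifying criterion, so some teams instead require every criterion to clear a minimum score.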
LLM-as-Judge Grading
Use an LLM to evaluate another LLM's output. Scalable middle ground between code and humans, but costs API calls for every evaluation:
```python
from anthropic import Anthropic

def grade_response(input_query, output, criteria):
    client = Anthropic()
    grading_prompt = f"""Evaluate this response on a scale of 1-5.

Input: {input_query}
Output: {output}
Criteria: {criteria}

Provide a score (1-5) and brief explanation.
Format your response as:
Score: X
Reasoning: ..."""
    result = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": grading_prompt
        }]
    )
    # Parse the numeric score from the judge's response
    return parse_score(result.content)
```

Cost consideration: every test case requires an API call. 100 test cases run 10 times each means 1,000 grading calls. Use prompt caching for grading rubrics to reduce costs.
Use when:
- Evaluating subjective qualities at scale (helpfulness, professionalism)
- Code-based checks insufficient (semantic similarity, tone)
- Human grading too slow/expensive
- Clear rubrics can guide judgment
Limitation: LLM judges are non-deterministic. Same response can receive different scores across runs. Address this with statistical methods below.
What Evals Evaluate
Accuracy: Does the output match expected results?
```python
# SQL query generation
def test_query_accuracy():
    input_prompt = "Get all users created in last 7 days"
    query = generate_sql(input_prompt)
    # Run against a test database
    results = test_db.execute(query)
    expected = test_db.execute(
        "SELECT * FROM users WHERE created_at > NOW() - INTERVAL '7 days'"
    )
    assert len(results) == len(expected)
```

Format compliance: Correct structure and schema?
```python
# API response generation
def test_format():
    response = generate_api_response(user_query)
    # Validate schema
    assert "status" in response
    assert "data" in response
    assert response["status"] in ["success", "error"]
    # Validate types
    assert isinstance(response["data"], dict)
```

Safety and boundaries: Does it avoid prohibited actions?
```python
# Customer support agent
def test_safety():
    transcript = handle_refund_request(
        "I want my money back immediately!"
    )
    # The agent should never promise immediate refunds
    prohibited_phrases = [
        "I'll refund you now",
        "refund processed immediately",
        "money back right away"
    ]
    for phrase in prohibited_phrases:
        assert phrase.lower() not in transcript.lower()
```

Tone and style: Appropriate for context?
```python
# Voice agent transcript eval
def test_professional_tone():
    transcript = voice_agent.handle_call(customer_issue)
    # LLM-as-judge for subjective tone evaluation
    score = grade_professionalism(transcript)
    assert score >= 4  # Minimum acceptable professionalism
```

What Evals Can't Do
Evals test known scenarios. They can't predict all edge cases or adversarial inputs. Your test set represents what you've thought to test—unknown unknowns remain unknown.
Evals need continuous updates. User behavior evolves, new edge cases emerge, requirements change. A passing eval today doesn't guarantee production success tomorrow. Successful teams continuously add real production failures back into test sets.
LLM-as-judge has blind spots. Judges can miss subtle errors, hallucinate passing scores, or apply inconsistent standards. Always validate judge performance against human-graded examples before trusting at scale.
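The validation step above can be made concrete with a simple agreement check: run the judge over a sample that humans have already labeled and measure how often the verdicts match. The 90% bar in the example is an assumption, not a standard:

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of examples where the LLM judge's pass/fail verdict
    matches the human reviewer's verdict."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Example policy: only trust the judge at scale above 90% agreement
# trusted = judge_agreement(human, judge) >= 0.9
```

Plain agreement overstates reliability when most cases pass; for skewed label distributions, a chance-corrected statistic such as Cohen's kappa is a common refinement.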
Statistical Rigor
LLMs are non-deterministic. Running a test once isn't enough:
```python
def measure_reliability(test_case, runs=10):
    passes = 0
    for _ in range(runs):
        output = llm_generate(test_case["input"])
        if evaluate(output, test_case["expected"]):
            passes += 1
    pass_rate = passes / runs
    return pass_rate

# Test case must pass 95% of the time
test_case = {
    "input": "Extract email from: Contact us at hello@example.com",
    "expected": {"email": "hello@example.com"}
}

pass_rate = measure_reliability(test_case, runs=10)
assert pass_rate >= 0.95
```

Set thresholds based on risk tolerance. Critical systems (medical, financial) need higher pass rates (>98%). Lower-stakes applications might accept 85-90%.
Track pass rates over time. Declining rates signal prompt degradation, model updates affecting behavior, or new edge cases entering production.
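With only 10 runs per test, a raw pass rate is a noisy estimate. A Wilson score interval (a standard formula, shown here as an illustrative sketch) makes that uncertainty explicit:

```python
import math

def wilson_interval(passes, runs, z=1.96):
    """95% Wilson score confidence interval for a pass rate."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / runs + z**2 / (4 * runs**2)
    )
    return (max(0.0, center - half), min(1.0, center + half))
```

For example, 9 passes out of 10 runs yields an interval of roughly (0.60, 0.98): the observed 90% pass rate is consistent with a true rate well below your threshold, which is why high-reliability targets call for more runs.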
Practical Implementation
Your prompt defines behavior. Test cases provide inputs. Evals measure if outputs match expectations:
```python
# The prompt you're testing
SYSTEM_PROMPT = """You are a helpful assistant that extracts
contact information and schedules appointments."""

def llm_generate(user_input):
    # Run your prompt with the test input
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_input}]
    )
    return response.content

# Test cases: various inputs to test your prompt
test_cases = [
    {
        "id": "extract_email_simple",
        "input": "Email me at test@example.com",
        "expected": {"email": "test@example.com"},
        "grading": "code"
    },
    {
        "id": "appointment_confirmation",
        "input": "Book me for 3pm Tuesday",
        "expected_behaviors": [
            "Confirms date and time",
            "Asks for contact information",
            "Professional tone"
        ],
        "grading": "llm_judge"
    }
]

def run_eval_suite(test_cases, runs_per_test=5):
    results = []
    for test in test_cases:
        pass_count = 0
        for _ in range(runs_per_test):
            # Run the prompt with the test input
            output = llm_generate(test["input"])
            passed = False  # unknown grading methods fail by default
            if test["grading"] == "code":
                passed = code_based_check(
                    output,
                    test["expected"]
                )
            elif test["grading"] == "llm_judge":
                passed = llm_judge_check(
                    output,
                    test["expected_behaviors"]
                )
            if passed:
                pass_count += 1
        results.append({
            "test_id": test["id"],
            "pass_rate": pass_count / runs_per_test
        })
    return results
```

Start small. Build 20-30 test cases covering happy paths and common edge cases. Expand as you encounter production failures.
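The `code_based_check` helper referenced in the suite is not shown. A minimal sketch might compare parsed fields against the expected dict; the exact-match semantics and the parse-failure-means-fail behavior are assumptions:

```python
import json

def code_based_check(output, expected):
    """Pass if the output parses as JSON and every expected field
    is present with exactly the expected value."""
    try:
        result = json.loads(output) if isinstance(output, str) else output
    except ValueError:
        return False  # unparseable output fails the check
    return all(result.get(k) == v for k, v in expected.items())
```

Keeping the checker strict at first and relaxing it deliberately (e.g. normalizing whitespace or phone formats) is usually easier to reason about than the reverse.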
References
- Prompt Caching - Reduce LLM-as-judge evaluation costs
- Designing Error Messages for LLMs - Structure outputs for easier testing