LLM Evals: Testing AI Outputs Systematically

You build a prompt with specific instructions. You need to test if it works. Evals provide systematic testing: run your prompt against a set of inputs, check if outputs match expected behavior.

LLM outputs are non-deterministic. The same prompt with the same input can produce different responses across runs. Evals measure reliability across multiple attempts.

TL;DR

Write a prompt with system instructions. Create test cases with inputs and expected outputs. Grade responses using code (fastest), humans (most capable), or LLM-as-judge (scalable). Run each test multiple times to measure pass rates. Ship when reliability exceeds your threshold.

Three Grading Methods

Code-Based Grading

Execute code to verify outputs. Best when possible—fast, reliable, deterministic:

python
import json
from anthropic import Anthropic

client = Anthropic()

# The prompt under test
CONTACT_EXTRACTION_PROMPT = """Extract contact information from the message.
Return JSON with 'email' and 'phone' fields."""

def extract_contact_info(message):
    # Calls the LLM with CONTACT_EXTRACTION_PROMPT as the system prompt
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=500,
        system=CONTACT_EXTRACTION_PROMPT,
        messages=[{"role": "user", "content": message}]
    )
    # Assumes the model returns bare JSON in its first content block
    return json.loads(response.content[0].text)
 
# Test the prompt with various inputs
def test_data_extraction():
    input_text = "Contact John at john@example.com or 555-0123"
    result = extract_contact_info(input_text)
 
    # Verify structure
    assert "email" in result
    assert "phone" in result
 
    # Verify format
    assert "@" in result["email"]
    assert result["phone"].replace("-", "").isdigit()
 
    # Verify content
    assert result["email"] == "john@example.com"
    assert "555" in result["phone"]

Works well for:

  • Code generation (does it compile? pass tests?)
  • Data extraction (valid JSON? correct schema?)
  • SQL generation (query runs? returns expected rows?)
  • Format compliance (correct structure? required fields present?)
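
For the data-extraction and format-compliance cases, the check can be a small stdlib-only validator that parses the raw model output and verifies the expected fields (a sketch; the field names match the contact-extraction example above):

```python
import json

def check_contact_schema(raw_output):
    """Return True if raw_output is valid JSON with the expected fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # Required keys must exist and have the expected types
    return (
        isinstance(data, dict)
        and isinstance(data.get("email"), str)
        and isinstance(data.get("phone"), str)
    )

# Malformed or incomplete responses fail cleanly instead of raising
assert check_contact_schema('{"email": "a@b.com", "phone": "555-0123"}')
assert not check_contact_schema('not json at all')
assert not check_contact_schema('{"email": "a@b.com"}')
```

Returning False on a parse error (rather than raising) keeps a bad model response from crashing the eval harness; it just counts as a failed run.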

Human Grading

A person reviews the output and assigns a score. Most capable method—handles nuance, context, subjective quality—but slow and expensive:

python
test_cases = [
    {
        "input": "Explain photosynthesis to a 10-year-old",
        "output": llm_response,
        "criteria": {
            "accuracy": "Scientifically correct?",
            "age_appropriate": "Uses simple language?",
            "clarity": "Easy to understand?"
        }
    }
]
 
# Human reviewer scores each criterion 1-5
# Aggregate scores determine pass/fail

Use when:

  • Subjective quality matters (tone, empathy, creativity)
  • Establishing ground truth for initial test sets
  • Validating LLM-as-judge performance
  • High-stakes outputs where errors are costly
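
The aggregation step can be a simple rule: average the 1-5 criterion scores and also require that no single criterion is poor (a sketch; both thresholds are illustrative):

```python
def aggregate_human_scores(scores, min_average=4.0, min_individual=3):
    """Pass only if the average is high AND no single criterion is weak."""
    values = list(scores.values())
    average = sum(values) / len(values)
    return average >= min_average and min(values) >= min_individual

# Passes: average 4.33, lowest criterion 4
assert aggregate_human_scores({"accuracy": 5, "age_appropriate": 4, "clarity": 4})
# Fails: average is exactly 4.0, but clarity=2 is below the floor
assert not aggregate_human_scores({"accuracy": 5, "age_appropriate": 5, "clarity": 2})
```

The per-criterion floor matters: averaging alone lets one excellent dimension mask a disqualifying failure in another.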

LLM-as-Judge Grading

Use an LLM to evaluate another LLM's output. Scalable middle ground between code and humans, but costs API calls for every evaluation:

python
from anthropic import Anthropic
 
def grade_response(input_query, output, criteria):
    client = Anthropic()
 
    grading_prompt = f"""Evaluate this response on a scale of 1-5.
 
Input: {input_query}
Output: {output}
 
Criteria: {criteria}
 
Provide a score (1-5) and brief explanation.
Format your response as:
Score: X
Reasoning: ..."""
 
    result = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": grading_prompt
        }]
    )
 
    # Parse the numeric score out of the judge's text response
    return parse_score(result.content[0].text)
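
`parse_score` is left undefined above; one way to implement it, assuming the judge follows the requested `Score: X` format:

```python
import re

def parse_score(judge_text):
    """Extract the 1-5 score from a 'Score: X' line in the judge's output."""
    match = re.search(r"Score:\s*([1-5])", judge_text)
    if match is None:
        # Treat an unparseable judge response as a failure, not a crash
        return None
    return int(match.group(1))

assert parse_score("Score: 4\nReasoning: clear and accurate") == 4
assert parse_score("I cannot grade this.") is None
```

Handling the no-match case explicitly matters in practice: judges occasionally ignore the output format, and the harness should record that as a failed grading call rather than throw.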

Cost consideration: Every test case requires an API call. 100 test cases run 10 times = 1,000 grading calls. Use prompt caching for grading rubrics to reduce costs.
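
One way to apply caching with the Anthropic API is to mark the static rubric's system block as cacheable, so only the per-case input and output are new tokens on each call. A sketch of the request shape (the helper name is ours; caching also requires the cached prefix to exceed a minimum token count):

```python
def cached_rubric_request(rubric, input_query, output):
    """Build messages.create kwargs with the rubric marked for prompt caching."""
    return {
        "model": "claude-sonnet-4-5-20250929",
        "max_tokens": 500,
        # cache_control on the system block asks the API to cache this prefix;
        # it is identical across all grading calls, so it is only billed fully once
        "system": [{
            "type": "text",
            "text": rubric,
            "cache_control": {"type": "ephemeral"},
        }],
        # Only this part varies per test case
        "messages": [{
            "role": "user",
            "content": f"Input: {input_query}\nOutput: {output}\n\nScore 1-5.",
        }],
    }

# Usage: client.messages.create(**cached_rubric_request(rubric, query, output))
```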

Use when:

  • Evaluating subjective qualities at scale (helpfulness, professionalism)
  • Code-based checks insufficient (semantic similarity, tone)
  • Human grading too slow/expensive
  • Clear rubrics can guide judgment

Limitation: LLM judges are non-deterministic. Same response can receive different scores across runs. Address this with statistical methods below.

What Evals Evaluate

Accuracy: Does the output match expected results?

python
# SQL query generation
def test_query_accuracy():
    input_prompt = "Get all users created in last 7 days"
    query = generate_sql(input_prompt)
 
    # Run against test database
    results = test_db.execute(query)
    expected = test_db.execute(
        "SELECT * FROM users WHERE created_at > NOW() - INTERVAL '7 days'"
    )
 
    assert len(results) == len(expected)

Format compliance: Correct structure and schema?

python
# API response generation
def test_format():
    response = generate_api_response(user_query)
 
    # Validate schema
    assert "status" in response
    assert "data" in response
    assert response["status"] in ["success", "error"]
 
    # Validate types
    assert isinstance(response["data"], dict)

Safety and boundaries: Does it avoid prohibited actions?

python
# Customer support agent
def test_safety():
    transcript = handle_refund_request(
        "I want my money back immediately!"
    )
 
    # Agent should never promise immediate refunds
    prohibited_phrases = [
        "I'll refund you now",
        "refund processed immediately",
        "money back right away"
    ]
 
    for phrase in prohibited_phrases:
        assert phrase.lower() not in transcript.lower()

Tone and style: Appropriate for context?

python
# Voice agent transcript eval
def test_professional_tone():
    transcript = voice_agent.handle_call(customer_issue)
 
    # LLM-as-judge for subjective tone evaluation
    score = grade_professionalism(transcript)
    assert score >= 4  # Minimum acceptable professionalism

What Evals Can't Do

Evals test known scenarios. They can't predict all edge cases or adversarial inputs. Your test set represents what you've thought to test—unknown unknowns remain unknown.

Evals need continuous updates. User behavior evolves, new edge cases emerge, requirements change. A passing eval today doesn't guarantee production success tomorrow. Successful teams continuously add real production failures back into test sets.

LLM-as-judge has blind spots. Judges can miss subtle errors, hallucinate passing scores, or apply inconsistent standards. Always validate judge performance against human-graded examples before trusting at scale.
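
Validation can start with a simple agreement rate between the judge's pass/fail decisions and human labels on the same examples (a sketch; the 0.9 bar is illustrative):

```python
def judge_agreement(judge_labels, human_labels):
    """Fraction of examples where the LLM judge matches the human grader."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Judge disagrees with humans on one of five graded examples
agreement = judge_agreement([True, True, False, True, False],
                            [True, True, False, False, False])
assert agreement == 0.8  # below a 0.9 bar: recalibrate the judge prompt
```

If agreement is low, inspect the disagreements before trusting the judge: they usually reveal either an ambiguous rubric or a systematic judge bias (e.g. consistently passing verbose answers).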

Statistical Rigor

LLMs are non-deterministic. Running a test once isn't enough:

python
def measure_reliability(test_case, runs=10):
    passes = 0
 
    for _ in range(runs):
        output = llm_generate(test_case["input"])
        if evaluate(output, test_case["expected"]):
            passes += 1
 
    pass_rate = passes / runs
    return pass_rate
 
# Test case must pass 95% of the time.
# Note: with runs=10, pass_rate moves in steps of 0.1, so a 0.95
# threshold effectively requires 10/10 passes.
test_case = {
    "input": "Extract email from: Contact us at hello@example.com",
    "expected": {"email": "hello@example.com"}
}
 
pass_rate = measure_reliability(test_case, runs=10)
assert pass_rate >= 0.95

Set thresholds based on risk tolerance. Critical systems (medical, financial) need higher pass rates (>98%). Lower-stakes applications might accept 85-90%.

Track pass rates over time. Declining rates signal prompt degradation, model updates affecting behavior, or new edge cases entering production.
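
Keep in mind how noisy a pass rate estimated from 10 runs actually is. A Wilson score interval (stdlib-only sketch) makes the uncertainty explicit:

```python
import math

def wilson_interval(passes, runs, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate from `runs` trials."""
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return center - margin, center + margin

# 9/10 passes looks like 90%, but the interval is wide:
low, high = wilson_interval(9, 10)
# low ≈ 0.60, high ≈ 0.98 — ten runs cannot distinguish 85% from 95%
```

The practical consequence: for a critical 98% threshold, either run far more repetitions or treat a 10-run eval as a smoke test, not proof of reliability.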

Practical Implementation

Your prompt defines behavior. Test cases provide inputs. Evals measure if outputs match expectations:

python
# The prompt you're testing
SYSTEM_PROMPT = """You are a helpful assistant that extracts
contact information and schedules appointments."""
 
def llm_generate(user_input):
    # Run your prompt with the test input
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_input}]
    )
    return response.content[0].text
 
# Test cases: various inputs to test your prompt
test_cases = [
    {
        "id": "extract_email_simple",
        "input": "Email me at test@example.com",
        "expected": {"email": "test@example.com"},
        "grading": "code"
    },
    {
        "id": "appointment_confirmation",
        "input": "Book me for 3pm Tuesday",
        "expected_behaviors": [
            "Confirms date and time",
            "Asks for contact information",
            "Professional tone"
        ],
        "grading": "llm_judge"
    }
]
 
def run_eval_suite(test_cases, runs_per_test=5):
    results = []
 
    for test in test_cases:
        pass_count = 0
 
        for _ in range(runs_per_test):
            # Run prompt with test input
            output = llm_generate(test["input"])
 
            if test["grading"] == "code":
                passed = code_based_check(
                    output,
                    test["expected"]
                )
            elif test["grading"] == "llm_judge":
                passed = llm_judge_check(
                    output,
                    test["expected_behaviors"]
                )
            else:
                raise ValueError(f"Unknown grading method: {test['grading']}")
 
            if passed:
                pass_count += 1
 
        results.append({
            "test_id": test["id"],
            "pass_rate": pass_count / runs_per_test
        })
 
    return results

Start small. Build 20-30 test cases covering happy paths and common edge cases. Expand as you encounter production failures.
