Debugging LLMs: Understanding Attention, Tokens, and Context
When an LLM behaves inconsistently with similar inputs, most engineers try rephrasing the prompt. That works sometimes. But if you're building production systems or integrating models into products, you need to debug systematically. This means understanding how modern LLMs work under the hood.
What You Need to Know
A token is the basic unit of text that LLMs process—roughly a word or part of a word. The model doesn't see characters or words; it sees tokens.
Modern LLMs like Claude and GPT are built on transformer architecture. The key innovation: self-attention, which lets each part of the input consider every other part when processing. This replaced older sequential models (RNNs) that processed text word-by-word, making transformers both faster and better at understanding relationships across long distances in text. For a deeper explanation of how this works, see How LLMs Think and Respond.
Self-attention means each token can "attend to" other tokens, weighing their relevance. This is why position matters in prompts—the model assigns different attention weights based on where information appears.
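To make the mechanism concrete, here is a toy sketch of scaled dot-product attention in plain NumPy. It illustrates the idea only; real models use learned projection matrices, many heads, and many layers, and the embeddings below are random placeholders.

import numpy as np

def self_attention(Q, K, V):
    # Scaled dot-product attention: each token's query is compared against every key,
    # and the resulting weights mix the value vectors.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per token
    return weights @ V

# Toy example: 3 "tokens" with 4-dimensional embeddings (random placeholders)
x = np.random.rand(3, 4)
out = self_attention(x, x, x)
print(out.shape)  # (3, 4): every output row is a weighted mix of all input rows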
Tokenization converts text into these chunks. "user_id:12345" might become ["user", "_", "id", ":", "123", "45"]—six tokens, not one. Inconsistent tokenization causes inconsistent behavior.
Context windows limit how much input the model processes at once. Claude can handle 200K tokens, but that includes your prompt, system message, conversation history, and the response.
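A quick back-of-the-envelope check makes this concrete. The sketch below is illustrative: the 200K window matches Claude's documented limit, but the reserved-output figure is an assumption you should set for your own use case.

CONTEXT_WINDOW = 200_000   # Claude's documented context window, in tokens
RESERVED_OUTPUT = 4_096    # assumption: tokens set aside for the model's response

def fits_in_context(system_tokens, history_tokens, prompt_tokens):
    # Everything counts against the same window: system message, history, prompt, and response.
    used = system_tokens + history_tokens + prompt_tokens
    return used + RESERVED_OUTPUT <= CONTEXT_WINDOW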
Tokenization Debugging
When model behavior differs between similar inputs, visualize the tokens first. Use Anthropic's tokenizer tool or check programmatically with the Claude SDK's count_tokens() method.
Common issues:
- Formatting inconsistencies: "$100" vs "$ 100" vs "100 dollars" tokenize differently
- Whitespace: leading/trailing spaces create different tokens (" hello" vs "hello")
- Case sensitivity: "API" vs "api" vs "Api" all split differently
- Special characters: "\n\n" vs "\n \n" create different token sequences
Quick fixes:
# Normalize inputs before sending them to the model
def normalize_input(text):
    text = text.strip()            # Remove leading/trailing whitespace
    text = " ".join(text.split())  # Collapse internal whitespace runs
    return text

# Check token counts programmatically via the SDK's token-counting endpoint
from anthropic import Anthropic

client = Anthropic()
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",  # substitute the model you're debugging
    messages=[{"role": "user", "content": "your input text here"}],
)
print(f"Token count: {count.input_tokens}")

When debugging, compare tokenization between working and failing inputs. The difference usually reveals the issue.
Attention and Context
Attention patterns: If the model focuses on the wrong part of your input, restructure with explicit markers. Put critical information at the beginning and end—these positions typically receive higher attention weights. Use system messages to set context that should influence all responses.
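As a sketch of that restructuring, the request below puts the task in the system message and restates the critical instruction at the end of the user turn. The model name and document text are placeholders, and the wording is only one way to do it.

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=256,
    system="You extract invoice totals. Respond with a single number.",  # context for every turn
    messages=[{
        "role": "user",
        "content": (
            "Extract the invoice total from the document below.\n\n"   # critical info up front
            "<document>\n...long invoice text...\n</document>\n\n"
            "Remember: respond with the total as a single number."     # restated at the end
        ),
    }],
)
print(response.content[0].text)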
Context limits: If inputs exceed the model's context window, Claude truncates or refuses. Check token counts for your full conversation history, not just individual messages. Structure long documents to fit within limits, or use retrieval techniques to fetch only relevant sections.
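One way to enforce this is to count tokens over the entire message list and drop (or summarize) the oldest turns until the conversation fits. This sketch assumes the Messages token-counting endpoint and a self-chosen budget below the hard 200K limit; note that each loop iteration makes an API call.

from anthropic import Anthropic

client = Anthropic()

def trim_history(messages, model, budget=150_000):
    # Drop the oldest user/assistant pair until the whole conversation fits the budget.
    while len(messages) > 2:
        count = client.messages.count_tokens(model=model, messages=messages)
        if count.input_tokens <= budget:
            break
        messages = messages[2:]  # remove turns in pairs so the history still starts with a user message
    return messages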
Understanding these internals turns debugging from guesswork into engineering.
References
- Attention Is All You Need - Vaswani et al., 2017. The paper that introduced transformer architecture.
- Anthropic Token Counting - Documentation on counting tokens with Claude
- Claude API Reference - Official API documentation