Debugging LLMs: Understanding Attention, Tokens, and Context
When an LLM behaves inconsistently with similar inputs, most engineers try rephrasing the prompt. That works sometimes. But if you're building production systems or integrating models into products, you need to debug systematically. This means understanding how modern LLMs work under the hood.
What You Need to Know
A token is the basic unit of text that LLMs process—roughly a word or part of a word. The model doesn't see characters or words; it sees tokens.
Modern LLMs like Claude and GPT are built on transformer architecture. The key innovation: self-attention, which lets each part of the input consider every other part when processing. This replaced older sequential models (RNNs) that processed text word-by-word, making transformers both faster and better at understanding relationships across long distances in text. For a deeper explanation of how this works, see How LLMs Think and Respond.
Self-attention means each token can "attend to" other tokens, weighing their relevance. This is why position matters in prompts—the model assigns different attention weights based on where information appears.
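To make the mechanism concrete, here is a toy sketch of scaled dot-product attention in plain NumPy. It illustrates the idea only; real models use learned projection matrices, many heads, and many layers, and the embeddings below are random placeholders.

import numpy as np

def self_attention(Q, K, V):
    # Scaled dot-product attention: each token's query is compared against every key,
    # and the resulting weights mix the value vectors.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per token
    return weights @ V

# Toy example: 3 "tokens" with 4-dimensional embeddings (random placeholders)
x = np.random.rand(3, 4)
out = self_attention(x, x, x)
print(out.shape)  # (3, 4): every output row is a weighted mix of all input rows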
Tokenization converts text into these chunks. "user_id:12345" might become ["user", "_", "id", ":", "123", "45"]—six tokens, not one. Inconsistent tokenization causes inconsistent behavior.
Context windows limit how much input the model processes at once. Claude can handle 200K tokens, but that includes your prompt, system message, conversation history, and the response.
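A quick back-of-the-envelope check makes this concrete. The sketch below is illustrative: the 200K window matches Claude's documented limit, but the reserved-output figure is an assumption you should set for your own use case.

CONTEXT_WINDOW = 200_000   # Claude's documented context window, in tokens
RESERVED_OUTPUT = 4_096    # assumption: tokens set aside for the model's response

def fits_in_context(system_tokens, history_tokens, prompt_tokens):
    # Everything counts against the same window: system message, history, prompt, and response.
    used = system_tokens + history_tokens + prompt_tokens
    return used + RESERVED_OUTPUT <= CONTEXT_WINDOW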
Tokenization Debugging
When model behavior differs between similar inputs, visualize the tokens first. Use Anthropic's tokenizer tool or check programmatically with the Claude SDK's count_tokens() method.
Common issues:
- Formatting inconsistencies: "$100" vs "$ 100" vs "100 dollars" tokenize differently
- Whitespace: leading/trailing spaces create different tokens (" hello" vs "hello")
- Case sensitivity: "API" vs "api" vs "Api" all split differently
- Special characters: "\n\n" vs "\n \n" create different token sequences
Quick fixes:
# Normalize inputs before sending them to the model
def normalize_input(text):
    text = text.strip()            # Remove leading/trailing whitespace
    text = " ".join(text.split())  # Collapse internal whitespace runs
    return text

# Check token counts programmatically via the SDK's token-counting endpoint
from anthropic import Anthropic

client = Anthropic()
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",  # substitute the model you're debugging
    messages=[{"role": "user", "content": "your input text here"}],
)
print(f"Token count: {count.input_tokens}")

When debugging, compare tokenization between working and failing inputs. The difference usually reveals the issue.
Attention and Context
Attention patterns: If the model focuses on the wrong part of your input, restructure with explicit markers. Put critical information at the beginning and end—these positions typically receive higher attention weights. Use system messages to set context that should influence all responses.
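As a sketch of that restructuring, the request below puts the task in the system message and restates the critical instruction at the end of the user turn. The model name and document text are placeholders, and the wording is only one way to do it.

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=256,
    system="You extract invoice totals. Respond with a single number.",  # context for every turn
    messages=[{
        "role": "user",
        "content": (
            "Extract the invoice total from the document below.\n\n"   # critical info up front
            "<document>\n...long invoice text...\n</document>\n\n"
            "Remember: respond with the total as a single number."     # restated at the end
        ),
    }],
)
print(response.content[0].text)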
Context limits: If inputs exceed the model's context window, Claude truncates or refuses. Check token counts for your full conversation history, not just individual messages. Structure long documents to fit within limits, or use retrieval techniques to fetch only relevant sections.
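One way to enforce this is to count tokens over the entire message list and drop (or summarize) the oldest turns until the conversation fits. This sketch assumes the Messages token-counting endpoint and a self-chosen budget below the hard 200K limit; note that each loop iteration makes an API call.

from anthropic import Anthropic

client = Anthropic()

def trim_history(messages, model, budget=150_000):
    # Drop the oldest user/assistant pair until the whole conversation fits the budget.
    while len(messages) > 2:
        count = client.messages.count_tokens(model=model, messages=messages)
        if count.input_tokens <= budget:
            break
        messages = messages[2:]  # remove turns in pairs so the history still starts with a user message
    return messages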
Understanding these internals turns debugging from guesswork into engineering.
References
- Attention Is All You Need - Vaswani et al., 2017. The paper that introduced transformer architecture.
- Anthropic Token Counting - Documentation on counting tokens with Claude
- Claude API Reference - Official API documentation