How LLMs Think and Respond

3 min read

LLMs generate text one token at a time. A token is a chunk of text that could be a whole word, part of a word, or a single character. The tokenizer learned which sequences appear together frequently during training—common words like "the" become single tokens, while less common words split into smaller pieces.

"I love strawberries" becomes ["I", " love", " straw", "berries"]—four tokens. "strawberries" splits into "straw" and "berries" because that's how the tokenizer learned to handle that word.

To predict the next token, the model needs to understand what those tokens mean. So it converts each one into a vector—a list of numbers that represents that token's meaning. "Cat" and "dog" get similar numbers because they're both animals. "Cat" and "planet" get very different numbers.

Now the model has your input as a series of number lists. To generate a response, it looks at all those numbers and calculates: which past tokens should influence what I generate next? This is attention. It's math that assigns importance scores.

Using those scores, the model predicts probabilities for every possible next token. Then it picks one, adds it to the context, and repeats.

Vectors Encode Meaning

Each token becomes a vector—typically a list of hundreds or thousands of numbers. These numbers position the token in a mathematical space where similar meanings cluster together.

The classic example: vector arithmetic works on meaning. king - man + woman ≈ queen. The numbers representing "king", minus the numbers for "man", plus the numbers for "woman", land near the numbers for "queen". The model learned these relationships from patterns in training data.
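
A minimal sketch of that arithmetic with hand-made 4-dimensional vectors (real embeddings have hundreds or thousands of learned dimensions; these numbers are invented for illustration):

    import numpy as np

    # Toy embeddings: dimensions loosely encode [royalty, maleness, femaleness, animacy].
    emb = {
        "king":  np.array([0.9, 0.8, 0.1, 0.7]),
        "queen": np.array([0.9, 0.1, 0.8, 0.7]),
        "man":   np.array([0.1, 0.8, 0.1, 0.7]),
        "woman": np.array([0.1, 0.1, 0.8, 0.7]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    target = emb["king"] - emb["man"] + emb["woman"]
    # The vocabulary word whose vector lies closest to the result:
    print(max(emb, key=lambda w: cosine(emb[w], target)))   # queen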

This is how the model processes language. Text becomes math. Semantic relationships become distances and directions in numerical space.

Attention Weighs Context

When predicting the next token, the model doesn't treat all previous tokens equally. Attention calculates relevance scores between each position and every other position.

For each token being generated, attention determines: which past tokens matter most for this prediction? The model assigns importance scores, creating a weighted representation of the context.
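
A minimal sketch of scaled dot-product attention, the standard formulation, with random matrices standing in for the learned query, key, and value projections:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # relevance of every position to every other
        weights = softmax(scores, axis=-1)        # importance scores; each row sums to 1
        return weights @ V, weights               # weighted mix of the value vectors

    rng = np.random.default_rng(0)
    seq_len, d = 4, 8                             # 4 tokens, 8-dimensional vectors
    Q = rng.normal(size=(seq_len, d))             # queries: "what am I looking for?"
    K = rng.normal(size=(seq_len, d))             # keys:    "what do I contain?"
    V = rng.normal(size=(seq_len, d))             # values:  "what do I pass along?"

    out, weights = attention(Q, K, V)
    print(weights.round(2))                       # one row of importance scores per token

In a decoder-only LLM, a causal mask also blocks attention to future positions, so each token can only look at the tokens before it.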

Transformers can reference any part of the input directly: the attention path from token 1 to token 1000 is just as direct as the path between adjacent tokens. This is different from older sequential models (RNNs), which passed information step-by-step and degraded the signal over long distances. Transformers eliminated that degradation.

Predicting the Next Token

Using the weighted context from attention, the model outputs a probability distribution over all possible next tokens in its vocabulary. Every token gets a probability score.

The sampling strategy determines which token gets picked:

  • Greedy: always pick the highest probability
  • Temperature: rescale the distribution before sampling; higher values flatten it and add randomness, lower values make the pick more deterministic
  • Top-k/top-p: limit sampling to the most likely candidates

The chosen token gets added to the context. The process repeats. Each new token influences the probabilities for the next one. The full response builds one token at a time.
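
A minimal sketch of that loop with greedy, temperature, and top-k sampling. The next_token_logits function and the tiny vocabulary are stand-ins invented for illustration; a real model computes the logits from the full pipeline above:

    import numpy as np

    VOCAB = ["I", " love", " straw", "berries", " cats", "."]   # toy vocabulary
    rng = np.random.default_rng(0)

    def next_token_logits(context):
        # Stand-in for a real model: one score per vocabulary token,
        # varied with the context so each step sees different probabilities.
        return np.random.default_rng(len("".join(context))).normal(size=len(VOCAB))

    def sample(logits, temperature=1.0, top_k=None):
        if temperature == 0:
            return int(np.argmax(logits))                  # greedy: highest score wins
        logits = logits / temperature                      # <1 sharpens, >1 flattens
        if top_k is not None:
            cutoff = np.sort(logits)[-top_k]               # keep only the k best candidates
            logits = np.where(logits >= cutoff, logits, -np.inf)
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()                        # softmax -> probabilities
        return int(rng.choice(len(VOCAB), p=probs))

    context = ["I"]
    for _ in range(5):
        token_id = sample(next_token_logits(context), temperature=0.8, top_k=3)
        context.append(VOCAB[token_id])                    # the chosen token joins the context
    print("".join(context))

Top-p works the same way, except the candidate set is cut off by cumulative probability rather than by a fixed count.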

Why This Matters

Understanding this pipeline explains model behavior. When outputs seem inconsistent, check tokenization—different splits create different vector sequences. When the model ignores context, the attention weights may be spread over the wrong tokens. When responses feel repetitive, sampling parameters need adjustment.

Text becomes vectors, attention weighs which vectors matter for predicting the next token, and transformers can reference any part of the input without degradation. That's the core mechanism.
