What is a Token?
Definition
A token is the fundamental unit of text processing in large language models. Tokens can represent whole words, subword fragments, or individual characters, depending on the tokenization algorithm. AI models like Claude Sonnet or OpenAI GPT-5 do not process raw text directly; instead, they convert text into tokens, which are then mapped to numerical vectors for computation.
The tokenization process splits text according to statistical patterns learned from training data: common character sequences are merged into single tokens, while rare words or specialized terms may be split into multiple tokens.
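As a rough illustration, the open-source tiktoken library can show how common and rare words split. This is a minimal sketch; the "cl100k_base" vocabulary used here is only an example and is not the tokenizer of any particular production model.

```python
# Sketch using the open-source tiktoken library; "cl100k_base" is just one
# example BPE vocabulary, not the tokenizer of any specific model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "strawberry", "antidisestablishmentarianism"]:
    ids = enc.encode(text)                     # text -> list of token IDs
    pieces = [enc.decode([i]) for i in ids]    # decode each ID back to a string
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")
```

Short, common words typically map to a single token, while long or rare words decompose into several smaller pieces.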
Practical Examples
The sentence "I love strawberries" might tokenize as:
```
['I', ' love', ' straw', 'berries']
```
Note that spaces are often included as part of tokens, and words may split at morpheme boundaries.
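You can reproduce a split like this with a tokenizer library. The sketch below assumes the Hugging Face transformers package and the GPT-2 tokenizer; the exact pieces, and the "Ġ" marker GPT-2 uses for leading spaces, are specific to that vocabulary.

```python
# Sketch assuming Hugging Face transformers; the GPT-2 vocabulary is only an
# example, and other tokenizers will split the sentence differently.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
pieces = tok.tokenize("I love strawberries")
print(pieces)  # GPT-2 marks a leading space with "Ġ", e.g. ['I', 'Ġlove', ...]
```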
It's important to understand tokens because they affect cost, context capacity, and model behavior, as the sketch after this list illustrates:
- API costs are calculated per token
- Context windows are measured in tokens, not characters
- Different inputs with identical meaning may consume different token counts
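The sketch below shows how these points surface in code, again using tiktoken. The per-token price is a made-up placeholder, not a real rate; actual pricing varies by provider and model.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens under one particular BPE vocabulary."""
    return len(enc.encode(text))

# Two prompts with the same meaning can consume different numbers of tokens.
short = "Summarize this report."
wordy = "Kindly furnish a summarization of the aforementioned report."
print(count_tokens(short), "vs", count_tokens(wordy))

# Rough input-cost estimate; the price below is a placeholder, not a real rate.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical USD figure
cost = count_tokens(wordy) / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"estimated input cost: ${cost:.6f}")
```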
When debugging unexpected model behavior, examining tokenization is often the first diagnostic step. Tools like OpenAI's Tokenizer or Anthropic's token counting APIs allow developers to visualize how their text will be processed.
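For Claude models, token counts can also be requested server-side. The following is a hedged sketch using the Anthropic Python SDK; the method name, model identifier, and response field reflect the SDK at the time of writing, so verify them against the current documentation before relying on them.

```python
# Hedged sketch of server-side token counting with the Anthropic Python SDK.
# Verify the method, model identifier, and response fields against current docs.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

result = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",  # assumed model ID; substitute a current one
    messages=[{"role": "user", "content": "I love strawberries"}],
)
print(result.input_tokens)
```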
Resources
- OpenAI Tokenizer - Interactive tool for visualizing tokenization
- Hugging Face Tokenizers Documentation - Comprehensive guide to tokenization algorithms
- Anthropic's Token Counting Guide - Claude model token limits and counting