What is a Token?

Definition

A token is the fundamental unit of text processing in large language models. Tokens can represent whole words, parts of words, individual characters, or subword units, depending on the tokenization algorithm. AI models like Claude Sonnet or OpenAI GPT-5 do not process raw text directly; instead, they convert text into tokens, which are then mapped to numerical vectors for computation.

The tokenization process splits text based on statistical patterns learned from training data: common character sequences are merged into single tokens, while rare words and specialized terms may split into multiple tokens.
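
You can observe this directly. The snippet below is a minimal sketch using OpenAI's open-source tiktoken library (installable with pip install tiktoken); the cl100k_base encoding is just one example vocabulary, and other models use different ones:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A frequent word typically maps to a single token ID...
print(len(enc.encode("the")))
# ...while a rare word splits into several subword tokens.
print(len(enc.encode("antidisestablishmentarianism")))
```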

Practical Examples

The sentence "I love strawberries" might tokenize as:

```
['I', ' love', ' straw', 'berries']
```

Note that spaces are often included as part of tokens, and words may split at morpheme boundaries.
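
You can inspect these boundaries yourself. The sketch below again assumes tiktoken with the cl100k_base encoding (the exact split may differ from the illustrative example above); it decodes each token ID individually so the attached spaces become visible:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Decode each token ID on its own; repr() makes leading spaces visible.
for token_id in enc.encode("I love strawberries"):
    print(repr(enc.decode([token_id])))
```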

Understanding tokens matters because they shape both cost and model behavior (see the sketch after this list):

  • API costs are calculated per token
  • Context windows are measured in tokens, not characters
  • Different inputs with identical meaning may consume different token counts
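
Because billing is per token, token counts translate directly into cost. The following sketch estimates input cost under stated assumptions: it reuses tiktoken for counting, and the price per million tokens is a hypothetical placeholder, not any provider's actual rate:

```python
import tiktoken

# Hypothetical price: $3.00 per million input tokens (placeholder, not a real rate).
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def estimate_input_cost(text: str) -> float:
    """Rough input-cost estimate: token count times the per-token price."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    return n_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"${estimate_input_cost('I love strawberries'):.6f}")
```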

When debugging unexpected model behavior, examining tokenization is often the first diagnostic step. Tools like OpenAI's Tokenizer or Anthropic's token counting APIs allow developers to visualize how their text will be processed.
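
As one example of the latter, the sketch below assumes a recent version of the anthropic Python SDK and an ANTHROPIC_API_KEY set in the environment; the model string is one published Claude model identifier:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Count tokens for a message without actually running the model.
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "I love strawberries"}],
)
print(count.input_tokens)
```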