What is a Token?
Definition
A token is the fundamental unit of text processing in large language models. Tokens can represent whole words, subword fragments, or individual characters, depending on the tokenization algorithm. AI models like Claude Sonnet or OpenAI GPT-5 do not process raw text directly; instead, they convert text into tokens, which are then mapped to numerical vectors for computation.
The tokenization process splits text according to statistical patterns learned from training data: common character sequences are merged into single tokens, while rare words or specialized terms may be split into multiple tokens.
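As a rough illustration, the open-source tiktoken library can show how common and rare words split. This is a minimal sketch; the "cl100k_base" vocabulary used here is only an example and is not the tokenizer of any particular production model.

```python
# Sketch using the open-source tiktoken library; "cl100k_base" is just one
# example BPE vocabulary, not the tokenizer of any specific model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "strawberry", "antidisestablishmentarianism"]:
    ids = enc.encode(text)                     # text -> list of token IDs
    pieces = [enc.decode([i]) for i in ids]    # decode each ID back to a string
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")
```

Short, common words typically map to a single token, while long or rare words decompose into several smaller pieces.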
Practical Examples
The sentence "I love strawberries" might tokenize as:
```
['I', ' love', ' straw', 'berries']
```
Note that spaces are often included as part of tokens, and words may split at morpheme boundaries.
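You can reproduce a split like this with a tokenizer library. The sketch below assumes the Hugging Face transformers package and the GPT-2 tokenizer; the exact pieces, and the "Ġ" marker GPT-2 uses for leading spaces, are specific to that vocabulary.

```python
# Sketch assuming Hugging Face transformers; the GPT-2 vocabulary is only an
# example, and other tokenizers will split the sentence differently.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
pieces = tok.tokenize("I love strawberries")
print(pieces)  # GPT-2 marks a leading space with "Ġ", e.g. ['I', 'Ġlove', ...]
```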
It's important to understand tokens because they affect cost, context capacity, and model behavior, as the sketch after this list illustrates:
- API costs are calculated per token
- Context windows are measured in tokens, not characters
- Different inputs with identical meaning may consume different token counts
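The sketch below shows how these points surface in code, again using tiktoken. The per-token price is a made-up placeholder, not a real rate; actual pricing varies by provider and model.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens under one particular BPE vocabulary."""
    return len(enc.encode(text))

# Two prompts with the same meaning can consume different numbers of tokens.
short = "Summarize this report."
wordy = "Kindly furnish a summarization of the aforementioned report."
print(count_tokens(short), "vs", count_tokens(wordy))

# Rough input-cost estimate; the price below is a placeholder, not a real rate.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical USD figure
cost = count_tokens(wordy) / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"estimated input cost: ${cost:.6f}")
```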
When debugging unexpected model behavior, examining tokenization is often the first diagnostic step. Tools like OpenAI's Tokenizer or Anthropic's token counting APIs allow developers to visualize how their text will be processed.
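For Claude models, token counts can also be requested server-side. The following is a hedged sketch using the Anthropic Python SDK; the method name, model identifier, and response field reflect the SDK at the time of writing, so verify them against the current documentation before relying on them.

```python
# Hedged sketch of server-side token counting with the Anthropic Python SDK.
# Verify the method, model identifier, and response fields against current docs.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

result = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",  # assumed model ID; substitute a current one
    messages=[{"role": "user", "content": "I love strawberries"}],
)
print(result.input_tokens)
```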
Resources
- OpenAI Tokenizer - Interactive tool for visualizing tokenization
- Hugging Face Tokenizers Documentation - Comprehensive guide to tokenization algorithms
- Anthropic's Token Counting Guide - Claude model token limits and counting