How Prompt Priming Shapes LLM Responses
What Priming Does
Every token gets converted to a vector (a list of numbers representing meaning). The model uses these vectors to determine what comes next. When you start with "You are an expert Python developer," those tokens create vectors that cluster near Python concepts, best practices, and technical terminology.
The model's attention mechanism weights these priming vectors when generating each new token. This shifts the probability distribution—making Python-related tokens more likely, Java-related tokens less likely.
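To see this shift concretely, here is a minimal sketch, assuming the Hugging Face transformers library and GPT-2 as a small public stand-in model; the prompts and the candidate token are illustrative choices, not taken from any production system. It compares the next-token probability of a code-flavored word with and without a priming prefix:

```python
# Minimal sketch: how a priming prefix shifts the next-token distribution.
# Assumes `pip install transformers torch`; GPT-2 is used only because it is
# small and public. Prompts and the candidate token are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_prob(prompt: str, candidate: str) -> float:
    """Probability assigned to the first token of `candidate` at the next position."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]   # logits over the whole vocabulary
    probs = torch.softmax(logits, dim=-1)
    candidate_id = tokenizer(candidate, add_special_tokens=False).input_ids[0]
    return probs[candidate_id].item()

unprimed = "Write a function that"
primed = "You are an expert Python developer. Write a function that"

for prompt in (unprimed, primed):
    print(f"P(' parses' | {prompt!r}) = {next_token_prob(prompt, ' parses'):.6f}")
```

The exact numbers depend on the model, but the primed prompt should typically assign more probability mass to code-flavored continuations.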
Real-World Example: AI Coding Tools
Claude Code and Cursor feel different from ChatGPT because they use different priming. Every conversation starts with a system prompt you never see.
Claude Code system prompt (simplified):
You are Claude Code, Anthropic's official CLI for Claude.
You are an interactive CLI tool that helps users with software engineering tasks.
Your responses should be short and concise.
Use specialized tools instead of bash commands when possible.

This priming makes Claude Code:
- Use technical, engineer-to-engineer tone
- Prefer file operation tools over bash cat/grep commands
- Keep responses concise rather than conversational
- Focus on code rather than general chat
Cursor uses different priming:
You are an AI coding assistant integrated into a code editor.
You can see the user's current file and cursor position.
Suggest code completions and edits contextually.

This priming makes Cursor:
- Provide inline suggestions rather than full explanations
- Reference the current file context
- Format responses as code edits
Same underlying model architecture, different behavior—because the priming vectors position each conversation in different regions of the embedding space.
When to Use Priming
- Setting technical depth for an entire conversation ("expert" vs "beginner")
- Enforcing consistent code style across multiple requests ("concise" vs "verbose with comments")
- Establishing domain-specific terminology (medical, legal, financial contexts)
- Controlling response format preferences (bullet points, paragraphs, code-only)
- Building AI tools with specific personalities (like Claude Code's concise engineer tone); a sketch of reusable priming presets follows this list
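As a minimal sketch of how those use cases turn into reusable priming strings (the preset names, wordings, and the `build_messages` helper are illustrative assumptions, not prompts or APIs from any real product):

```python
# Illustrative priming presets for the use cases above; the wording is an
# assumption to be tuned for your own tool, not an official prompt.
PRIMING_PRESETS = {
    "expert_depth":  "You are a senior Python developer. Assume the reader knows the language; skip basics.",
    "concise_style": "Keep code minimal and idiomatic. Add comments only where the logic is non-obvious.",
    "legal_domain":  "You assist with contract review. Use precise legal terminology and name the relevant clause.",
    "code_only":     "Respond with a single runnable code block and nothing else.",
}

def build_messages(preset: str, user_request: str) -> list[dict]:
    """Hypothetical helper: prepend the chosen priming as a system message."""
    return [
        {"role": "system", "content": PRIMING_PRESETS[preset]},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("code_only", "Deduplicate a list while preserving order.")
```

Because the priming lives in the system message, every later turn in the same conversation is generated with that context still in place.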
Why This Matters
Priming works because:
- Vector similarity determines token selection
- Early tokens influence attention weights for all subsequent tokens
- The model's embedding space clusters related concepts together
- Your priming establishes which cluster to sample from (the sketch below makes this concrete)
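A rough way to see that clustering, assuming the sentence-transformers library and the public all-MiniLM-L6-v2 model as an accessible stand-in for an LLM's internal representations (which you cannot inspect directly); the candidate phrases are made up for illustration:

```python
# Sketch: priming text sits closer in embedding space to the concepts it
# should activate. Assumes `pip install sentence-transformers`; this external
# embedding model is only a proxy for the LLM's own embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

priming = "You are an expert Python developer."
candidates = [
    "list comprehensions and generator expressions",
    "unit tests with pytest fixtures",
    "sourdough starter hydration ratios",
    "medieval French poetry",
]

vectors = model.encode([priming] + candidates, convert_to_tensor=True)
similarities = util.cos_sim(vectors[0], vectors[1:])[0]

# Expect the Python-related phrases to score noticeably higher than the rest.
for text, score in zip(candidates, similarities):
    print(f"{score.item():.3f}  {text}")
```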
Use priming to set:
- Technical depth ("expert" vs "beginner")
- Response style ("concise" vs "detailed explanations")
- Output format ("code only" vs "code with commentary")
- Domain context ("API documentation" vs "learning tutorial")
The technique works across all transformer-based models. System messages in APIs are essentially pre-applied priming that persists across the conversation.
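For example, here is a hedged sketch using the Anthropic Python SDK; the client and its `system` parameter are real, but the model name is a placeholder and the two prompts are illustrative:

```python
# Sketch: a system message is priming that gets re-applied on every turn.
# Assumes `pip install anthropic` and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPTS = {
    "terse_engineer": "You are a senior engineer. Answer in short, technical bullet points.",
    "patient_teacher": "You are a friendly teacher. Explain concepts step by step for a beginner.",
}

question = "What does Python's functools.lru_cache do?"

for name, system_prompt in SYSTEM_PROMPTS.items():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: use whichever model you actually run
        max_tokens=300,
        system=system_prompt,              # the priming, applied to this conversation
        messages=[{"role": "user", "content": question}],
    )
    print(f"--- {name} ---")
    print(response.content[0].text)
```

Because the `system` string is sent with every request, the priming persists across the whole conversation even as the user messages change.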
References
- Efficient Estimation of Word Representations in Vector Space - Mikolov et al., Google (Word2Vec paper with famous king-queen vector arithmetic)
- The Illustrated Word2vec - Jay Alammar's visual guide to word embeddings and vector space
- TensorFlow Embedding Projector - Interactive 3D visualization of word embeddings
- Attention Is All You Need - Vaswani et al., Google Research
- How LLMs Think and Respond - Token generation and vectors
- Prompting Techniques That Actually Work - Practical prompting strategies