Ever wondered how AI models like ChatGPT actually read your prompts?
It’s not word by word. And it’s definitely not letter by letter.
The secret is something called a "token," and it’s one of the most important—and most misunderstood—concepts in AI.
Getting this right is the difference between paying $5 for a query that gives you garbage results, and paying 5 cents for one that gives you gold.
Ready to look under the hood?
Let's break it down.
So, What Exactly is a Token in the World of AI?
Think of tokens as the LEGO bricks of language for an AI.
When you give a Large Language Model (LLM) a prompt, it doesn't see words or sentences like we do. Instead, it breaks your text down into these little pieces called tokens.
These tokens can be a whole word, a part of a word, or even just a single character or punctuation mark.
For example, the sentence:
"The cat sat on the mat."
Might be broken down into these tokens:
["The", " cat", " sat", " on", " the", " mat", "."]
That’s 7 tokens. Simple enough.
But what about a more complex word?
"Unforgettable"
An LLM might see this as:
["Un", "forget", "able"]
Three tokens for one word. This is where it gets interesting. These smaller pieces, or subwords, are the key. They allow the model to understand grammar, context, and the relationships between different concepts. It can recognise that "un-" is a prefix that often negates something, and "-able" is a suffix that denotes a capability.
This is far more powerful than just looking at the word "unforgettable" as a single, opaque block.
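If you want to see how a real tokenizer splits these examples (the pieces won't always match the tidy illustrations above), you can check for yourself. Here's a quick sketch using OpenAI's tiktoken library, which comes up again later in this article; the exact splits depend on which model's vocabulary you load.

```python
# pip install tiktoken
import tiktoken

# One specific vocabulary (used by several recent OpenAI models).
# Other models (Gemini, Claude, ...) have their own tokenizers and will split differently.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["The cat sat on the mat.", "Unforgettable"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```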
"Wait, Why Not Just Use Words? Or Letters? Isn't it All 1s and 0s Anyway?"
This is a fantastic question, and it gets to the heart of why tokens are so clever.
Yes, at the very bottom layer, a computer only understands binary (1s and 0s). But to get from human language to binary, there are several layers of abstraction. Using tokens is a much more efficient and meaningful layer than just using words or letters. There's also an architecture called the Transformer (more on that later) that lets the model build context around those tokens, rather than treating the text as a bag of letters.
Here’s why tokens beat individual letters:
Imagine the model had to learn English one letter at a time. It would have to figure out that 'c', 'a', and 't' together mean a furry animal. Then it would have to learn that 'c', 'a', 'r' means a vehicle. This is incredibly inefficient. It would take an astronomical amount of data and processing power to learn the patterns of a language from scratch, letter by letter. It's like trying to understand a book by analyzing the ink molecules.
And here's why tokens beat whole words:
- Vocabulary Explosion: There are over 170,000 words in the English language, not to mention slang, names, and technical jargon. Creating a "dictionary" for every single word would be massive. Plus, what about new words? The model would be clueless if it encountered a word it hadn't seen before, like "yeet" or a new tech term.
- Understanding Relationships: By breaking words down (like un-forget-able), the model learns the morphology of the language. It understands that "running," "ran," and "runner" are all related to the core concept of "run." This is something a whole-word system would struggle with. It would see them as three completely separate things.
- Handling Typos & Variations: If you type "runnning," a whole-word model might fail. A token-based model can see tokens like ["runn"] and ["ning"] and make a much better guess at what you meant.
So, tokens are the perfect middle ground. They are small enough to be flexible and capture the building blocks of language, but large enough to be meaningful and efficient.
Okay, So How is a Token Actually Calculated? Why Isn't it a Fixed Size?
This is where it gets a bit technical, but the concept is cool. LLMs use an algorithm called a Tokenizer. The most common ones are things like Byte-Pair Encoding (BPE) or WordPiece (used by Google's BERT).
Here’s a 5-year-old’s explanation of how it works:
Imagine you have a giant book with every word ever written.
- Start with Characters: First, the tokenizer's dictionary is just every individual character (a, b, c, d..., A, B, C..., !, ?, ...).
- Find the Most Common Pair: The algorithm then reads the entire book and finds the pair of characters that appears most often. Let's say it's 't' and 'h'.
- Merge Them: It merges 't' and 'h' into a new token, "th", and adds it to its dictionary.
- Repeat, Repeat, Repeat: Now it scans the book again. Maybe the most common pair is now "th" and "e". So it creates a new token, "the", and adds that to the dictionary. Then maybe it finds "in" and "g" are super common, so it creates "ing".
It keeps doing this thousands of times. It merges common pairs of characters or existing tokens to create new, longer tokens.
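To make that merge loop concrete, here's a toy sketch of BPE training in Python. It's heavily simplified (real tokenizers work on bytes, treat whitespace specially, and run tens of thousands of merges over enormous datasets), but the "find the most common pair, merge it, repeat" core is the same.

```python
from collections import Counter

# A tiny "book": each word is a tuple of symbols, with a count of how often it appears.
# Real BPE trains on gigabytes of text; this is just to show the merge loop.
corpus = Counter({
    ("t", "h", "e"): 50,
    ("t", "h", "i", "s"): 20,
    ("c", "a", "t"): 10,
    ("s", "a", "t"): 8,
})

def most_common_pair(corpus):
    """Count every adjacent pair of symbols across the corpus and return the winner."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = Counter()
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])  # e.g. "t" + "h" -> "th"
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] += freq
    return merged

vocab = set(s for word in corpus for s in word)  # start with single characters
for step in range(5):                            # real tokenizers do tens of thousands of merges
    pair = most_common_pair(corpus)
    if pair is None:
        break
    vocab.add(pair[0] + pair[1])
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair} -> {pair[0] + pair[1]}")
```

Run it and you'll see exactly the story above: 't' and 'h' merge into "th" first, then "th" and "e" merge into "the".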
This is why token length varies.
- Common words like "the", "and", "you" quickly become single tokens because they appear so frequently.
- Less common words like "lexicographer" might be broken into ["lex", "ico", "grapher"] because those smaller parts (lex, grapher) appear in other words too (lexicon, photographer).
- Total nonsense words like "asdfghjkl" would be broken down into individual letters ["a", "s", "d", "f", "g", "h", "j", "k", "l"] because those pairs never got merged.
This clever process creates a super-efficient dictionary that is biased towards the most common patterns in a language.
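You can watch that bias in action. This sketch uses one specific OpenAI vocabulary via tiktoken, so other models will split these words a little differently:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "lexicographer", "asdfghjkl"]:
    pieces = [enc.decode([t]) for t in enc.encode(word)]
    print(f"{word!r}: {len(pieces)} token(s) -> {pieces}")

# Expect roughly: the common word stays whole, the rare word splits into a few
# subwords, and the keyboard mash splits into many small pieces (not necessarily
# single letters, since byte-level BPE still merges some frequent letter pairs).
```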
Why Do Different LLMs Have Different Token Sizes?
Because they were trained differently!
Each AI company (OpenAI, Google, Anthropic, etc.) creates its own tokenizer. They might use a slightly different algorithm (BPE vs. WordPiece) or, more importantly, they train it on a different massive dataset of text and code.
- An LLM trained heavily on scientific papers might have single tokens for words like "hydrolysis" or "isomorphism".
- An LLM trained more on casual internet text from Reddit and Twitter might have single tokens for things like "lol" or "imo".
This is also why a sentence that is 10 tokens in OpenAI's GPT-4 might be 12 tokens in Google's Gemini or 9 tokens in Anthropic's Claude. Their "dictionaries" are unique to them.
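You can get a feel for this even without access to Google's or Anthropic's tokenizers by comparing two different OpenAI-era vocabularies: the same sentence comes out as a different number of tokens. Treat this purely as a stand-in; Gemini and Claude have their own tokenizers that you'd query through their own tools.

```python
import tiktoken

sentence = "Tokenization differs between models, lol."

# An older GPT-2/GPT-3 era vocabulary vs. a newer GPT-4 era one.
for name in ["r50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(sentence))} tokens")
```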
This Explains Why They Charge Based on Tokens!
Exactly. You’ve got it.
Charging per token is the most direct way to measure the amount of computational work the model has to do.
Think of it like this: every single token you send to the model, and every single token it generates back, has to be processed. This processing involves incredibly complex calculations on powerful, expensive computer chips (GPUs).
- More tokens in your prompt (input) = more work the model has to do to understand your request.
- More tokens in its response (output) = more work the model has to do to generate the text.
So, charging per token is a fair and direct measure of resource consumption. It's like paying for electricity by the kilowatt-hour.
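As a back-of-the-envelope sketch (the per-token prices below are made up for illustration, not any provider's real rates), the bill is simply input tokens times the input rate plus output tokens times the output rate:

```python
# Hypothetical prices, purely for illustration. Check your provider's pricing page.
PRICE_PER_1K_INPUT_TOKENS = 0.01   # dollars
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # dollars

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost = input tokens * input rate + output tokens * output rate."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# A 2,000-token prompt that produces a 500-token answer:
print(f"${estimate_cost(2000, 500):.4f}")  # $0.0350 at these made-up rates
```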
I Keep Hearing About "Transformers." What's That?
Okay, simplest possible explanation, as promised.
Imagine you're in a library and you ask the librarian for a book about "sad, blue kings who fought dragons."
A basic librarian might just look for books with the words "sad," "blue," "king," and "dragon."
A Transformer is like a magical librarian. It doesn't just see the words; it understands the relationships between them. It knows that "blue" in this context probably means sad, not the color. It pays special "attention" to how "king" relates to "fought" and how "fought" relates to "dragons."
The Transformer architecture is the underlying engine that allows the model to weigh the importance of every token relative to every other token in your prompt. It's what gives the LLM its sense of context and nuance.
That's it for now. We'll do a deep dive on this in another article, because it's a rabbit hole of its own!
How Do Tokens Relate to the "Context Window"?
This is a crucial concept.
The context window is simply the maximum number of tokens the model can "remember" at one time. It's the model's short-term memory.
Let's say a model has a context window of 4,000 tokens.
This means the total number of tokens in your prompt plus the tokens in the model's generated response cannot exceed 4,000.
If you have a long conversation with a chatbot, the entire conversation history is being fed back into the model with every new message you send.
Example:
- You: "My name is Bob." (5 tokens)
- AI: "Nice to meet you, Bob." (6 tokens)
- You: "What is my name?" (5 tokens)
To answer your second question, the model isn't just seeing "What is my name?". It's actually seeing this entire block of text:
"User: My name is Bob. Assistant: Nice to meet you, Bob. User: What is my name?"
(Total: roughly 16 tokens, not counting the role labels)
It sees the full context. Now, imagine this conversation goes on for pages and pages. Once the total token count exceeds that 4,000-token limit, the model starts to forget the beginning of the conversation. This is why a chatbot might forget your name or instructions you gave it 30 minutes ago. The earliest tokens have been pushed out of its memory.
Larger context windows (some models now have 200,000 or even 1 million tokens!) are a huge deal because they allow the model to "read" and process entire books, codebases, or financial reports at once, leading to much deeper understanding and analysis.
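Here's a minimal sketch of how a chat app might keep a conversation inside a fixed token budget: once the history would overflow the window, the oldest messages get dropped first, which is exactly the "forgetting" described above. I'm counting tokens with tiktoken and ignoring the small per-message overhead that real chat APIs add.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_CONTEXT_TOKENS = 4000  # the 4,000-token window from the example above

def count_tokens(message: str) -> int:
    return len(enc.encode(message))

def trim_history(history: list[str]) -> list[str]:
    """Drop the oldest messages until the whole conversation fits the window."""
    while history and sum(count_tokens(m) for m in history) > MAX_CONTEXT_TOKENS:
        history = history[1:]  # the earliest message "falls out" of the model's memory
    return history

history = [
    "User: My name is Bob.",
    "Assistant: Nice to meet you, Bob.",
    "User: What is my name?",
]
history = trim_history(history)  # everything still fits, so nothing is forgotten yet
print(sum(count_tokens(m) for m in history), "tokens in the window")
```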
What About Images, Audio, and Video? How Are They Tokenized?
This is the frontier of AI right now with multimodal models. The principle is similar, but the "tokenization" method is different.
- Images: An image isn't broken into word-like chunks. Instead, it's typically broken down into a grid of small patches. Think of it like turning a high-res photo into a mosaic of tiny square images. Each of these patches is then converted into a numerical representation (a vector) that acts as a "token." The model then learns the patterns and relationships between these image patch tokens, just like it learns relationships between text tokens.
- Audio: Audio is a waveform. It can be tokenized by taking very short snippets of the sound, say 25 milliseconds long. Each snippet is converted into a numerical representation, and these become the audio "tokens" that the model processes.
- Video: Video is just a sequence of images (frames) plus an audio track. So, it's tokenized by applying the image and audio tokenization methods to its component parts.
The magic is that these models are learning to map the "image tokens" and "audio tokens" to the same conceptual space as the "text tokens." This is how you can give it a picture of a cat and it can output the text ["a", " cat", " sitting", " on", " a", " windowsill"].
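To make the image case concrete, here's a toy numpy sketch that chops a fake image into 16x16 patches. Real vision models (ViT-style) then push each flattened patch through a learned projection to turn it into a token vector, a step I'm skipping here.

```python
import numpy as np

# A fake 224x224 RGB "photo" (many vision models resize inputs to a fixed size like this).
image = np.random.rand(224, 224, 3)

PATCH = 16
h, w, c = image.shape
patches = (
    image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
         .transpose(0, 2, 1, 3, 4)        # group the 14x14 grid of patches together
         .reshape(-1, PATCH * PATCH * c)  # one flattened row per patch
)
print(patches.shape)  # (196, 768): 196 image "tokens", each a list of 16*16*3 numbers
```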
Is an Emoji a Single Token? 🤔
It depends! But usually, yes.
Most modern tokenizers are smart enough to recognize common emojis as single, distinct units. The tokenizer algorithm, during its training, would have seen the poop emoji 💩 so many times that it learned to make it a single token.
However, a very new or obscure emoji (or a multi-part one built from several Unicode code points, like some flags and skin-tone variants) might be broken down into multiple tokens. But for the ones you use every day? It's safe to assume they are one token.
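Don't take my word for it, though: check with the tokenizer you actually use. With tiktoken, for example:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for emoji in ["💩", "🤔", "🏳️‍🌈"]:  # the flag is a multi-codepoint sequence
    print(emoji, "->", len(enc.encode(emoji)), "token(s)")
# Counts vary by tokenizer: some emoji are a single token, others span several byte-level tokens.
```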
How Can I Count Tokens Before I Send a Prompt?
This is a pro move. Counting tokens helps you estimate costs and ensure your prompt fits within the context window.
You can't just count words. As we've seen, "Unforgettable" is one word but could be three tokens. A good rule of thumb for English text is that 1 word is approximately 1.3 tokens.
But for an exact count, you need to use the specific tokenizer for the model you're using.
The best way: Most AI providers have online tools for this.
- OpenAI's Tokenizer: They have a web interface where you can paste your text and it will show you the exact tokens and count.
- Anthropic & Google: Also have similar tools available in their documentation.
If you're a developer, you can use libraries like tiktoken (for OpenAI models) in your code to programmatically count tokens before making an API call.
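A minimal example, assuming the tiktoken library mentioned above (other providers expose their own counting tools and endpoints):

```python
import tiktoken

# encoding_for_model picks the vocabulary that matches a given OpenAI model name.
enc = tiktoken.encoding_for_model("gpt-4")

prompt = "Summarize the following meeting notes into five unforgettable bullet points."
n_tokens = len(enc.encode(prompt))
n_words = len(prompt.split())

print(f"{n_words} words -> {n_tokens} tokens (~{n_tokens / n_words:.2f} tokens per word)")
```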
How Can I Reduce the Number of Tokens in My Prompts? (And Save Money, ofc)
Token efficiency is a skill. Getting the same or better result with fewer tokens is the goal.
Here are the top strategies:
- Be Concise. Remove Fluff. Don't write "It would be greatly appreciated if you could possibly provide me with a list of..." Just write "List the..." The model doesn't have feelings and doesn't care about pleasantries.
- Use Shorter Words. "Utilize" vs. "use". "Subsequently" vs. "next".
- Iterate and Refine. Look at your prompt. Is every word doing a job? If not, cut it.
- Use Examples Wisely. For "few-shot" prompting (giving the AI a few examples to follow), make your examples as short and clear as possible.
- Process Data Beforehand. If you're analyzing text, remove boilerplate, HTML tags, or irrelevant sections before you send it to the LLM. Don't pay the AI to read stuff you don't care about.
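To illustrate that last point, here's a rough sketch that strips HTML tags and collapses whitespace before counting what you'd actually pay for. A regex is a crude way to clean HTML (use a proper parser for real work), but it makes the savings obvious:

```python
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

raw_html = """
<div class="article"><nav>Home | About | Login</nav>
  <h1>Q3 Update</h1><p>Revenue grew <b>12%</b> quarter over quarter.</p>
  <footer>© 2024 Example Corp. All rights reserved.</footer></div>
"""

text = re.sub(r"<[^>]+>", " ", raw_html)   # drop the tags (crude, but fine for a demo)
text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print("before:", len(enc.encode(raw_html)), "tokens")
print("after: ", len(enc.encode(text)), "tokens")
```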
The Ultimate Token-Saving Strategy: RAG
There's a more advanced technique that is crucial for building powerful AI applications: Retrieval-Augmented Generation (RAG).
Imagine you want to build a chatbot that can answer questions about your company's internal documents (hundreds of pages).
The Naive (and expensive) way: Stuff all 500 pages of documents into the prompt's context window every single time a user asks a question. That's easily hundreds of thousands of tokens per question: incredibly expensive, slow, and it might not even fit in the context window.
The RAG (and smart) way:
- Index: You first use a cheaper AI model to "index" all your documents and store them in a special database (a vector database).
- Retrieve: When a user asks a question, like "What was our Q3 revenue?", you first do a quick, cheap search in your database to find the most relevant paragraphs from your documents.
- Augment & Generate: You then feed only those few relevant paragraphs to the expensive, powerful LLM along with the user's question. Your prompt looks like this:
"Using the following context, answer the user's question.
Context: [Paste the 2-3 relevant paragraphs about Q3 revenue here]
User Question: What was our Q3 revenue?"
You've just shrunk a 500-page problem (hundreds of thousands of tokens) into a prompt that's maybe 500 tokens long. It's faster, hundreds of times cheaper, and often gives more accurate answers.
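Here's a toy end-to-end sketch of that flow. To keep it self-contained I've swapped the embedding model and vector database for plain keyword overlap, and the "documents" are three hard-coded chunks; a real RAG pipeline would embed the chunks and store them in a vector database instead.

```python
import re

# The "Index" step, done once up front. In a real system these chunks would be
# embedded and stored in a vector database.
chunks = [
    "Q3 revenue was $4.2M, up 12% quarter over quarter.",
    "The office plants were replaced in September.",
    "Headcount grew from 40 to 46 during Q3.",
]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """The 'Retrieve' step: rank stored chunks by how relevant they look to the question."""
    ranked = sorted(chunks, key=lambda c: len(words(question) & words(c)), reverse=True)
    return ranked[:top_k]

question = "What was our Q3 revenue?"
context = "\n".join(retrieve(question))  # the "Augment" step: only the relevant bits
prompt = (
    "Using the following context, answer the user's question.\n"
    f"Context: {context}\n"
    f"User Question: {question}"
)
print(prompt)  # this short prompt is what actually goes to the expensive LLM
```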
RAG is a foundational concept for building serious AI tools. If you're interested in learning how to implement it, you should definitely read our deep-dive article on RAG here.
Will Tokens Always Be a Thing? Or is a "Token-Free" World Coming?
This is a fascinating question about the future of AI.
For the foreseeable future, yes, tokens (or a very similar concept) are here to stay. They are a fundamentally efficient way to bridge the gap between continuous human language and the discrete, mathematical world of a computer.
However, the research is always pushing forward. Some future architectures might operate on a more "continuous" understanding of data, potentially moving beyond discrete tokens. But for now, and for the next several years, understanding tokens is the key to mastering LLMs.
The models will get better, the tokenizers will get more efficient, and the context windows will get larger, but the core principle will remain. The person who understands how to "speak token" to an AI will always have an advantage.