If you've used Claude, ChatGPT, or Gemini, you've used an LLM, a Large Language Model. Understanding what they actually do under the hood turns out to be useful, because it explains why they're great at some things, bad at others, and occasionally confidently wrong.
No math. No ML jargon. Just what's actually happening.
The Core Mechanic: Predicting the Next Word
An LLM has one basic job: given a sequence of words so far, predict what word should come next.
That's it. That's the whole trick.
You type a prompt. The model predicts the most likely first word of a response, writes it, then uses everything it's seen so far (your prompt + that first word) to predict the next word. Then the next. Then the next. One piece at a time, until it decides the response is done.
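That loop is simple enough to sketch. The version below is a toy, not a real model: `predict_next` stands in for the actual prediction step, which in a real LLM scores every token in a huge vocabulary. Everything here (the function names, the tiny vocabulary, the `<end>` marker) is illustrative.

```python
import random

def predict_next(tokens):
    # Stand-in for the model. A real LLM looks at everything so far
    # and scores every token in its vocabulary; here we just sample
    # from a tiny canned vocabulary so the loop is runnable.
    vocabulary = ["the", "model", "predicts", "words", "<end>"]
    return random.choice(vocabulary)

def generate(prompt_tokens, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next(tokens)  # prediction sees the prompt
        if next_token == "<end>":          # plus everything generated so far
            break                          # model decides the response is done
        tokens.append(next_token)          # the new token becomes input
    return tokens

print(generate(["Once", "upon", "a", "time"]))
```

The important detail is the feedback: each generated token is appended to the input before the next prediction, which is why the output stays coherent with itself.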
The "magic" comes from scale. Train a model large enough, on enough text, and "predict the next word" turns out to require a remarkably broad understanding of language, logic, and the world. The model has to know grammar, vocabulary, facts, conventions, styles, reasoning patterns, all of it, to predict the right next word across every domain it's ever seen.
Tokens: The Actual Units
One piece of vocabulary worth learning: tokens.
LLMs don't think in words exactly. They think in tokens: chunks of text that average about 3/4 of a word. Common short words are a single token; longer or rarer words get split into two or three.
Why this matters for you:
- Pricing is per token, so longer prompts and longer responses cost more
- Context windows are measured in tokens: a 200K context window holds roughly 150,000 words of input and output combined
- Rate limits are usually expressed as tokens per minute
Practically: 1,000 tokens is roughly 750 words. A single-page document is ~500 tokens. A 50-page PDF is ~25,000 tokens.
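That arithmetic is easy to sketch. The 0.75 words-per-token ratio below is the rule of thumb from above, not an exact count; real tokenizers vary with language and content, and the function name is just an illustration.

```python
def estimate_tokens(word_count, words_per_token=0.75):
    # Rule of thumb: 1 token is about 3/4 of a word,
    # so 1,000 tokens is roughly 750 words.
    return round(word_count / words_per_token)

print(estimate_tokens(750))      # -> 1000
print(estimate_tokens(150_000))  # -> 200000, a full 200K context window
```

If you need real numbers for billing or rate limits, use the provider's tokenizer rather than an estimate.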
Why LLMs Sometimes Make Things Up
The technical term is hallucination. It happens when an LLM produces something that sounds right but isn't: made-up citations, fabricated quotes, invented statistics, non-existent features.
Here's why this isn't a bug you can fully patch:
The model isn't looking things up. It's predicting what sounds plausible based on patterns in its training data. When you ask "What's the name of the CEO of [small company]?" and the answer wasn't well-represented in training, the model has no reliable way to say "I don't know." It produces its statistically best guess and presents it with the same confidence as a well-known fact.
The practical rule: never trust an LLM's factual output about something specific without verification. The model is wildly useful for reasoning, drafting, and pattern-matching. It is an unreliable source of facts unless you've given it the source material directly (see what is RAG).
Context Windows: How Much the Model Can "See"
The context window is the maximum amount of text the model can process in a single request. This includes your prompt, any documents you've attached, the conversation history, and the generated response, all of it combined.
Larger context windows enable capabilities that weren't practical a year ago: you can paste an entire repository and ask questions, or dump a 40-hour transcript archive and ask for patterns. But larger isn't automatically better. Bigger context means slower inference and higher cost. Use the window you actually need.
What's Actually Happening When You Chat With One
A useful mental model:
- You type a prompt.
- The model reads your prompt + any system instructions + any attached knowledge, all together.
- The model generates a response token-by-token, predicting each one based on everything it's seen so far.
- When you reply, the model reads the entire conversation again from the top; it has no memory between messages beyond what's visible in the context window.
- Once the conversation exceeds the context window, the earliest messages get dropped. The model forgets them.
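The last step above, dropping the earliest messages once the window overflows, can be sketched as a simple budget loop. The function name and the word-count stand-in for a tokenizer are illustrative assumptions, not any real API:

```python
def fit_to_context(system_prompt, messages, window=200_000,
                   count_tokens=lambda text: len(text.split())):
    # Keep the system prompt plus as many of the *most recent*
    # messages as fit in the window; the oldest are dropped first.
    # count_tokens is a crude word-count stand-in for a tokenizer.
    budget = window - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                       # everything older is forgotten
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))
```

Real chat products vary (some summarize old messages instead of dropping them outright), but note that the system prompt survives no matter how long the conversation gets.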
This is why a good system prompt (see how to write a prompt) matters so much: it's always in the model's view. A throwaway detail you mentioned 20 messages ago is probably gone.
Related Reading
Ready to put this into practice? The Claude Project Starter Pack has 6 role-specific configurations you can paste and use immediately.