If you've used Claude, ChatGPT, or Gemini, you've used an LLM, a Large Language Model. Understanding what they actually do under the hood turns out to be useful, because it explains why they're great at some things, bad at others, and occasionally confidently wrong.
No math. No ML jargon. Just what's actually happening.
The Core Mechanic: Predicting the Next Word
An LLM has one basic job: given a sequence of words so far, predict what word should come next.
That's it. That's the whole trick.
You type a prompt. The model predicts the most likely first word of a response, writes it, then uses everything it's seen so far (your prompt + that first word) to predict the next word. Then the next. Then the next. One piece at a time, until it decides the response is done.
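That loop is simple enough to sketch. The version below is a toy, not a real model: `predict_next` stands in for the actual prediction step, which in a real LLM scores every token in a huge vocabulary. Everything here (the function names, the tiny vocabulary, the `<end>` marker) is illustrative.

```python
import random

def predict_next(tokens):
    # Stand-in for the model. A real LLM looks at everything so far
    # and scores every token in its vocabulary; here we just sample
    # from a tiny canned vocabulary so the loop is runnable.
    vocabulary = ["the", "model", "predicts", "words", "<end>"]
    return random.choice(vocabulary)

def generate(prompt_tokens, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next(tokens)  # prediction sees the prompt
        if next_token == "<end>":          # plus everything generated so far
            break                          # model decides the response is done
        tokens.append(next_token)          # the new token becomes input
    return tokens

print(generate(["Once", "upon", "a", "time"]))
```

The important detail is the feedback: each generated token is appended to the input before the next prediction, which is why the output stays coherent with itself.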
The "magic" comes from scale. Train a model large enough, on enough text, and "predict the next word" turns out to require a remarkably broad understanding of language, logic, and the world. The model has to know grammar, vocabulary, facts, conventions, styles, reasoning patterns, all of it, to predict the right next word across every domain it's ever seen.
Tokens: The Actual Units
One piece of vocabulary worth learning: tokens.
LLMs don't think in words exactly. They think in tokens: chunks of text that average about 3/4 of a word. Common short words are a single token; longer or rarer words get split into two or three.
Why this matters for you:
- Pricing is per token, so longer prompts and longer responses cost more
- Context windows are measured in tokens: a 200K context window holds roughly 150,000 words of input and output combined
- Rate limits are usually expressed as tokens per minute
Practically: 1,000 tokens is roughly 750 words. A single-page document is ~500 tokens. A 50-page PDF is ~25,000 tokens.
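That arithmetic is easy to sketch. The 0.75 words-per-token ratio below is the rule of thumb from above, not an exact count; real tokenizers vary with language and content, and the function name is just an illustration.

```python
def estimate_tokens(word_count, words_per_token=0.75):
    # Rule of thumb: 1 token is about 3/4 of a word,
    # so 1,000 tokens is roughly 750 words.
    return round(word_count / words_per_token)

print(estimate_tokens(750))      # -> 1000
print(estimate_tokens(150_000))  # -> 200000, a full 200K context window
```

If you need real numbers for billing or rate limits, use the provider's tokenizer rather than an estimate.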
Why LLMs Sometimes Make Things Up
The technical term is hallucination. It happens when an LLM produces something that sounds right but isn't: made-up citations, fabricated quotes, invented statistics, non-existent features.
Here's why this isn't a bug you can fully patch:
The model isn't looking things up. It's predicting what sounds plausible based on patterns in its training data. When you ask "What's the name of the CEO of [small company]?" and the answer wasn't well-represented in training, the model has no reliable way to say "I don't know." It produces its statistically best guess and presents it with the same confidence as a well-known fact.
The practical rule: never trust an LLM's factual output about something specific without verification. The model is wildly useful for reasoning, drafting, and pattern-matching. It is an unreliable source of facts unless you've given it the source material directly (see what is RAG).
Context Windows: How Much the Model Can "See"
The context window is the maximum amount of text the model can process in a single request. This includes your prompt, any documents you've attached, the conversation history, and the generated response, all of it combined.
Larger context windows enable capabilities that weren't practical a year ago: you can paste an entire repository and ask questions, or dump a 40-hour transcript archive and ask for patterns. But larger isn't automatically better. Bigger context means slower inference and higher cost. Use the window you actually need.
What's Actually Happening When You Chat With One
A useful mental model:
- You type a prompt.
- The model reads your prompt + any system instructions + any attached knowledge, all together.
- The model generates a response token-by-token, predicting each one based on everything it's seen so far.
- When you reply, the model reads the entire conversation again from the top; it has no memory between messages beyond what's visible in the context window.
- Once the conversation exceeds the context window, the earliest messages get dropped. The model forgets them.
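The last step above, dropping the earliest messages once the window overflows, can be sketched as a simple budget loop. The function name and the word-count stand-in for a tokenizer are illustrative assumptions, not any real API:

```python
def fit_to_context(system_prompt, messages, window=200_000,
                   count_tokens=lambda text: len(text.split())):
    # Keep the system prompt plus as many of the *most recent*
    # messages as fit in the window; the oldest are dropped first.
    # count_tokens is a crude word-count stand-in for a tokenizer.
    budget = window - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                       # everything older is forgotten
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))
```

Real chat products vary (some summarize old messages instead of dropping them outright), but note that the system prompt survives no matter how long the conversation gets.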
This is why a good system prompt (see how to write a prompt) matters so much: it's always in the model's view. A throwaway detail you mentioned 20 messages ago is probably gone.
Related Reading
Ready to put this into practice? The Claude Project Starter Pack has 6 role-specific configurations you can paste and use immediately.