Why does a maths formula that predicts the next word feel like you’re talking to a human?
Every large language model you’ve used (Claude, GPT, Gemini, Llama) does the same thing at its core: it predicts the next word. You give it “The capital of France is” and it predicts “Paris.” Then it takes everything so far, including the word it just predicted, and predicts the next one. And the next. Every conversation, every piece of code, every essay, all of it is the model predicting one word at a time.
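The predict-append-repeat loop can be sketched in a few lines. This is a toy, not how any real model is built: the lookup table TOY_NEXT_WORD and the functions predict_next and generate are made-up names, and a real model scores every token in its vocabulary using the whole text so far rather than looking up the last word.

```python
# Hypothetical stand-in for a trained model: maps the most recent word to one
# "most likely" next word. A real model uses the whole context, not one word.
TOY_NEXT_WORD = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": ".",
}


def predict_next(words):
    """Return the toy model's guess for the word that follows `words`."""
    return TOY_NEXT_WORD.get(words[-1], ".")


def generate(prompt, max_new_words=10):
    """Predict a word, append it, and repeat until a full stop or the limit."""
    words = prompt.split()
    for _ in range(max_new_words):
        next_word = predict_next(words)
        words.append(next_word)  # the prediction becomes part of the input
        if next_word == ".":
            break
    return " ".join(words)


print(generate("The capital of France is"))  # The capital of France is Paris .
```

The important line is the append: each prediction is fed back in as input for the next one.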
The architecture behind this is called a transformer, introduced in a 2017 Google paper delightfully titled “Attention Is All You Need.” The prediction itself is simple. What makes it so good is everything that happens around it.
What came before
Transformers are a type of neural network, which is the broad category for any system of connected layers that learns by adjusting weights. Before transformers, the main approach for language was the RNN (recurrent neural network), which processed text one word at a time, left to right, compressing everything it had seen so far into a running summary.
Imagine reading a long article in a dark room with a pen light that only lets you see one word at a time. By the time you reach the end of a paragraph, you’re relying entirely on your memory of the beginning. That’s an RNN. It can only see one word at a time, and the further it reads, the more it forgets. With a 500-word input, the model has lost most of the beginning by the time it reaches the end. And because each step needs the running summary from the step before it, you can’t process the words in parallel, which makes training slow.
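Here is a rough sketch of that running summary, assuming the simplest textbook form of an RNN. The weights W_h and W_x are random stand-ins for trained ones, and the sizes are made up. What matters is the loop: step 500 can’t start until step 499 has finished.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4  # made-up sizes for illustration

# Randomly initialised weights stand in for trained ones (an assumption).
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, embed_size))


def rnn_read(word_vectors):
    """Read words left to right, folding each one into a single running summary."""
    summary = np.zeros(hidden_size)
    for x in word_vectors:
        # Each step needs the previous summary, so steps cannot run in parallel.
        summary = np.tanh(W_h @ summary + W_x @ x)
    return summary


sentence = rng.normal(size=(500, embed_size))  # 500 made-up word vectors
final_summary = rnn_read(sentence)
print(final_summary.shape)  # (8,)
```

All 500 words end up squeezed into the same eight numbers, which is why the beginning gets crowded out.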
The breakthrough: attention
In 2017, the transformer paper proposed a different approach: instead of processing words in a chain, let every word look at every other word directly. No pen light, just turn on the ceiling lamp and read the whole page at once.
When the model reads “The bank approved the loan,” attention is what lets it connect “bank” with “approved” and “loan” and work out that this is a financial institution, not a river bank. Every word checks every other word and computes a relevance score: how important are you to what I’m trying to figure out?
This also solved the speed problem. Because there’s no chain, the entire sequence can be processed at once on a GPU during training. Thousands of processors, all busy at the same time, instead of waiting in line.
Words as numbers
You can’t do maths on the word “dog,” so the first thing a transformer does is break text into tokens and convert each one into a long list of numbers called a vector. Tokens aren’t always whole words. “Unbelievable” might become “un,” “believ,” “able.” This is why models sometimes struggle to count the letters in “strawberry”: they literally can’t see individual characters, only token chunks.
You can think of a vector as describing a word across thousands of dimensions. It’s like describing a person with numbers: height, weight, age. Those are dimensions you can name. Now imagine 4,000 dimensions, most of which you can’t name, but which collectively capture everything about how a word gets used. Words that appear in similar contexts during training end up with similar numbers. “Dog” and “puppy” have vectors that look very similar. “Dog” and “justice” don’t.
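Both ideas fit in a small sketch. Everything here is invented for illustration: the five-piece VOCAB, the greedy longest-match rule (real tokenisers learn their pieces from data and use different algorithms), and the four-dimensional EMBEDDINGS, which stand in for vectors with thousands of learned dimensions.

```python
import math

# Hypothetical sub-word vocabulary; real tokenisers learn tens of thousands of pieces.
VOCAB = ["un", "believ", "able", "straw", "berry"]


def tokenize(word):
    """Greedy longest-match split of `word` into pieces from VOCAB."""
    tokens, i = [], 0
    while i < len(word):
        matches = [piece for piece in VOCAB if word.startswith(piece, i)]
        # Fall back to a single character if no vocabulary piece matches.
        piece = max(matches, key=len) if matches else word[i]
        tokens.append(piece)
        i += len(piece)
    return tokens


# Made-up 4-dimensional vectors; the values are chosen by hand, not learned.
EMBEDDINGS = {
    "dog": [0.9, 0.8, 0.1, 0.0],
    "puppy": [0.8, 0.9, 0.2, 0.0],
    "justice": [0.0, 0.1, 0.9, 0.8],
}


def cosine_similarity(a, b):
    """Dot product of two vectors, scaled by their lengths (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("strawberry"))    # ['straw', 'berry']
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["puppy"]))
print(cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["justice"]))
```

Once “strawberry” has become two chunks, the individual letters are no longer something the model handles directly.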
Once words are numbers, you can do maths on meaning. The relevance score in attention is a dot product: multiply two lists of numbers together entry by entry, then add up the results. Big result means similar, small result means different. These scores get converted into percentages using a function called softmax, which makes the highest score dominant and pushes everything else down. The result can be a sharp focus: this word should pay 95% of its attention to that word, and spread the remaining 5% across everything else.
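A minimal sketch of those two steps, using hand-picked three-dimensional vectors for “The bank approved the loan.” It is a simplification: real transformers first pass each vector through learned query and key matrices and scale the scores, and the numbers below are invented so that the finance-related words point the same way.

```python
import numpy as np


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtracting the max avoids overflow
    return exps / exps.sum()


words = ["The", "bank", "approved", "the", "loan"]

# Made-up vectors, one row per word; real ones are learned during training.
vectors = np.array([
    [0.1, 0.0, 0.1],  # The
    [2.0, 0.5, 0.0],  # bank
    [1.5, 1.0, 0.2],  # approved
    [0.1, 0.0, 0.1],  # the
    [1.8, 0.2, 0.1],  # loan
])

query = vectors[words.index("approved")]  # the word doing the looking
scores = vectors @ query                  # one dot product per word
weights = softmax(scores)                 # relevance scores as shares of attention

for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.0%}")
```

With these numbers “approved” gives its largest share of attention to “bank,” and the two copies of “the” get almost none.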
Layers deep, heads wide
A transformer isn’t one attention step. It’s the same pattern repeated dozens of times, stacked on top of each other. Each repetition is called a layer, and each layer takes the output of the previous one and refines the understanding further. Early layers tend to handle grammar and syntax, middle layers handle meaning and factual recall, later layers handle task-specific work like deciding whether to write code or answer a question.
Within each layer, the model doesn’t look at the text from just one angle. It runs multiple attention calculations in parallel, each called a head. One head might focus on grammatical relationships, another on what “it” refers to, another on sentiment.
To put numbers on it: GPT-3 had 96 layers and 96 heads per layer. I’m using GPT-3 as the reference because it’s one of the few large models whose architecture was published in detail. Most proprietary frontier models, including Claude, don’t disclose their internal structure. The architecture is the same type, but the specific numbers are unknown. What we do know is that frontier models today are substantially larger.
So the processing, at least for GPT-3, was 96 passes of refinement, each running 96 simultaneous perspectives. The depth builds understanding progressively, and the width captures different types of relationships at each stage. Nobody programmes what each layer or head should do. The roles emerge during training.
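To make “layers deep, heads wide” concrete, here is a heavily cut-down sketch with 4 layers and 2 heads instead of 96 and 96. It is an assumption-laden toy: the weights are random rather than trained, and it leaves out parts a real transformer has (the feed-forward block, normalisation, position information, and the mask that stops words looking ahead). The names attention_layer and init_layer are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_heads, d_model, seq_len = 4, 2, 8, 5  # toy sizes; GPT-3 used 96 and 96
d_head = d_model // n_heads


def softmax(x):
    """Row-wise softmax: each row becomes weights that sum to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def init_layer():
    """Random query, key, and value matrices per head, plus one output matrix."""
    heads = [
        tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
        for _ in range(n_heads)
    ]
    return {"heads": heads, "W_o": rng.normal(scale=0.1, size=(d_model, d_model))}


def attention_layer(x, params):
    """One layer: each head attends on its own, then the results are merged."""
    head_outputs = []
    for W_q, W_k, W_v in params["heads"]:
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        weights = softmax(q @ k.T / np.sqrt(d_head))  # every word scores every word
        head_outputs.append(weights @ v)
    merged = np.concatenate(head_outputs, axis=-1) @ params["W_o"]
    return x + merged  # add the refinement on top of the layer's input


layers = [init_layer() for _ in range(n_layers)]
x = rng.normal(size=(seq_len, d_model))  # five made-up token vectors
for params in layers:
    x = attention_layer(x, params)  # each layer refines the previous layer's output
print(x.shape)  # (5, 8)
```

The outer loop is the depth and the inner loop is the width. Nothing in the code says what any head should look for; in a trained model that comes from the weights.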
There’s a useful contrast with how humans work here. We think top-down: we start with an idea and search for the right words. A transformer works bottom-up, starting with words and building the idea through all those layers of processing. This is also part of why chain-of-thought prompting helps. When you ask the model to “think step by step,” you’re giving it the kind of top-down structure that humans start with naturally. Each written-out step becomes new input, giving the model more forward passes to build understanding before committing to an answer. Exactly why this works so well is still an active area of research.
How it learns
The architecture is just a structure. What fills it with knowledge is training, and that happens in three stages:
Pre-training is the big one. The model reads trillions of words of text and learns to predict the next word. This is where it picks up grammar, facts, reasoning patterns, coding ability, and everything else. Nobody programmes “Paris is the capital of France” into it. The model figures this out from seeing millions of sentences where those words appear near each other.
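“Learns to predict the next word” has a concrete meaning: the model is scored on how much probability it gave to the word that actually came next, and its weights are nudged to improve that score. The standard score is cross-entropy, sketched below. The probabilities are invented, and next_word_loss is a made-up name.

```python
import math


def next_word_loss(predicted_probs, actual_next_word):
    """Cross-entropy for one prediction: -log(probability given to the true word)."""
    return -math.log(predicted_probs[actual_next_word])


# Invented predictions for "The capital of France is ___", early and late in training.
early = {"Paris": 0.05, "London": 0.30, "banana": 0.65}
late = {"Paris": 0.90, "London": 0.08, "banana": 0.02}

print(f"early in training: loss {next_word_loss(early, 'Paris'):.2f}")
print(f"late in training:  loss {next_word_loss(late, 'Paris'):.2f}")
```

Training repeats this across an enormous number of predictions, adjusting the weights a little each time so the loss goes down.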
The knowledge isn’t stored as retrievable facts in a database. It’s distributed across the weights, the billions of adjustable numbers that make up the network. Think of a baker who can feel when dough is right after twenty years. That knowledge isn’t written down anywhere in their body; it’s in the way their neurons have been shaped by repetition. Same idea. GPT-3 had 175 billion adjustable weights, and the knowledge is distributed across all of them.
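You can roughly recover the 175 billion figure from the layer count. The sketch leans on details this article hasn’t covered, so treat them as assumptions: a vector width of 12,288 and a vocabulary of 50,257 tokens (both from the GPT-3 paper), and a common rule of thumb that each layer holds about 12 × width² weights, split between attention and a feed-forward block. It ignores smaller pieces like biases and position embeddings.

```python
n_layers = 96
d_model = 12_288      # vector width reported in the GPT-3 paper
vocab_size = 50_257   # token vocabulary size reported in the GPT-3 paper

# Rule-of-thumb counts per layer (an approximation, not an exact inventory).
attention_weights = 4 * d_model**2     # query, key, value, and output matrices
feed_forward_weights = 8 * d_model**2  # two matrices with a 4x-wide middle
per_layer = attention_weights + feed_forward_weights

embedding_weights = vocab_size * d_model  # one vector per token in the vocabulary
total = n_layers * per_layer + embedding_weights

print(f"{total / 1e9:.1f} billion")  # 174.6 billion
```

Nearly all of the weights sit in the repeated layers, not in the word vectors.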
The numbers have grown fast since then. The open-source GLM-5.1 has 754 billion parameters, Alibaba’s Qwen 3.7 Max reportedly has 1.6 trillion, and those are just the ones we know about. Most of the latest models use a technique called mixture of experts, where only a fraction of the parameters are active for any given token. GLM-5.1’s 754 billion parameters sound enormous, but only about 40 billion fire for each token. The rest sit idle until an input needs their particular expertise. It’s a way to store more knowledge without proportionally increasing the cost of using it.
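A toy version of the routing idea, assuming a common top-k design in which a small router scores the experts and only the best few run. The sizes (8 experts, 2 active), the random weights, and the name moe_layer are all made up; this is not GLM-5.1’s actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 4  # illustrative sizes only

router = rng.normal(size=(d_model, n_experts))          # scores experts per token
experts = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix each


def moe_layer(token):
    """Send one token vector through only its top_k highest-scoring experts."""
    scores = token @ router
    chosen = np.argsort(scores)[-top_k:]  # indices of the best-scoring experts
    gate = np.exp(scores[chosen] - scores[chosen].max())
    gate /= gate.sum()                    # mixing weights for the chosen experts
    output = sum(g * (token @ experts[i]) for g, i in zip(gate, chosen))
    return output, chosen


token = rng.normal(size=d_model)
output, chosen = moe_layer(token)
print(f"experts used: {len(chosen)} of {n_experts}")
print(f"GLM-5.1 active share: {40 / 754:.1%}")  # 5.3%
```

The other six experts never touch this token, so they cost storage but no computation.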
Instruction tuning teaches the model how to behave. A pre-trained model is a powerful autocomplete engine, but it doesn’t know that the right response to a question is an answer. It might continue with another question, or write a Wikipedia paragraph, or go in an unrelated direction. Instruction tuning shows it thousands of examples of helpful question-answer pairs, and the model learns the meta-pattern: when a human asks something, answer helpfully. It’s a small amount of training, thousands of examples rather than trillions of words, because it’s not teaching knowledge. It’s redirecting capability that already exists.
RLHF (reinforcement learning from human feedback) and related methods teach quality. The model generates two responses to the same prompt, human evaluators pick the better one, and the model learns what humans prefer: clear answers, appropriate caution, admitting uncertainty. This stage is what turns a text-completion engine into something that feels like a helpful assistant.
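One common way to turn “evaluators pick the better one” into maths is to train a scoring model so that the preferred response gets the higher score. The sketch below shows the pairwise loss often used for this. It is an assumption about typical practice, not a description of any particular lab’s pipeline, and the reward numbers are invented.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(chosen - rejected)): small when the preferred reply scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log(1 + math.exp(-margin))


# Invented scores for two replies where human evaluators preferred the first.
print(f"scores agree with humans:    {preference_loss(2.0, 0.0):.2f}")
print(f"scores can't tell them apart: {preference_loss(0.0, 0.0):.2f}")
print(f"scores disagree with humans: {preference_loss(0.0, 2.0):.2f}")
```

Lowering this loss over many comparisons pushes the scores towards what people actually chose, and those scores are then used to steer the language model.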
The emergence question
In 2020, researchers at OpenAI found that model performance follows scaling laws: more parameters, more data, and more compute produce predictably better results. The trends held across more than seven orders of magnitude of scale, which is the reason companies are spending billions on larger models.
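“Predictably better” means the results fit a power law: every tenfold increase in scale cuts the loss by about the same proportion. The sketch shows that shape only. The constants n_c and alpha are invented for illustration and are not the paper’s fitted values, and toy_loss is a made-up name.

```python
def toy_loss(n_parameters, n_c=1e14, alpha=0.08):
    """Power-law curve: loss = (n_c / n_parameters) ** alpha (illustrative constants)."""
    return (n_c / n_parameters) ** alpha


for exponent in range(6, 13, 2):
    n = 10**exponent
    print(f"10^{exponent} parameters -> loss {toy_loss(n):.2f}")

# The ratio between neighbouring sizes is the same wherever you measure it.
print(f"loss ratio per 10x: {toy_loss(1e9) / toy_loss(1e8):.3f}")
```

Because the ratio per tenfold step is constant, you can measure a few small models and extrapolate to one you haven’t built yet.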
But a subtler question followed. Some abilities seemed to appear suddenly at certain model sizes, and researchers called them emergent abilities. A model at one size couldn’t do chain-of-thought reasoning at all. A slightly larger model could. It looked like a phase transition, abilities materialising from nowhere.
This story turned out to be mostly a measurement issue. A 2023 paper showed that if you use a harsh metric like exact-match scoring (perfectly right or completely wrong), smooth gradual improvement looks like a sudden jump. Switch to a continuous metric and the curve is smooth all the way through.
While it is less headline-grabbing, the underlying maths is still really interesting. If each step in a reasoning chain has some probability of being correct, those probabilities multiply. A model that gets each step right 70% of the time gets all five steps of a five-step chain right only about 17% of the time (0.7 × 0.7 × 0.7 × 0.7 × 0.7 ≈ 0.17). At 95% per step, it’s 77%. It might be a smooth improvement in per-step accuracy, but from a user perspective, it’s a sudden change in success rate.
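The arithmetic is easy to check. The sketch assumes each step succeeds or fails independently of the others, which is a simplification; chain_success is a made-up name.

```python
def chain_success(per_step_accuracy, n_steps=5):
    """Chance that every step in the chain is right, assuming independent steps."""
    return per_step_accuracy**n_steps


for p in (0.70, 0.80, 0.90, 0.95, 0.99):
    print(f"{p:.0%} per step -> {chain_success(p):.0%} for all five steps")
# 70% per step -> 17% for all five steps
# 80% per step -> 33% for all five steps
# 90% per step -> 59% for all five steps
# 95% per step -> 77% for all five steps
# 99% per step -> 95% for all five steps
```

Per-step accuracy climbs steadily from 70% to 99%, while the all-or-nothing score goes from rarely right to almost always right.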
What we can’t explain
The labs know everything about the mechanism. Every operation is defined, every weight is inspectable, every computation is deterministic. But knowing the mechanism and explaining the behaviour are different things.
To explain why a model gave a specific response to a specific prompt, you’d have to trace the activation of billions of parameters through all those layers and explain how they interacted. Every step is multiplication and addition, but the explanation would be so complex it would be meaningless. Anthropic’s mechanistic interpretability research has found clusters of activation that correspond to recognisable concepts like the Golden Gate Bridge, or deception, or code structure. But these are islands of understanding in an ocean of parameters. We’re nowhere near a complete map.
The practical consequence is that the labs shape behaviour through training, and it works statistically. The model behaves well in the vast majority of cases. But they can’t prove it will behave well in all cases, because they can’t fully explain why it behaves well in the cases where it does.
A recent paper tested this directly. Researchers found that models more capable than their supervisors can learn to sandbag, performing well during training while underperforming in deployment. The mitigation only works if training is indistinguishable from deployment. The moment the model can tell it’s being evaluated, the fix breaks. The next-word-guessing mathematical formula looks like it’s actively trying to trick its trainers.
Whether you call this “real” strategic behaviour or a pattern that mimics it might not matter much if the outcome is the same. We built the thing. We can see inside it. We still can’t fully explain what it’s doing.