A tour through the geometry behind every modern AI that reads or writes — using a word puzzle you've probably already played.
Ever played Contexto? Each day the game picks one secret word, and you guess. Each guess returns a number — your rank. Lower is closer.
Here's a real game from yesterday. The secret word was ivory. Thirty-one guesses came in. Numbers ranged from 2 to 7,321.
A computer assigned every one of those numbers in milliseconds, and they're almost always defensible. Silk at rank 4. Predator at rank 7,321. Marble — apparently — at rank 27, beating the actual big cats. Why?
Here's how the game's own help page describes itself:
Three phrases. Three things to understand. The rest of this page unpacks them, in that order.
By the end you'll be able to read that screenshot like a sentence. Let's go.
Nobody sits down and types in coordinates for every word. The model reads — Wikipedia, books, most of the internet — and notices which words tend to appear near which others.
Take the ivory board. Why does marble sit at rank 27, closer to the secret than cheetah at 788? Look at the company each word keeps:
Do this over billions of words of text and the whole map of language arranges itself automatically. The model never sees a definition. It sees only co-occurrence — and that turns out to be enough.
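If you want to see the machinery, here's a minimal sketch of that counting step. The three-sentence corpus and the window size are invented for illustration; real training runs over billions of words and feeds the counts into a model rather than printing them.

```python
from collections import Counter

# A toy "corpus" standing in for billions of words of real text.
# The sentences are invented purely for illustration.
corpus = [
    "the ivory statue was carved from smooth white marble",
    "silk and satin feel smooth and look pale",
    "the cheetah is a predator that hunts in the jungle",
]

window = 4  # how many following words count as "company"
pair_counts = Counter()

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for neighbour in words[i + 1 : i + 1 + window]:
            pair_counts[tuple(sorted((word, neighbour)))] += 1

# Words that keep similar company end up with similar count patterns,
# and those patterns are what gets turned into coordinates.
print(pair_counts.most_common(5))
```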
You shall know a word by the company it keeps. — J. R. Firth, 1957
Linguists have been saying this for seventy years. Contexto's twenty-six-word self-description in 2026 is just a paraphrase of Firth's ten-word sentence from 1957.
Imagine drawing the result. Words that share lots of context get pulled close together. Words that don't, drift apart.
Here's what that looks like for a small set including the ivory board's words. Hover any word to see what the model thinks is closest to it. Stronger lines mean closer in the geometry.
What you're seeing is the geometric form of Firth's quote — context, translated into space. Now we just need to turn closeness into a number.
Once words have positions, you can measure how close any two are. The standard measurement isn't straight-line distance — it's the angle between the arrows pointing at them. Or more precisely, the cosine of that angle. Let's read the ivory game in those terms:
Small angle, big cosine, low rank. Big angle, small cosine, high rank. That's the entire engine.
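In code, the measurement is a one-liner. The 2D positions below are invented just to show the arithmetic; real vectors are far longer, but the formula doesn't change.

```python
import numpy as np

# Invented 2D positions, purely to show the measurement itself.
ivory    = np.array([0.90, 0.35])
silk     = np.array([0.80, 0.45])   # points in almost the same direction as ivory
predator = np.array([-0.30, 0.95])  # points somewhere else entirely

def cosine(a, b):
    """Cosine of the angle between two arrows drawn from the origin."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(ivory, silk))      # ~0.99: small angle, big cosine, low rank
print(cosine(ivory, predator))  # ~0.07: big angle, small cosine, high rank
```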
If you compute the cosine for every pair of words, you get a giant grid. Here's a tiny corner of it for our key words:
The cosine is one number per pair. So how does predator → 7,321 happen?
predator → 7,321 means: there are exactly 7,320 other words in the dictionary with a higher cosine to ivory than predator has. Nothing more, nothing less.

So rank is not a magical similarity score. It's the cosine, sorted. Distances all the way down.
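Here's the whole pipeline in miniature, with invented 2D stand-ins for a handful of words: compute every word's cosine to the secret, sort, and a word's position in that sorted list is its rank.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented 2D stand-ins; a real game does exactly this over its whole dictionary.
secret = np.array([0.90, 0.35])            # ivory
guesses = {
    "silk":     np.array([0.85, 0.40]),
    "marble":   np.array([0.80, 0.55]),
    "cheetah":  np.array([0.30, 0.90]),
    "predator": np.array([-0.30, 0.95]),
}

# Rank is nothing but the cosine, sorted (best first).
ranked = sorted(guesses, key=lambda w: cosine(guesses[w], secret), reverse=True)
for rank, word in enumerate(ranked, start=2):   # rank 1 is the secret word itself
    print(rank, word, round(cosine(guesses[word], secret), 3))
```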
Below is a tiny replica of Contexto with a known secret word. Five guesses are queued. Click Try guess to play one — the geometry on the left shows where each guess lands relative to the secret, and the board on the right keeps the running rank.
Notice what just played out: the model wasn't checking categories ("is this tableware? a liquid?"). It was just measuring how much each guess shares a context-neighbourhood with the secret. cup wins not because cup is "near" tea conceptually, but because the phrase cup of tea is one of the densest co-occurring pairs in English.
So far we've talked about words having "positions." Let's open one up.
A position is just a list of numbers, where each number is loosely the word's answer to one question. To make this concrete, we'll ask cat five questions and watch its vector get filled in, slot by slot. Click play.
Two honest caveats.
First — cat isn't really a perfect 1 on "is it an animal?" In some sentences it's slang for a person ("hep cat"); occasionally it's a verb ("to cat about"). So the answers aren't really yes-or-no, they're degrees. The real values look more like [0.98, 0.96, 0.04, 0.92, 0.20] — mostly noun, mostly animal, almost-never a person, slightly-informal in tone.
Second — real models don't use five questions. They use thousands; GPT-3 uses 12,288. And the questions aren't designed by hand; the model invents them during training, and most are entangled mixtures we can't easily name. Visualised, the full vector for cat looks more like:
The "questions" framing is a useful fiction. It builds the right intuition without being literally true.
Every map you've seen on this page has been flat. That's a teaching tool — useful, but a lie.
Think of a photograph. A person standing in front of a mountain looks like they're right next to the mountain. The photo crushes depth — what was miles of distance becomes a couple of inches on paper. The mountain wasn't actually next to anyone; the camera just lost a whole axis of information.
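You can put numbers on the photograph problem. The 3D coordinates below are invented, but the effect is the point: drop one axis and two things that are far apart collapse onto each other.

```python
import numpy as np

def distance(a, b):
    return np.linalg.norm(a - b)

# Invented 3D positions: the person and the mountain share roughly the same
# x and y, but are far apart in depth (z).
person   = np.array([1.0, 2.0,  0.0])
mountain = np.array([1.2, 2.1, 50.0])

# A photo (or a flat map) simply drops the depth axis.
flat_person, flat_mountain = person[:2], mountain[:2]

print(distance(person, mountain))            # ~50.0: genuinely far apart
print(distance(flat_person, flat_mountain))  # ~0.22: looks right next door
```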
Words have the same problem. A 2D map can put two unrelated words next to each other by accident. Click below to see depth come back.
So real word-distance isn't physical closeness on a flat map. It's logical agreement across many independent questions. Imagine the model has implicitly learned to answer a handful of questions about every word — is it a material? is it pale-coloured? is it smooth? is it an animal? is it threatening? — and a word's "position" is nothing more than its answers.
Stack those questions like a deck of transparent slabs. Each slab is one dimension. A word lives on every slab at once, with one dot marking where it sits. Tap any comparison below — the slabs will peel apart so you can see ivory and a second word plotted across all five.
Where ivory's dot and the other dot line up, the words agree on that dimension. Where they don't, they disagree. The cosine number at the bottom is just a summary of all that agreement, packaged as one number.
Notice what just happened: nobody computed an overall "are these similar?" score directly. Similarity emerged from agreement across many small dimensions, none of which alone said "you two are alike." That's the whole trick.
Two honest caveats before we scale up.

First, the dimensions aren't named by humans. "Is it a material?" was a label I picked to make the demo readable. Real models invent their dimensions during training. A few end up corresponding to human-namable concepts; most are tangled mixtures we can't clearly label.

Second, the values aren't 0-or-1. They're real numbers, anywhere from negative to positive. And the math underneath isn't "average the agreements" — for each dimension, you multiply ivory's value by silk's value, sum those products across all dimensions, then normalise by the vector lengths. The dots-and-gaps you see above track this loosely; the formula is the precise version.
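Spelled out in code, with invented values on the five demo slabs, that recipe is only a few lines:

```python
import numpy as np

# Invented per-dimension values for the five demo slabs
# (material, pale-coloured, smooth, animal, threatening).
ivory = np.array([0.9, 0.8, 0.7, 0.3, 0.1])
silk  = np.array([0.8, 0.7, 0.9, 0.1, 0.0])

products   = ivory * silk                  # multiply, dimension by dimension
agreement  = products.sum()                # sum the products
normaliser = np.linalg.norm(ivory) * np.linalg.norm(silk)
cosine     = agreement / normaliser        # normalise by the vector lengths

print(round(cosine, 3))   # ~0.97: one number summarising agreement across all five slabs
```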
The labels and switches are scaffolding. The geometry is real.
So how many dimensions do real models actually use? Way more than five.
Yes — GPT-3 places every word at a location with 12,288 coordinates. Why so many?
Because in 2D, cat can only be close to one or two things. In 12,288D, it can simultaneously be close to kitten along one direction, tiger along another, internet meme along a third, Egyptian symbol along a fourth — without those proximities crowding each other out. Each new dimension is another axis of nuance the word can occupy at once.
You can't picture this. Nobody can. But the math works the same way it does in 2D — distances and angles, just with longer formulas.
This is the thing that made researchers in 2013 stare at their screens. Because words are just coordinates now, you can do arithmetic on them — and the answers actually mean something.
The classic example. Walk through it on a toy 2D grid first.
That's the mechanism. A word is a list of numbers. Subtract two words to get a "direction." Add that direction to a third word and you land on a fourth.
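On an invented toy grid, the whole mechanism fits in a few lines:

```python
import numpy as np

# Invented 2D positions for the classic example.
king  = np.array([8.0, 7.0])
man   = np.array([3.0, 7.0])
woman = np.array([3.0, 2.0])
queen = np.array([8.0, 2.0])

direction = king - man           # subtract two words to get a "direction" (royalty)
landing   = woman + direction    # add that direction to a third word...

print(landing)                        # [8. 2.] ...and you land on the fourth
print(np.allclose(landing, queen))    # True, on this toy grid
```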
Same trick generalises:
Same trick, three different relationships. Royalty — the same direction takes man to woman and king to queen. That direction is gender. Capitals — the same direction takes France to Paris and Italy to Rome. That direction is "the capital of." Verb forms — the same direction takes walk to walking and swim to swimming. That direction is the -ing suffix.
Nobody told the model about gender, geography, or grammar. Concepts you'd assume have to be programmed in — they fell out as directions in space, just from reading.
The map has a problem. Consider bank:
Two totally unrelated meanings, but only one spot on the map for the word. Our system would wrongly say finance and rivers are related — they're both near "bank."
The fix: let the surrounding words pull the meaning of bank in one direction or the other. If the sentence contains money, bank shifts toward finance. If it contains river, bank drifts toward nature. This pulling is called attention, and it's the T in GPT.
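Here's the smallest sketch of that pulling I can write. The vectors are invented and everything learned in a real transformer (projection matrices, multiple heads, stacked layers) is stripped away, but the shape of the idea survives: score the context, soften the scores into weights, and blend.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextualise(word, context):
    """Stripped-down attention: score each context word by its dot product
    with `word`, turn the scores into weights, and blend the context
    vectors with those weights."""
    scores = np.array([np.dot(word, c) for c in context])
    weights = softmax(scores)
    return weights @ np.array(context)

# Invented 2D vectors: axis 0 is "finance-ish", axis 1 is "nature-ish".
bank  = np.array([0.5, 0.5])   # ambiguous on its own
money = np.array([0.9, 0.1])
river = np.array([0.1, 0.9])

print(contextualise(bank, [bank, money]))  # [0.7 0.3]: pulled toward finance
print(contextualise(bank, [bank, river]))  # [0.3 0.7]: pulled toward nature
```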
Remember the screenshot from the start? Same data. Same numbers. But now you have everything you need to read it like a sentence. Six guesses from yesterday's actual game, played in slow motion on a tilted plane — the secret ivory sits at the centre, every guess lands at its true cosine distance.
Once all six guesses are placed, the screenshot's three colour bands map directly onto cosine tiers. The teal at the top of the screenshot — silk, satin, white, marble — that's the hot ring. The yellow band — leopard, cheetah, fur — that's the warm ring. The pink band — jungle, cat, jaguar, predator — that's the cool-and-cold periphery. Three tiers of cosine, three colours of bar. The screenshot was already telling you the geometry; you just hadn't learned to read it.
Same twenty-six words you read at the start. Different person reading them.
Eight questions. Each one has at least two genuinely tempting answers — wrong-answer explanations are worth reading even when you got it right.
▶ Watch on YouTube