A field guide to embeddings

How language models
understand words

A tour through the geometry behind every modern AI that reads or writes — using a word puzzle you've probably already played.

01 The hook

You've played Contexto. You already get half of it.

Ever played Contexto? Each day the game picks one secret word, and you guess. Each guess returns a number — your rank. Lower is closer.

Here's a real game from yesterday. The secret word was ivory. Thirty-one guesses came in. Numbers ranged from 2 to 7,321.

CONTEXTO · 04/24/2026 · GUESSES: 31
beige 2 · silk 4 · satin 7 · white 14 · marble 27 · fur 210 · leopard 448 · cheetah 788 · jungle 1548 · camouflage 1596 · cat 3473 · jaguar 4592 · predator 7321
the secret word was ivory · numbers are real

A computer assigned every one of those numbers in milliseconds, and they're almost always defensible. Silk at rank 4. Predator at rank 7,321. Marble — apparently — at rank 27, beating the actual big cats. Why?

Here's how the game's own help page describes itself:

The algorithm analyzed thousands of texts. It uses the context in which words are used to calculate the similarity between them.
— contexto.me, /faq

Three phrases. Three things to understand. The rest of this page unpacks them, in that order.

Act I thousands of texts where the numbers come from
Act II the context in which words are used what "similar" looks like
Act III calculate the similarity how it becomes one number

By the end you'll be able to read that screenshot like a sentence. Let's go.

02 Act I — "thousands of texts"

Where do the numbers come from?

Nobody sits down and types in coordinates for every word. The model reads — Wikipedia, books, most of the internet — and notices which words tend to appear near which others.

Take the ivory board. Why does marble sit at rank 27, closer to the secret than cheetah at 788? Look at the company each word keeps:

ivory appears near: tusks cream pale smooth marble satin
marble appears near: stone veined pale smooth polished
cheetah appears near: leopard savanna fast predator fur
Shared company is the tell. Marble and ivory both keep pale and smooth nearby, and both live in fashion-and-material contexts; marble even shows up in ivory's own company. Cheetah shares essentially nothing with ivory: it lives with savannas, fur and predators. So marble lands closer than cheetah, even though only one of them is a material.

Do this over billions of words of text and the whole map of language arranges itself automatically. The model never sees a definition. It sees only co-occurrence — and that turns out to be enough.
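To make that concrete in code, here's a minimal sketch of co-occurrence counting. The corpus, window size, and names are invented for illustration; real systems scan billions of tokens, but the bookkeeping is the same idea.

```python
from collections import Counter

# A four-sentence toy corpus standing in for "thousands of texts".
corpus = [
    "ivory tusks are pale and smooth",
    "silk is a pale smooth fabric",
    "marble is a pale veined polished stone",
    "the cheetah is a fast predator of the savanna",
]

WINDOW = 4  # treat words within 4 tokens of each other as "co-occurring"

cooc = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for neighbour in tokens[max(0, i - WINDOW) : i + WINDOW + 1]:
            if neighbour != word:
                cooc[(word, neighbour)] += 1

# "pale" keeps company with ivory, silk and marble, but not with cheetah.
print(cooc[("pale", "smooth")])  # 2: they co-occur in two sentences
```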

You shall know a word by the company it keeps. — J. R. Firth, 1957

Linguists have been saying this for seventy years. Contexto's twenty-one-word self-description in 2026 is just a paraphrase of Firth's ten-word sentence from 1957.

Under the hood — how the model actually trains

Modern implementations come in two flavours. word2vec (Mikolov et al., 2013) trains a small neural network to predict context words from a target (skip-gram) or vice versa (CBOW); the network's hidden layer becomes the embedding. Transformer-based models (BERT, GPT) learn embeddings end-to-end as part of the model itself, via masked language modelling or causal language modelling. In every case the training signal reduces to: make the embedding useful for predicting surrounding or missing words. Geometric structure falls out as a side effect.
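As a concrete sketch of the skip-gram flavour, here's roughly what training looks like with the gensim library (a toy corpus, so the resulting vectors are meaningless; the call signature is gensim 4.x):

```python
from gensim.models import Word2Vec

# Each "text" is a list of tokens; real training feeds in billions of them.
sentences = [
    ["ivory", "tusks", "are", "pale", "and", "smooth"],
    ["silk", "is", "a", "pale", "smooth", "fabric"],
    ["the", "cheetah", "hunts", "on", "the", "savanna"],
]

# sg=1 selects skip-gram: predict the context words from the target word.
model = Word2Vec(sentences, vector_size=50, window=4, min_count=1, sg=1)

vector = model.wv["ivory"]             # the learned embedding: 50 floats
print(model.wv.most_similar("ivory"))  # nearest neighbours, by cosine
```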

03 Act II — "the context in which words are used"

If words had positions, they'd cluster.

Imagine drawing the result. Words that share lots of context get pulled close together. Words that don't, drift apart.

Here's what that looks like for a small set including the ivory board's words. Hover any word to see what the model thinks is closest to it. Stronger lines mean closer in the geometry.

Materials cluster top-left. Animals top-right. Drinks/dishes bottom-right. Nobody told the model what a "category" is.
About distance — what we mean by "close" here is logical, not physical. Two words being near each other on this map means they answer many of the model's implicit questions the same way: do you appear in fashion writing? do you describe textures? do you co-occur with the word "pattern"? We'll open up what those questions look like in section 7.
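How does a 50-dimensional (let alone 12,288-dimensional) cloud become a flat picture like the one above? One standard recipe is to project onto the two directions of greatest variance (PCA). A minimal sketch, using random stand-in vectors since we don't have this page's real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["ivory", "silk", "marble", "cheetah", "jaguar", "tea"]
embeddings = rng.normal(size=(len(words), 50))  # stand-ins for real vectors

# PCA via SVD: centre the cloud, keep the top two principal directions.
centred = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ vt[:2].T  # one (x, y) per word, ready to plot

for word, (x, y) in zip(words, coords_2d):
    print(f"{word:8s} x={x:+.2f}  y={y:+.2f}")
```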

What you're seeing is the geometric form of Firth's quote — context, translated into space. Now we just need to turn closeness into a number.

04 Act III — "calculate the similarity"

Closeness becomes one number.

Once words have positions, you can measure how close any two are. The standard measurement isn't straight-line distance — it's the angle between the arrows pointing at them. Or more precisely, the cosine of that angle. Let's read the ivory game in those terms:

Cosine similarity, on yesterday's ivory game
ivory ↔ silk · about 20° · cosine ≈ 0.94 → rank 4
ivory ↔ cheetah · about 57° · cosine ≈ 0.55 → rank 788
ivory ↔ predator · about 80° · cosine ≈ 0.18 → rank 7,321

Small angle, big cosine, low rank. Big angle, small cosine, high rank. That's the entire engine.

If you compute the cosine for every pair of words, you get a giant grid. Here's the ivory row of that grid, for our key words: silk 0.94 · cheetah 0.55 · predator 0.18.

The cosine is one number per pair. So how does predator → 7,321 happen?

  1. For ivory, the model multiplies ivory's coordinates by every other word's coordinates, dimension by dimension. Sums those products. Divides by the product of the two vectors' lengths. That's the cosine — one number per other word in the dictionary.
  2. It sorts that giant list of cosines, largest to smallest.
  3. Your guess's position in the sorted list is the rank.
predator → 7,321 means: there are exactly 7,320 other words in the dictionary with a higher cosine to ivory than predator has. Nothing more, nothing less.

So rank is not a magical similarity score. It's the cosine, sorted. Distances all the way down.
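The three steps above fit in a few lines of numpy. A sketch with a ten-word stand-in dictionary (names and vectors invented; a real model would load tens of thousands of trained vectors):

```python
import numpy as np

def cosine(a, b):
    # (A · B) / (||A|| ||B||): dot product over the product of lengths
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
vocab = [f"word{i}" for i in range(10)]           # toy 10-word dictionary
vectors = {w: rng.normal(size=50) for w in vocab}
secret = vectors["word0"]

# Steps 1-2: cosine of every word against the secret, sorted high to low.
ranked = sorted(vocab, key=lambda w: cosine(secret, vectors[w]), reverse=True)

# Step 3: a guess's rank is its position in that sorted list.
guess = "word7"
print(ranked.index(guess) + 1)  # rank k means k-1 words score higher
```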

Under the hood — the cosine formula, formally

Cosine similarity is (A · B) / (‖A‖ ‖B‖) — the dot product of two vectors divided by the product of their magnitudes. The result lives in [-1, 1], though for embedding vectors trained with typical objectives it usually ends up in [0, 1] in practice. Storing the full N×N similarity matrix would be wasteful — for a 50,000-word vocabulary that's 2.5 billion entries — so models store only the N×d embedding matrix and compute pairwise similarities on demand. The rank lookup is then a top-K search over the dot product against the target vector.

05 Watch it happen

A small Contexto, played in slow motion.

Below is a tiny replica of Contexto with a known secret word. Five guesses are queued. Click Try guess to play one — the geometry on the left shows where each guess lands relative to the secret, and the board on the right keeps the running rank.

CONTEXTO · demo · secret: hidden until the end
Geometry — distance from secret · Game board — sorted by rank
hot · cosine ≥ 0.85 · warm · 0.50–0.85 · cool · 0.30–0.50 · cold · < 0.30

Notice what just played out: the model wasn't checking categories ("is this tableware? a liquid?"). It was just measuring how much each guess shares a context-neighbourhood with the secret. cup wins not because cup is "near" tea conceptually, but because the phrase cup of tea is one of the most frequent word pairings in English text.

    06 Coda I — what is a position made of?

    Every word is a list of numbers.

    So far we've talked about words having "positions." Let's open one up.

    A position is just a list of numbers, where each number is loosely the word's answer to one question. To make this concrete, we'll ask cat five questions and watch its vector get filled in, slot by slot. Click play.


    Two honest caveats.

    First — cat isn't really a perfect 1 on "is it an animal?" In some sentences it's slang for a person ("hep cat"); occasionally it's a verb ("to cat about"). So the answers aren't really yes-or-no, they're degrees. The real values look more like [0.98, 0.96, 0.04, 0.92, 0.20] — mostly noun, mostly animal, almost-never a person, slightly-informal in tone.

    Second — real models don't use five questions. They use 12,288. And the questions aren't designed by hand; the model invents them during training, and most are entangled mixtures we can't easily name. Visualised, the full vector for cat looks more like:

    cat = [0.98, 0.96, 0.04, 0.92, 0.20, -0.13, 0.55, 0.31, -0.08, 0.42, 0.71, …, 0.06] ↑ 12,288 numbers in total

    The "questions" framing is a useful fiction. It builds the right intuition without being literally true.

    Under the hood — what the dimensions actually mean

    The technical name for these "lists of numbers" is embeddings. They're learned end-to-end as part of training: a randomly-initialised embedding matrix gets nudged via gradient descent so that vectors of words appearing in similar contexts end up close together. Some of the resulting dimensions turn out to be human-interpretable (gender, plurality, sentiment, formality), but most don't. Mechanistic interpretability is the active research programme trying to disentangle them, often by looking for "directions" in vector space that correspond to specific concepts.

    07 Coda II — how many dimensions?

    Two dimensions is a lie.

    Every map you've seen on this page has been flat. That's a teaching tool — useful, but a lie.

    Think of a photograph. A person standing in front of a mountain looks like they're right next to the mountain. The photo crushes depth — what was miles of distance becomes a couple of inches on paper. The mountain wasn't actually next to anyone; the camera just lost a whole axis of information.

    Words have the same problem. A 2D map can put two unrelated words next to each other by accident. Click below to see depth come back.

your pet · right here
mountain · ~1 mile
rising sun · miles away
All three layers look flat in the photo.

    So real word-distance isn't physical closeness on a flat map. It's logical agreement across many independent questions. Imagine the model has implicitly learned to answer a handful of questions about every word — is it a material? is it pale-coloured? is it smooth? is it an animal? is it threatening? — and a word's "position" is nothing more than its answers.

    Stack those questions like a deck of transparent slabs. Each slab is one dimension. A word lives on every slab at once, with one dot marking where it sits. Tap any comparison below — the slabs will peel apart so you can see ivory and a second word plotted across all five.

Five stacked dimension layers · logical distance across all 5 dimensions

    Where ivory's dot and the other dot line up, the words agree on that dimension. Where they don't, they disagree. The cosine number at the bottom is just a summary of all that agreement, packaged as one number.

    • silk → 0.97. Both materials, both pale-ish, both smooth, neither animal, neither threatening. Five strong agreements. This is what's behind silk landing at rank 4 in yesterday's ivory game.
    • cheetah → 0.65. Agrees on "pale," somewhat on "smooth" — but disagrees sharply on "is it an animal?" One big disagreement drags the overall number down. Hence rank 788.
    • predator → 0.28. Agrees with ivory on basically nothing. Cosine collapses, rank 7,321.

    Notice what just happened: nobody computed an overall "are these similar?" score directly. Similarity emerged from agreement across many small dimensions, none of which alone said "you two are alike." That's the whole trick.

    Two honest caveats before we scale up. First, the dimensions aren't named by humans. "Is it a material?" was a label I picked to make the demo readable. Real models invent their dimensions during training. A few end up corresponding to human-namable concepts; most are tangled mixtures we can't clearly label. Second, the values aren't 0-or-1. They're real numbers, often negative to positive. And the math underneath isn't "average the agreements" — for each dimension, you multiply ivory's value by silk's value, sum those products across all dimensions, then normalise by the vector lengths. The dots-and-gaps you see above track this loosely; the formula is the precise version.
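You can check the "agreement across dimensions" story numerically. Below, the five answer-values per word are invented (the demo above doesn't publish its exact numbers), but the ordering that falls out matches: silk high, cheetah middling, predator low.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented answers to the five demo questions, one slot per question:
# [material?, pale?, smooth?, animal?, threatening?]
ivory    = np.array([0.90, 0.80, 0.90, 0.10, 0.05])
silk     = np.array([0.95, 0.70, 0.95, 0.05, 0.05])
cheetah  = np.array([0.05, 0.60, 0.40, 0.95, 0.80])
predator = np.array([0.00, 0.10, 0.10, 0.70, 0.95])

for name, vec in [("silk", silk), ("cheetah", cheetah), ("predator", predator)]:
    print(f"ivory vs {name:8s} -> cosine {cosine(ivory, vec):.2f}")
# silk scores highest, predator lowest: agreement across many small
# dimensions, summarised as one number, exactly as described above.
```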

    The labels and switches are scaffolding. The geometry is real.

    So how many dimensions do real models actually use? Way more than five.

EMBEDDING DIMENSIONS · LOG SCALE
2 · this page
3 · our world
50 · early word2vec
300 · classic word2vec
768 · BERT
12,288 · GPT-3

    Yes — GPT-3 places every word at a location with 12,288 coordinates. Why so many?

    Because in 2D, cat can only be close to one or two things. In 12,288D, it can simultaneously be close to kitten along one direction, tiger along another, internet meme along a third, Egyptian symbol along a fourth — without those proximities crowding each other out. Each new dimension is another axis of nuance the word can occupy at once.

    You can't picture this. Nobody can. But the math works the same way it does in 2D — distances and angles, just with longer formulas.

    Under the hood — why high dimensions matter

This is why high-dimensional spaces are useful: in sufficiently many dimensions, almost all pairs of randomly-drawn unit vectors are nearly orthogonal. That gives the model an enormous amount of "room" to encode independent features without them interfering. The Johnson-Lindenstrauss lemma makes this precise: the number of nearly-orthogonal directions you can pack into a d-dimensional space grows exponentially with d.
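You can watch near-orthogonality kick in with a quick experiment: draw random vector pairs at each dimensionality from the chart above and average the size of their cosines.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_abs_cosine(dim, pairs=1000):
    # Average |cosine| over random vector pairs in `dim` dimensions.
    a = rng.normal(size=(pairs, dim))
    b = rng.normal(size=(pairs, dim))
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return np.abs(cos).mean()

for dim in [2, 50, 300, 768, 12288]:
    print(f"{dim:>6} dims: mean |cosine| ~ {mean_abs_cosine(dim):.3f}")
# 2 dims: ~0.64, random pairs are often strongly aligned.
# 12,288 dims: ~0.007, almost every pair is nearly orthogonal.
```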

    08 Coda III — a party trick

    And now, you can do math.

    This is the thing that made researchers in 2013 stare at their screens. Because words are just coordinates now, you can do arithmetic on them — and the answers actually mean something.

    The classic example. Walk through it on a toy 2D grid first.

Paris − France + Italy = ?

    That's the mechanism. A word is a list of numbers. Subtract two words to get a "direction." Add that direction to a third word and you land on a fourth.

    Same trick generalises:

Same trick, three different relationships.
• Royalty — the same direction takes man to woman and king to queen. That direction is gender.
• Capitals — the same direction takes France to Paris and Italy to Rome. That direction is "the capital of."
• Verb forms — the same direction takes walk to walking and swim to swimming. That direction is the -ing suffix.

    Nobody told the model about gender, geography, or grammar. Concepts you'd assume have to be programmed in — they fell out as directions in space, just from reading.
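If you want to try the arithmetic yourself, it's a few lines with gensim's downloadable pretrained vectors. The dataset name below is one of gensim's standard bundles (~66 MB, lowercase tokens); results vary slightly by vector set, but queen and rome typically top the lists.

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe word vectors.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman: subtract and add whole vectors, then ask which
# word in the vocabulary lies closest to the resulting point.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# paris - france + italy: the "capital of" direction, reused.
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```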

    09 Coda IV — what embeddings can't do

    One word, two meanings.

The map has a problem. Consider bank: the financial institution, and the edge of a river.

Two totally unrelated meanings, but only one spot on the map for the word. Our system would wrongly say finance and rivers are related — they're both near "bank."

The fix: let the surrounding words pull the meaning of bank in one direction or the other. If the sentence contains money, bank shifts toward finance. If it contains river, bank drifts toward nature. This pulling is called attention, and it's the heart of the transformer — the T in GPT.

    Under the hood — self-attention as Q/K/V

    Self-attention works like this: each token emits three vectors — a query, a key, and a value. The dot product of one token's query with every other token's key (passed through softmax) gives a set of attention weights. Those weights then blend the other tokens' value vectors into a new representation for the current token. So "bank" near "river" attends strongly to "river" and absorbs some of its meaning into its updated vector. Stack this across many layers, run several attention "heads" in parallel, and you have a transformer. The canonical paper is Attention Is All You Need (Vaswani et al., 2017).
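In code, the whole Q/K/V dance is a few matrix multiplications. A single-head sketch in numpy (the weight matrices are random stand-ins here; in a trained transformer they're learned):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """One attention head over a sequence of token vectors x (tokens × dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # every query against every key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ v  # each token's new vector blends the others' values

# Toy setup: 3 tokens (say "bank", "of", "river"), embedding dimension 4.
rng = np.random.default_rng(3)
x = rng.normal(size=(3, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))

out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (3, 4): same shape in, same shape out, meanings mixed
```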

    10 Full circle

    Now read the ivory board.

    Remember the screenshot from the start? Same data. Same numbers. But now you have everything you need to read it like a sentence. Six guesses from yesterday's actual game, played in slow motion on a tilted plane — the secret ivory sits at the centre, every guess lands at its true cosine distance.

CONTEXTO · 04/24/2026 · real game · secret: ivory
Geometry — secret in the centre, distance = cosine
hot · cosine ≥ 0.85 · warm · 0.50–0.85 · cool · 0.30–0.50 · cold · < 0.30

    Once all six guesses are placed, the screenshot's three colour bands map directly onto cosine tiers. The teal at the top of the screenshot — silk, satin, white, marble — that's the hot ring. The yellow band — leopard, cheetah, fur — that's the warm ring. The pink band — jungle, cat, jaguar, predator — that's the cool-and-cold periphery. Three tiers of cosine, three colours of bar. The screenshot was already telling you the geometry; you just hadn't learned to read it.

    "thousands of texts" training: Wikipedia, books, the web
    "the context in which words are used" co-occurrence → vectors → positions
    "calculate the similarity" cosine of the angle between two vectors

Same twenty-one words you read at the start. Different person reading them.

    11 Test yourself

    Did any of this stick?

Eight questions. Each one has at least two genuinely tempting answers, and the wrong-answer explanations are worth reading even when you get it right.

    12 Further reading

    Where to go from here.

    [1]
    But what is a GPT? — visual intro to transformers
    3Blue1Brown · YouTube · 2024
    The best visual explanation of how attention and embeddings work together inside a real model. Watch this next.
    [2]
    Contexto
    contexto.me
The word puzzle this whole page is built around. Try yesterday's ivory puzzle if you're curious.
    [3]
A synopsis of linguistic theory, 1930–1955
    J. R. Firth · 1957
    Source of "you shall know a word by the company it keeps" — the seventy-year-old paper whose central idea Contexto inherits.
    [4]
    Efficient estimation of word representations in vector space
    Mikolov, Chen, Corrado, Dean · 2013
The word2vec paper. The first time embedding-based word arithmetic (king − man + woman = queen) was demonstrated.
    [5]
    Attention is all you need
    Vaswani et al. · 2017
The transformer paper. Where the T in GPT comes from.
    [6]
    The illustrated transformer
    Jay Alammar · blog post
A friendly walk-through of how attention actually works inside a single layer.