Primer Series · LLM Architecture
⚠️ DRAFT — Unreviewed. May contain inaccuracies or oversimplifications. Don't trust this without verification.

Visualizing an LLM: How Language Models Think

From random numbers to emergent intelligence: a visual guide to what's actually inside these things.

10 minute read · Interactive

Chapter 1

A Pile of Random Numbers

Before training, a language model is nothing special. It's a massive collection of numbers, initialized randomly. Billions of them, organized into matrices, waiting to be shaped.

Think of it as clay before the sculptor touches it. The structure is there (the architecture), but the values are meaningless noise.
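If you like code, here is roughly what that "pile of numbers" looks like - a minimal NumPy sketch with toy sizes (a real model has billions of parameters, not thousands): the architecture fixes the shapes, and the values start as random draws.

import numpy as np

rng = np.random.default_rng(0)

# Toy "architecture": an embedding table plus two weight matrices per layer.
vocab_size, d_model, n_layers = 1000, 64, 4

embedding = rng.normal(0, 0.02, size=(vocab_size, d_model))
layers = [
    {
        "w_in": rng.normal(0, 0.02, size=(d_model, 4 * d_model)),   # MLP expands each vector
        "w_out": rng.normal(0, 0.02, size=(4 * d_model, d_model)),  # MLP projects it back
    }
    for _ in range(n_layers)
]

n_params = embedding.size + sum(w.size for layer in layers for w in layer.values())
print(f"{n_params:,} randomly initialized parameters")  # structure is there, values are noise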

Interactive: Watch Training Happen

Words start randomly scattered. Click "Train" to watch them organize into meaningful relationships.

[Interactive scatter plot: dimension 1 vs. dimension 2, with the words king, queen, man, woman, Paris, France, Berlin, Germany scattered at random. Epoch: 0 (untrained).]

Each word becomes a vector, a list of numbers representing its position in a high-dimensional space. Training adjusts these positions so that related words cluster together.

The magic of training (gradient descent, with gradients computed by backpropagation) is that the model learns to position words so that their relationships become geometric. King minus man plus woman lands near queen. Not because anyone programmed that relationship, but because it emerged from reading billions of sentences.

Interactive: Vector Arithmetic

Relationships between words become mathematical operations. Try different analogies.

king - man + woman ≈ queen

The "royalty" component stays, the "male" component is removed, and the "female" component is added.

Chapter 2

Inside the MLP: Weights and Activations

Each layer contains millions of weights - fixed numbers that define the strength of connections between neurons. Think of weights as the permanent wiring of the network. They don't change during inference; they're set during training and stay constant.

What does change with each input are the activations - the signals that flow through the network. When you feed in "cat," certain neurons light up strongly (high activation), while others stay quiet (near-zero activation). The same weights produce completely different activation patterns for "democracy" versus "quantum."

Interactive: Neuron Activations

Click different words to see which neurons activate. The weights stay the same - the activations change.

Each cell represents a neuron. Orange = strongly activated. The underlying weights are always present; only the activations are sparse.

This sparsity of activations is important. For any given input, most neurons produce near-zero output. The model develops specialized "detectors" - some neurons respond to concrete nouns, others to abstract ideas, others to grammatical structures. But each neuron might respond to multiple unrelated things (this is called "polysemanticity"), which makes interpretation tricky.

Key distinction: Weights are the fixed parameters learned during training. Activations are the dynamic signals that depend on the input. Weights define what the network can compute; activations are what it does compute for a specific input.
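A toy illustration of the distinction, with made-up sizes and random stand-in weights: the matrices are created once and reused for every input, while the activations are recomputed each time.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32

# Weights: fixed after training. Here they are just random stand-ins.
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_hidden, d_model))

def mlp(x):
    hidden = np.maximum(0, x @ W1)   # ReLU: negative pre-activations become exactly zero
    return hidden @ W2, hidden

for word in ["cat", "democracy"]:
    x = rng.normal(size=d_model)                 # stand-in embedding for this word
    _, hidden = mlp(x)
    active = int((hidden > 0).sum())
    print(f"{word!r}: {active}/{d_hidden} neurons active")   # same W1, W2; different pattern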

Chapter 3

Attention: How Words Talk to Each Other

So far we've seen how individual words become vectors and how layers transform them. But there's a crucial missing piece: how does the model handle context? How does it know that "bank" means something different in "river bank" versus "bank account"?

The answer is self-attention - a mechanism that lets each word look at every other word in the sentence and decide which ones are relevant.

Interactive: Attention in Action

Click on a word to see which other words it attends to. Stronger lines = more attention.

The cat sat by the river bank

Click a word to see its attention pattern

Think of it like a group chat where everyone can message everyone else. Before processing "bank," the model sends out queries: "Who has relevant context for me?" The words "river" and maybe "sat" respond strongly. This contextual mixing happens before the MLP layer does its transformation.
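Here is a bare-bones sketch of that group chat: single-head scaled dot-product attention over stand-in vectors, leaving out the learned query/key/value projections and the multi-head machinery a real transformer uses.

import numpy as np

def attention(Q, K, V):
    # Every word scores every other word; softmax turns scores into attention weights.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each word's new vector is a weighted mix of every word's value vector.
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "by", "the", "river", "bank"]
X = rng.normal(size=(len(tokens), 16))   # stand-in embeddings (a real model learns these)

mixed, weights = attention(X, X, X)      # real models first project X into separate Q, K, V
for token, w in zip(tokens, weights[-1]):
    print(f"{token:>6}  {w:.2f}")        # how strongly "bank" attends to each word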

Attention is what makes transformers so powerful at understanding language. Without it, each word would be processed in isolation, blind to its neighbors. With it, the model can resolve ambiguity, track references ("it" refers to what?), and build rich contextual representations.

Each transformer layer actually has two parts: first an attention mechanism (words mixing information), then an MLP (transforming each word's now-enriched vector). They alternate throughout the network.

Chapter 4

Stacking Layers

A single attention+MLP block can only do limited transformations. The power comes from stacking dozens of them.

Each "layer" in a transformer is really two operations: first attention (words sharing context), then an MLP (transforming the enriched vectors). This pattern repeats 30, 50, even 100+ times in modern models.

Interactive: Layer Depth

Adjust the slider to see how layers stack. Many modern LLMs have 80+ layers.

Layers: 1

Information flows from top (input) to bottom (output). Each layer = attention + MLP.

Early layers tend to learn basic features: syntax, simple word associations, local patterns. Middle layers build more abstract representations - combining concepts, resolving ambiguities. Later layers handle high-level reasoning and prepare the final output.

Each layer builds on everything before it. By the time a token reaches the final layer, it has been transformed dozens of times, with each word having "talked to" every other word at multiple levels of abstraction.

Chapter 5

Inference: The Forward Pass

When you send a prompt, it flows through every layer in sequence. This is called the "forward pass." Each layer transforms the representation, and the final layer produces a probability distribution over all possible next tokens.
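Schematically, the end of that forward pass looks like this (toy vocabulary, random stand-in numbers): the final vector is projected to one score per token in the vocabulary, and a softmax turns those scores into probabilities.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["world", "there", "!", "friend"]        # a toy 4-token vocabulary

final_hidden = rng.normal(size=16)               # the last token's vector after all layers
unembed = rng.normal(size=(16, len(vocab)))      # projection back into vocabulary space

logits = final_hidden @ unembed                  # one raw score per candidate next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: scores -> probabilities

for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token:>8}  {p:.2f}")
# The model samples (or picks) the next token from this distribution, appends it to the
# input, and runs the whole forward pass again for the token after that.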

Interactive: Watch Inference

See a token flow through layers to produce a prediction.

[Interactive diagram: the token "Hello" flows through Layer 1 (embed) → Layer 2 → Layer 3 → Layer 4 → Layer 5 (output) and comes out as the prediction "world".]

The entire process takes milliseconds per token: billions of multiplications, running in parallel on GPUs. To produce a full response, the model repeats this forward pass for each new token, feeding every prediction back in as part of the input.

Chapter 6

Peeking Inside: Interpretability

We've seen that neuron activations are sparse - most are near-zero for any given input. But what do those activations mean? This is where mechanistic interpretability comes in.

There's a catch: individual neurons are often "polysemantic" - they activate for multiple unrelated concepts. One neuron might fire for both "legal documents" and "breakfast cereals." This makes direct interpretation difficult.

One solution: train a second, simpler model (a Sparse Autoencoder) to "interpret" the first model's activations. The SAE learns to reconstruct the original activations using a much larger set of features, but with a sparsity constraint - only a few features can be active at once. This forces it to untangle the polysemantic mess into cleaner, more interpretable features.
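A minimal sketch of that setup in NumPy, with toy dimensions and without the training loop: an encoder into a much wider feature space, a decoder back, and a loss that rewards faithful reconstruction while penalizing how many features are active at once.

import numpy as np

rng = np.random.default_rng(0)
d_act, d_feat = 32, 256                  # the feature dictionary is much wider than the activations

# Stand-in for activations captured from one layer of the main model.
acts = rng.normal(size=(1024, d_act))

# The sparse autoencoder: a linear encoder with ReLU, and a linear decoder.
W_enc, b_enc = rng.normal(size=(d_act, d_feat)) * 0.1, np.zeros(d_feat)
W_dec, b_dec = rng.normal(size=(d_feat, d_act)) * 0.1, np.zeros(d_act)

features = np.maximum(0, acts @ W_enc + b_enc)    # encode: one activation per feature
recon = features @ W_dec + b_dec                  # decode: reconstruct the original activations

l1 = 1e-3                                         # sparsity coefficient (a tunable knob)
loss = ((recon - acts) ** 2).mean() + l1 * np.abs(features).mean()
print(f"loss before training: {loss:.3f}")

# Training (not shown) minimizes this loss with gradient descent. The first term rewards
# reconstruction; the second punishes having many features active at once, pushing the SAE
# to explain each activation vector with just a few features.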

Interactive: Learning to Interpret

A smaller model trains on the larger model's activations, learning sparse feature dictionaries.

[Interactive diagram: activations from a target layer of the Main Model feed a Sparse Autoencoder, which encodes them into a set of unlabeled features and decodes them back. Click "Train" to watch the interpreter learn features.]

After training, the sparse autoencoder's features often correspond to interpretable concepts: "is this about science?", "is the sentiment negative?", "is this code?". These aren't programmed - they emerge from learning to compress and reconstruct the main model's activations.

This is an active research area. The hope is that by understanding what features a model uses, we can better predict its behavior, catch problems, and build more trustworthy systems.

Chapter 7

Why This Matters

Understanding the architecture helps you understand both the capabilities and limitations of these models.

The vectors explain analogical reasoning. If king - man + woman ≈ queen works, it's because the training process discovered that relationship geometrically. Whether this is fundamentally different from how humans represent concepts is an open question - we might work more similarly than we assume.

Attention explains context-sensitivity. The model isn't just pattern-matching on individual words - it's dynamically routing information between words based on relevance. This is why "bank" means different things in different sentences.

The layers explain capability scaling. More layers mean more transformation steps, more opportunity to build abstract representations, more capacity for complex reasoning. This is part of why bigger models can do things smaller ones can't - not just "more of the same" but qualitatively new capabilities.

Sparse activations enable interpretability. Because only a fraction of neurons activate for any given input, we can study those activation patterns. And with tools like Sparse Autoencoders, we can start to untangle what the model has learned - even when individual neurons are polysemantic.
