‘How I used an LLM to learn about an LLM’
First things first: I’ve been using LLMs for about a year now — and I absolutely love them. I use ChatGPT every single day, and in that time I’ve applied it to almost everything: from learning and configuring Proxmox, to making the final switch from Windows to Linux, to picking up the basics of Ansible, solving a persistent JWT token issue, working through my weight-loss journey — and so much more.
And truth be told: I had some idea of how an LLM worked, but I didn’t really understand what “training” meant — or how an LLM actually arrives at an answer. Last night, I asked (in Dutch): “Vraag; Zou je meteen relationele database, zoals Oracle DB, een LLM op kunnen zetten?” Roughly translated: “Could you use a relational database, like Oracle DB, to build a model that represents how an LLM works?”
And from that one question, the real journey began.
Relational database?
Why this specific question, and not a more generic “how does an LLM work?” The answer is simple: I know relational databases really, really well. I’ve been working with Oracle for nearly three decades. I’ve written countless SQL queries, designed dozens of database models from scratch, and authored more PL/SQL code than any other language. I can visualize all of it clearly in my mind — which makes it the perfect starting point to attach new knowledge to something familiar.
ChatGPT had the perfect answer. It created a few example tables to show how you could model the basics of an LLM in a relational way — and then immediately pointed out that this would be a terrible way to actually implement one. It also stated this:
“You’re building a bridge between your familiar world (relational logic, Oracle, PL/SQL) and the abstract world of LLMs (vectors, layers, attention). It’s the perfect starting point if you really want to understand how it all works.”
It created four simple tables:
- Tokens: ID, TOKEN_TEXT
- Token_embeddings: TOKEN_ID, [dim1, dim2, dim3, dim4... dimN] (a vector)
- Dense_weights: FROM_DIM, TO_DIM, WEIGHT
- Dense_bias: TO_DIM, BIAS
And that was it.
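To make that a bit more concrete, here is roughly how those four “tables” could look as plain Python structures. Every name and number below is made up; it is just the same idea in a different notation:

```python
# A toy version of the four "tables", using a 4-dimensional embedding space.
# All values are illustrative, not taken from any real model.

tokens = {1: "THE", 2: "CAT", 3: "LITTER", 4: "BOX"}        # TOKENS (ID, TOKEN_TEXT)

token_embeddings = {                                        # TOKEN_EMBEDDINGS (TOKEN_ID, vector)
    1: [0.10, -0.30,  0.80,  0.05],
    2: [0.90,  0.20, -0.10,  0.40],
    3: [0.85,  0.15, -0.05,  0.35],
    4: [0.80,  0.25, -0.15,  0.30],
}

dense_weights = [                                           # DENSE_WEIGHTS (FROM_DIM, TO_DIM, WEIGHT)
    (0, 0, 0.5), (0, 1, -0.2),
    (1, 0, 0.1), (1, 1,  0.7),
    # ... one row for every (from_dim, to_dim) pair
]

dense_bias = [(0, 0.01), (1, -0.02), (2, 0.03), (3, 0.00)]  # DENSE_BIAS (TO_DIM, BIAS)
```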
And suddenly — everything started to click into place. I actually got goosebumps.
Every single token (word) is a vector in a multidimensional space. And the vector itself? Completely meaningless.
And somehow, it also clicked into place that attaching a specific vector to a token — permanently — is actually wrong, at least to a certain extent. Only in the very last step is a vector converted back into a token. Before that, it’s all just… directions in space.
And the key to it all? The DENSE_WEIGHTS “table”. This defines the relationship between two vectors — not in terms of mass or strength like you might expect, but as a value used in a vector transformation.
The weight itself is meaningless in isolation. But when applied through vector multiplication between a FROM and TO vector (remember: two words), it gives you a new vector — one that represents the most likely next step in meaning.
Or put more simply:
The power of an LLM isn’t in the words themselves, but in the relationships between them.
And with the DENSE_BIAS “table”, you can nudge the output vector toward a more specific point in space — and remember, every single word lives in that space as a vector.
Note: No, there’s no actual SQL table with weights in a real LLM. (Sorry, fellow PL/SQL fans.) These “dense_weights” are just a simplified analogy — in practice, LLMs use massive matrices and pass each token through dozens of dense layers, each with its own set of learned weights and biases. But hey, don’t let that ruin the fun of thinking about neural networks in SELECT * FROM style. 😛
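For the curious: collapse the row-per-weight “table” back into the matrix it really is, and a single dense layer boils down to one matrix multiplication plus a bias. A minimal numpy sketch, with a made-up 4-dimensional space and random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 4                          # real models use 768 or even 12,288
x = rng.normal(size=dim)         # the incoming token vector
W = rng.normal(size=(dim, dim))  # DENSE_WEIGHTS as one matrix: W[from_dim, to_dim]
b = rng.normal(size=dim)         # DENSE_BIAS as one vector

y = x @ W + b                    # the "vector transformation": one dense layer
print(y)                         # a new vector, pointing somewhere else in the space
```

Stack a few dozen of those layers (plus attention, which I'm skipping here) and you are surprisingly close to the real shape of the thing.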
Vectors
I’ve used the word vector a few times now, so let’s break that down. The simplest way to describe a vector is to imagine it as an arrow — pointing from a starting point to an endpoint in space. Picture a 2D chart with an X and Y axis: a vector [2, 2] simply means “go two units right and two units up” from the origin [0, 0].
Mindfuck, thinking in X dimensions
Most of us can still picture a 2D chart, like the one above. Imagining a 3D world with X, Y, and Z axes? Still doable. Thinking in 4D? Okay, that’s where it starts to get weird — do I need to mention a Tesseract here? But what happens when you’re dealing with 768 or even 12,288 dimensions?
The vectors I mentioned earlier? They use way more than just 2 or 3 dimensions. Some models operate in 768-dimensional space, while modern LLMs go all the way up to 12,288 dimensions — that’s 16 times as many.
There’s absolutely no way to visualize that — at least, I sure can’t. 😛
But here’s the thing — the math doesn’t really change.
In 2D, it’s [2, 2].
In 3D, it’s [2, 2, 2].
In 768D? It’s just [2, 2, 2, 2, 2... 2].
Still just an array of numbers.
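And because it really is just an array of numbers, everything you can do in 2D works unchanged in 768D. A quick numpy sketch:

```python
import numpy as np

v2   = np.array([2.0, 2.0])        # 2D
v3   = np.array([2.0, 2.0, 2.0])   # 3D
v768 = np.full(768, 2.0)           # 768D: same idea, just a longer array

for v in (v2, v3, v768):
    length = np.linalg.norm(v)     # "how long is the arrow?"
    print(v.shape, round(float(length), 3))
```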
And here’s the best part: You don’t need to know the math to understand how an LLM works. That’s the real secret.
Training
Now imagine every single word you know — including all the typos you’ve ever made. Now picture that each of those words has its own vector in that insane multidimensional space. Words are everywhere in that space. Or to put it differently: imagine our entire universe. Galaxies. Stars. Planets. And every single one of them? A word. And every typo ever made? Just another moon orbiting one of those planets.
But how does an LLM get to this structure — this entire cosmos of galaxies, stars, planets, moons… and Pluto, of course? That’s where the training of an LLM comes into play.
To stick with the universe analogy: Let’s imagine a completely empty universe. There are no stars, no planets — no words. Just void.
Then the vocabulary is loaded. All the words are pulled into this universe, and every single one of them gets assigned a completely random vector.
And yes — you read that correctly: random. Not only does every token get a random vector, but random weights are also assigned between those vectors. So after loading the vocabulary, both the positions of the words and the relationships between them are completely random.
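In code, that “empty universe, then random everything” step could look something like this. It is a toy vocabulary and a tiny embedding size; real models use tens of thousands of tokens and hundreds or thousands of dimensions:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["THE", "CAT", "ALWAYS", "GOES", "TO", "LITTER", "BOX", "PARTYPOOPER"]
dim = 8                                            # toy embedding size

# Every token gets a completely random vector...
embeddings = {tok: rng.normal(size=dim) for tok in vocab}

# ...and the weights (and biases) between dimensions start out random too.
W = rng.normal(size=(dim, dim))
b = np.zeros(dim)
```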
Then the training starts. All the text is fed back into the system — but now, the algorithm tries to predict the next word in a sentence. Take this example: “The cat always goes to the litterbox.” The algorithm starts by trying to predict the next word after “THE”.
And I’ll keep it simple from here on. The algorithm remembers the previous context from the sentence. We, as humans, know that “litterbox” is the most predictable outcome. But here’s the catch: the links between "THE", "CAT", "ALWAYS", "GOES" and "TO" are still completely random at this point.
Before we dive deeper into the training phase, let me first show you what a trained LLM actually looks like — just to give you a feel for the destination.
Cat & Litterbox
Imagine the sentence again: “The cat always goes to the litterbox.” This is where tokenization comes into play — it’s the process of breaking a sentence into its most basic components. In this case, it becomes: [THE, CAT, ALWAYS, GOES, TO, THE, LITTER, BOX]
Notice how even a single word like “litterbox” can be broken down into multiple tokens: “litter” and “box”.
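Real tokenizers use learned subword rules (byte-pair encoding and friends), but a toy greedy version already shows the idea, including “litterbox” falling apart into two pieces. The little vocabulary below is hand-picked for this one sentence:

```python
# Toy tokenizer: greedily match the longest known piece from the front of each word.
# The vocabulary is hand-picked for the example; real vocabularies are learned.
VOCAB = ["THE", "CAT", "ALWAYS", "GOES", "TO", "LITTER", "BOX"]

def tokenize(sentence: str) -> list[str]:
    tokens = []
    for word in sentence.upper().replace(".", "").split():
        while word:
            piece = next((v for v in sorted(VOCAB, key=len, reverse=True)
                          if word.startswith(v)), None)
            if piece is None:          # unknown text: keep it whole and move on
                tokens.append(word)
                break
            tokens.append(piece)
            word = word[len(piece):]
    return tokens

print(tokenize("The cat always goes to the litterbox."))
# ['THE', 'CAT', 'ALWAYS', 'GOES', 'TO', 'THE', 'LITTER', 'BOX']
```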
Every single token has its own vector in that insane, multidimensional, trained space. Now imagine the link between "CAT" and "THE" — remember: WEIGHT.
In table form, it might look like this:
| FROM_DIM | TO_DIM | WEIGHT |
| -------- | ------ | ------ |
| “CAT”    | “THE”  | 80     |
I’ve used tokens here for readability, but don’t be fooled: these are still vectors. There’s a relationship between the two vector representations, and when you apply the weight (in this case, 80) to the FROM_DIM vector, you get a new vector — through a vector multiplication.
That new vector now points somewhere in the vector space. And odds are, it’s very close to the vector associated with… “LITTERBOX”.
So when an LLM is fully trained, every word has its own specific vector — and similar words end up in the same general area of vector space. Even typos!
After training, you get galaxies of related meanings, stars representing more specific clusters of words, and typos? You can literally think of them as moons orbiting their correct word — that’s how closely related they are.
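A real LLM scores every token in the vocabulary through its final output layer, but the “closest word in the space wins” intuition can be sketched with a simple similarity lookup. The vectors below are fabricated so that the typo really does sit right next to its correct spelling:

```python
import numpy as np

rng = np.random.default_rng(7)
dim = 8

# A toy "trained" space. The typo LITERBOX is placed deliberately close to LITTERBOX.
embeddings = {
    "LITTERBOX":   rng.normal(size=dim),
    "PARTYPOOPER": rng.normal(size=dim),
    "SOFA":        rng.normal(size=dim),
}
embeddings["LITERBOX"] = embeddings["LITTERBOX"] + 0.05 * rng.normal(size=dim)

def score(a, b):
    # Dot product: higher means the two arrows point in a more similar direction.
    return float(a @ b)

# Pretend this vector came out of the weight/bias transformation of the context:
predicted = embeddings["LITTERBOX"] + 0.1 * rng.normal(size=dim)

scores = {tok: score(predicted, vec) for tok, vec in embeddings.items()}
print(max(scores, key=scores.get))   # LITTERBOX, or its typo-moon LITERBOX
```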
Training part 2
Now, back to the training. How do we get from that chaotic, random mess… to a structured universe of meaning?
The weight table is still random. So when new vectors are calculated for CAT, ALWAYS, TO, and THE, we’d expect to land somewhere near the vector for LITTERBOX. But after the vector multiplication? We end up with: PARTYPOOPER.
The training algorithm sees that the next word in the sentence should have been LITTERBOX — but the model predicted PARTYPOOPER. So the prediction is wrong.
And now… the magic happens.
The vectors for CAT, ALWAYS, TO, and THE are each nudged — just slightly — within the universe. The weights between those vectors are also adjusted by tiny fractions. (Remember: we’re still talking about tokens — about words!)
There’s quite a bit of intricate math involved — you can absolutely look it up if you’re into that sort of thing. But it’s outside the scope of this blog post :P. Just imagine: Q, K, and V matrix multiplications quietly doing their thing in the background…
But here’s the beautiful part:
Every single relationship between words causes stars, planets, and moons to shift — just a little — across the entire universe. And if you apply enough of those tiny shifts…
you end up with a universe.
And that’s exactly why an LLM needs to be trained on an insane amount of data.
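The real thing works out those nudges with the intricate math mentioned above, applied through many layers, but the “nudge everything a tiny bit toward the right answer, millions of times” loop can be sketched with the toy setup from before. Everything here is deliberately oversimplified; the context is just averaged instead of run through attention:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, lr = 8, 0.01                                 # tiny space, tiny learning rate

vocab = ["THE", "CAT", "ALWAYS", "GOES", "TO", "LITTER", "BOX", "PARTYPOOPER"]
emb = {t: rng.normal(size=dim) for t in vocab}    # the random universe
W = rng.normal(size=(dim, dim))                   # the random weights

context = ["THE", "CAT", "ALWAYS", "GOES", "TO", "THE"]
target = emb["LITTER"].copy()                     # the next token: the "litter" piece of "litterbox"

for step in range(1000):
    ctx = np.mean([emb[t] for t in context], axis=0)   # crude stand-in for attention
    pred = ctx @ W                                     # the model's guess, as a vector
    error = pred - target                              # how far off are we?
    W -= lr * np.outer(ctx, error)                     # nudge the weights a tiny bit
    for t in set(context):                             # nudge the context vectors too
        emb[t] -= lr * (error @ W.T) / len(context)

print(np.linalg.norm(ctx @ W - target))   # the error has shrunk toward 0
```

Run it and the prediction error shrinks step by step; scale the same idea up to billions of weights and an insane amount of text and you get the universe from the analogy.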
Simplified
This is, of course, a very simplified explanation of how an LLM works. I’ve deliberately left out a lot of complexity.
But just try to imagine our universe again. Every single dot is a word. Every word is connected to every other word. And between any two words, there’s a weight — a value we can use to do a calculation.
That calculation gives us a vector, which points us to another location in the universe — where the most probable next word will be.
I haven’t even mentioned backpropagation, cosine similarity, or the whole Query–Key–Value (Q, K, V) mechanism and how it’s used. I haven’t touched on how layers work, or what it means to operate in a space with 768 or even 12,288 dimensions.
And still — even just dropping those phrases feels like I’m keeping things simple.
Trust me, I’m still nowhere near being an LLM expert. But the journey ChatGPT took me on — last night and again this morning — gave me just enough clarity to actually see how an LLM works. And honestly? That journey alone was more than interesting enough to write an article about.
When translated into a 3D world, I could see the words floating in that space. I could see how the relationship between two words forms a new vector — one that points toward the most probable set of words that might follow.
That realization alone was a complete eye-opener.
I hope this article gives you a bit of insight into how an LLM works — and that next time you use ChatGPT or another model, you’ll know just a little more about what’s going on under the hood.
Disclaimer: This blog post was proudly co-written, edited, sanity-checked and vector-approved by ChatGPT. 😛
Got thoughts, questions, or an inspired vector of your own? Drop a comment below — I’d love to hear it.
Brain out!