Quick Recap
In our last post, "Demystifying Machine Learning," we unpacked the complexities of how computer vision systems recognize images, discern patterns, and analyze video streams. Our journey illustrated the similarities between machine learning and human perceptual learning, providing a foundation for understanding these advanced technologies.
Today, we extend our exploration into the domain of text processing. In this post, we will uncover the workings of Large Language Models (LLMs), digging into how these models learn to process and generate text. By the end of this discussion, you'll have a good understanding of how LLMs learn from vast amounts of text and the mechanisms that enable them to make informed inferences.
The Rise of Large Language Models (LLMs) and the ChatGPT Phenomenon
Large Language Models like ChatGPT are shaking up the AI scene by showing us how machines can understand and mimic human language. These models, trained on truckloads of text, aren't just smart—they're versatile, handling everything from crafting articles to solving customer queries.
Why all the buzz about ChatGPT, though? It's simple: ChatGPT can chat like a pro. This isn't just about choosing words; it's about understanding context, humor, and even sarcasm, making interactions feel surprisingly human. ChatGPT's knack for generating coherent and contextually appropriate text on virtually any topic brought it into the spotlight.
All right, so we now know that LLMs are like super-readers that have devoured almost all the text on the internet and can chat in ways that feel surprisingly human. But how exactly do they learn from all that text? Let's find out.
A Thought Experiment
To better grasp how Large Language Models (LLMs) learn, let's engage in a thought experiment. Imagine your brain as a vast library with various sections where related words and concepts are grouped and stored. These sections may or may not be explicitly labeled, depending on how our brain organizes information. When we encounter a new word or concept, our brain instinctively places it near similar words based on its meaning and usage.
Consider an illustrative animation: as new words like "Pineapple," "Nvidia," and "Boston" appear, they naturally drift toward their respective categories. "Pineapple" moves closer to "fruits," "Nvidia" finds its place among "tech companies," and "Boston" aligns with "cities."
This visualization could also be imagined as a three-dimensional space where each axis represents one of these categories: fruits, cities, and tech companies.
For simplicity, our example uses just three dimensions. However, an LLM operates within a space of thousands of dimensions. This vast dimensional space allows LLMs to form nuanced understandings of language far beyond our simplified model.
Now, let’s assign hypothetical coordinates to the example words within this space to see how closely they align with their categories:
| Word | Coordinates [tech company, fruit, city] |
| --- | --- |
| Apple | [0.05, 5, 0.01] |
| Banana | [0.01, 6, 0.01] |
| Microsoft | [5, 0.01, 0.02] |
| Seattle | [0.02, 0.01, 6] |
We can compute the similarity between two words in this space by measuring the distance between them using high school math. The similarity score tells us whether two words or concepts are semantically similar. Some key formulas that can be used are listed below, with a small code sketch after the list:
1. Manhattan distance: Also known as city block distance, Manhattan distance measures the sum of the absolute differences of their Cartesian coordinates.
2. Euclidean distance: Euclidean distance measures the straight-line distance between two points (or vectors) in the vector space.
3. Cosine similarity: Cosine similarity measures the cosine of the angle between two vectors.
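To make these measures concrete, here is a minimal Python sketch that computes all three for the hypothetical 3D vectors from the table above (the coordinates are made up for the thought experiment, not real embeddings):

```python
import math

# Hypothetical 3D "embeddings" from the table above: [tech company, fruit, city]
apple   = [0.05, 5, 0.01]
banana  = [0.01, 6, 0.01]
seattle = [0.02, 0.01, 6]

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (close to 1 = same direction)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(manhattan(apple, banana))          # small: Apple and Banana sit close together
print(euclidean(apple, seattle))         # large: Apple and Seattle are far apart
print(cosine_similarity(apple, banana))  # close to 1: nearly the same direction
```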
From 3D to High-Dimensional Space
We visualized words as points in a three-dimensional space based on categories like fruits, cities, and tech companies. Now, imagine this space expanding into hundreds or even thousands of dimensions. Each additional dimension can represent another aspect or nuance of the words—be it emotional tone, formality level, context of usage, or any subtle linguistic feature.
Illustrating N-Dimensional Space
To illustrate, think of a word like "Apple." In a three-dimensional space, it’s categorized with fruits. But expand this into more dimensions:
One dimension measures its connotation (positive, neutral, negative).
Another might capture its association with technology due to the brand "Apple."
As we add dimensions, "Apple" finds a unique position in this high-dimensional space that reflects its multiple uses and meanings in different contexts.
Why N-Dimensions?
The reason LLMs use such a high number of dimensions is to capture the complexity and subtlety of human language. Every additional dimension allows the model to distinguish finer details and relationships between words and phrases. This capability is crucial when the model decides how closely related two pieces of text are, or how likely one word is to follow another in a sentence.
Complex Conceptual Relationships
In an n-dimensional space, complex concepts like "democracy," "justice," or "freedom" can also be represented. These aren’t just static points but dynamic ones that shift slightly depending on the discourse around them. For instance, "democracy" in a discussion about ancient Greece might align more closely with "polis" and "citizenship," whereas, in a modern political context, it may move closer to "elections" and "rights." Here is another example of this.
Consider the following sentences.
“I went to the store and bought 5 lbs. of Honeycrisp Apple.”
“Apple launches a new device called Vision Pro.”
The word ‘Apple’ appears in both sentences; however, in the first sentence it refers to the fruit, while in the second it refers to the tech company. So words shift their positions based on the context of the text. In the second sentence, ‘Apple’ would drift toward the tech-company axis in the example above.
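If you want to see this context effect for yourself, here is a small sketch using the Hugging Face transformers library (the choice of "bert-base-uncased" and this way of picking out the "apple" token are just illustrative assumptions): it extracts the contextual vector for "apple" from each sentence and compares them.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    # Returns the contextual embedding of `word` inside `sentence`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

fruit_apple = vector_for("apple", "I went to the store and bought 5 lbs. of Honeycrisp apple.")
tech_apple  = vector_for("apple", "Apple launches a new device called Vision Pro.")

# The same word gets two different vectors depending on context
print(torch.cosine_similarity(fruit_apple, tech_apple, dim=0))
```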
Now if we imagine our brain as a vast language model and pose a question: Write 1-2 sentences about the latest iPhone. In response, our brain swiftly searches through the relevant information it has stored about iPhones and might produce a statement such as: "The iPhone is the flagship product of a major technology company named Apple."
Conceptual Framework
1. Mental Embeddings: As a thought experiment, imagine that the human brain organizes concepts and words within a multidimensional mental space. Each word or concept is not just a point but a vector, where its position and relation to other vectors represent semantic and syntactic characteristics. For instance, synonyms might be stored close together, while antonyms are farther apart.
2. Dimensionality: In human cognition, dimensions could represent various linguistic features or abstract concepts such as tone, context, emotional charge, and more. When we engage in sentiment analysis or text generation, our brains likely navigate this space to select and sequence words that align with the desired sentiment or narrative structure.
Cognitive Process
Sentiment Analysis:
Emotional Dimensions: Humans perceive words and phrases through emotional dimensions that are part of our cognitive vector space. For example, the word "love" might be positioned closer to "happiness" and farther from "sadness," aiding in sentiment understanding.
Contextual Modulation: The brain adjusts the significance of these dimensions based on context, which can change the perceived sentiment of words or phrases. This dynamic adjustment allows for nuanced understanding beyond just basic word associations.
Text Generation:
Retrieval and Sequencing: When generating text, the mind retrieves relevant words from this n-dimensional space based on the context and the intended message. This involves traversing through related concepts and structuring them in a sequence that makes logical and grammatical sense.
Predictive Planning: Much like predictive text in LLMs, the human brain anticipates which words or syntactic structures are likely to come next, planning several words as we speak or write.
How do LLMs learn from texts?
How exactly our brain works could be best explained by neuroscientists. Nevertheless, the thought experiment we discussed above introduces several concepts employed by Large Language Models (LLMs). We'll go into each of these concepts to understand the high-level transformer architecture of LLMs.
Definitions and Key terms
LLMs: Language modeling is a subfield of NLP (natural language processing) that involves building a statistical model for predicting the likelihood of a sequence of tokens in a specified vocabulary. These models predict the next word one at a time, keeping in memory the context of the previously generated words. There are generally two kinds of language modeling tasks: autoencoding and autoregressive.
Tokens: A token is the smallest unit of semantic meaning, created by breaking down a sentence into smaller units. Tokens can be words or subwords. Tokenization is the process of breaking a sentence down into tokens (see the short sketch after this list).
Autoregressive: These are language models trained to predict the next token in a sentence, making them ideal for text generation. This corresponds to the decoder part of the transformer architecture.
Autoencoding: These are language models trained to reconstruct the original sentence from a corrupted version of the input. Their main applications are sentence classification and token classification.
An LLM may be autoregressive, autoencoding, or both.
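To make the idea of tokens concrete, here is a minimal sketch using the Hugging Face transformers GPT-2 tokenizer (the choice of tokenizer is just an example; the exact token pieces you get depend on the tokenizer used):

```python
# pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Apple launches a new device called Vision Pro."
tokens = tokenizer.tokenize(sentence)   # subword pieces
ids = tokenizer.encode(sentence)        # integer ids the model actually sees

print(tokens)  # common words stay whole, rarer words get split into sub-pieces
print(ids)
```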
Transformer Architecture
Here is the diagram from the original transformer paper ("Attention Is All You Need"). It may look intimidating, but we'll tease it apart and understand, in simple terms, what each component means.
The left side of the diagram is called the encoder and the right side is known as the decoder.
1. Input Embedding
What it does: Converts each input word into a vector. These vectors are learned representations that capture some semantic meaning of the words. This is like the vector representation of the words we saw above in the thought experiment: for the word ‘Apple’, the value was [0.05, 5, 0.01]. This vector tells us how close the word ‘Apple’ is to being a fruit, a tech company, or a city.
How it gets generated: These embeddings are produced by a neural network, such as word2vec, or by more recent models like "text-embedding-3-small"; in a transformer, the input embeddings are typically learned together with the rest of the model. (Earlier text-representation techniques such as TF-IDF and Bag of Words serve a similar purpose, but they are not learned embeddings.) We won't go into the details of the actual methods; the key idea is that each word in the input text is converted into a high-dimensional vector, much like converting the words "Apple" or "Banana" into 3D vectors in the thought experiment, but typically into vectors of much higher dimensionality, such as thousands of dimensions.
How to think about it: Imagine every word in the language has a unique key. This stage turns each key into a specific weight or value that a computer can understand and work with.
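In code, this stage is essentially a lookup table of learned vectors. A minimal PyTorch sketch, with the vocabulary size, embedding dimension, and token ids all chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 50_000   # number of distinct tokens the model knows
embedding_dim = 512   # dimensionality of each word vector (thousands in large models)

embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([[17, 4052, 93]])   # hypothetical ids for a 3-token sentence
vectors = embedding(token_ids)               # each id is looked up and becomes a vector
print(vectors.shape)                         # (1, 3, 512)
```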
2. Positional Encoding
What it does: Adds information about the position of each word in the sentence to the input embeddings. Since the attention mechanism itself has no built-in notion of word order, this helps the model know where each word occurs.
How to think about it: It’s like telling a story but making sure to emphasize the sequence of events so the story makes sense. Each word is tagged with a little reminder of its place in the line.
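The original paper uses fixed sinusoidal positional encodings that are simply added to the input embeddings. Here is a small sketch of that formula (some newer models learn positional embeddings instead):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Added to the word vectors so each position carries a unique "place in line" signal
encodings = sinusoidal_positional_encoding(seq_len=10, d_model=512)
print(encodings.shape)  # (10, 512)
```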
3. Multi-Head Attention
What it does: This is a key aspect of the architecture. Attention allows the model to focus on different words at the same time: when processing a word, the model can simultaneously pay attention to its relationship with all the other words in the sentence. At its core, attention is built on a measure of similarity between words. We won't go too deep into that, but some common ways to measure similarity are the dot product, cosine similarity, and the scaled dot product (the dot product divided by the square root of the vector dimension). The "multi-head" part means the model computes several such attention patterns in parallel, each with its own learned projections. Here is a quick example using the embeddings generated for our words in the experiment above: the dot product between Apple and Banana is high, meaning they are quite similar words.
| Word | Coordinates [tech company, fruit, city] |
| --- | --- |
| Apple | [0.05, 5, 0.01] |
| Banana | [0.01, 6, 0.01] |
| Microsoft | [5, 0.01, 0.02] |
| Seattle | [0.02, 0.01, 6] |
Dot product between Apple and Banana = 0.05 × 0.01 + 5 × 6 + 0.01 × 0.01 = 30.0006
Similarly, the dot product between Apple and Microsoft = 0.05 × 5 + 5 × 0.01 + 0.01 × 0.02 = 0.3002, a much smaller value.
How to think about it: Consider when you read a sentence and simultaneously understand the context provided by earlier words while considering how they relate to what you’re currently reading.
Keys, Queries and Value Matrices: The transformer paper describes attention as a function mapping a query and a set of key-value pairs to an output, where the queries, keys, values, and output are all vectors. In simple terms, they determine how to focus on different parts of the input data. So what are keys, queries, and values?
Analogy: Searching for Information in a Book
Imagine you are looking for information in a textbook about "photosynthesis." In this scenario:
Query: This is like your question or what you are specifically looking for in the book. In our case, the query is "photosynthesis."
Keys: These are like the index terms in the back of the book. Each key corresponds to a specific topic or term listed.
Values: These are the actual contents or the detailed information you find on the page numbers listed in the index under each key.
Let’s take the example described earlier and apply attention to that.
We know the words in the picture above are vectors in 3D, and we want the fruits and the tech companies to separate further, while the ambiguous "Apple" moves toward tech companies or fruits depending on the context.
So we multiply each word vector by one matrix (the keys matrix) and by another matrix (the queries matrix).
The result is a linear transformation that changes the embedding, and it might look like the picture below.
This transformed embedding is now better for calculating similarities. It is, however, not good for predicting the next word in the sequence, which is what a language model is supposed to do. The reason is that this embedding knows more things (features) about the words and, based on that, separates similar words from dissimilar ones, but it doesn't know when two words could appear in the same context. This is where the values matrix comes into the picture.
So we multiply the resulting embedding with another matrix called "Values".
This could transform the embedding space like the picture below.
Notice the words are now closer together, which makes this embedding better suited for predicting the next word, whereas multiplying by the keys and queries matrices pushed them further apart. The embedding produced by multiplying with the values matrix is what is used to move the words around.
This raises the question: how do we get the keys, queries, and values matrices? These matrices are learned during training, alongside the feed-forward network (step 5).
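Putting keys, queries, and values together, the whole attention step from the paper, softmax(QKᵀ / √d_k) · V, fits in a few lines. A minimal sketch with a single attention head and toy dimensions (in a real model the three weight matrices come from training, not random initialization):

```python
import torch
import torch.nn.functional as F

d_model, d_k = 512, 64
x = torch.randn(1, 5, d_model)   # a batch of 1 sentence with 5 token embeddings

# The three learned matrices; here they are random placeholders
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q = x @ W_q   # queries: what each word is looking for
K = x @ W_k   # keys: what each word offers to be matched on
V = x @ W_v   # values: the content that gets mixed together

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # scaled dot-product similarity
weights = F.softmax(scores, dim=-1)             # how much each word attends to the others
output = weights @ V                            # context-enriched representation of each word
print(output.shape)                             # (1, 5, 64)
```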
4. Add & Norm
What it does: Adds the input of the attention mechanism back onto its output (a residual connection) and normalizes the result (layer normalization). This keeps the values obtained after the math operations within a stable range and makes training more reliable.
How to think about it: This is akin to smoothing out and adjusting the volume of different instruments in a song so that no single instrument overwhelms the others, making the sound pleasant and balanced.
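In code, "Add & Norm" is just a residual connection followed by layer normalization. A tiny sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)

x = torch.randn(1, 5, d_model)                  # sub-layer input (e.g. the embeddings)
attention_output = torch.randn(1, 5, d_model)   # pretend this came from multi-head attention

# Add the original input back in (residual) and normalize so values stay well-behaved
out = layer_norm(x + attention_output)
print(out.shape)  # (1, 5, 512)
```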
5. Feed Forward
What it does: This is a neural network that applies the same set of transformations to each position separately and identically. It can be thought of as a specialized layer that adjusts the features independently at each position in the sequence. The keys, queries, and values matrices, together with these feed-forward weights, are learned by minimizing a loss function during training.
How to think about it: Think of this as a personalized adjustment for each word after considering its context, fine-tuning the understanding of each word.
How Feed Forward and Attention Work Together
Attention Mechanism: Each layer of the transformer starts with the attention mechanism, which helps the model focus on relevant parts of the input for each output element it generates. The attention mechanism uses three sets of weights (matrices) that it learns during training:
Queries: Generated based on the current input or output that is being processed.
Keys: Generated from the input data, representing different features of the input.
Values: Generated from the input data, representing the content that corresponds to the keys.
The attention mechanism computes the alignment (or relevance) between queries and keys, usually through a dot product, followed by a softmax function to normalize the scores into probabilities. These probabilities determine how much each value (part of the input) should contribute to the output at this step.
Feed Forward Network: After the attention mechanism has created a context-enriched output, each position in this output independently passes through the Feed Forward Network (FFN). The FFN consists of two linear transformations with a non-linear activation function in between: FFN(x) = max(0, x·W1 + b1)·W2 + b2.
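A minimal sketch of that position-wise feed-forward network (the dimensions follow the original paper; real models vary):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # first linear transformation
        self.linear2 = nn.Linear(d_ff, d_model)   # second linear transformation

    def forward(self, x):
        # Applied to every position in the sequence independently and identically
        return self.linear2(torch.relu(self.linear1(x)))

ffn = FeedForward()
x = torch.randn(1, 5, 512)   # attention output for a 5-token sentence
print(ffn(x).shape)          # (1, 5, 512)
```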
Training Key, Query, and Value Matrices
Training Process: The matrices for keys, queries, and values are trained through backpropagation, the standard training method for neural networks.
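At a very high level, the loop looks like standard supervised learning: predict the next token, measure the error with cross-entropy, and let backpropagation adjust every weight, including the key, query, and value matrices. A schematic sketch, where `model` is a stand-in for any transformer language model that maps token ids to next-token scores:

```python
import torch
import torch.nn as nn

# `model` is assumed to map token ids to logits of shape (batch, seq_len, vocab_size)
def training_step(model, optimizer, token_ids):
    inputs = token_ids[:, :-1]    # all tokens except the last
    targets = token_ids[:, 1:]    # the same tokens shifted by one (what comes next)

    logits = model(inputs)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten all positions
        targets.reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()    # backpropagation: gradients flow into W_q, W_k, W_v, FFN weights, ...
    optimizer.step()   # every learned matrix gets nudged to reduce the loss
    return loss.item()
```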
6. Output Embedding and Linear + Softmax
What it does: Converts the decoder's output to a final word prediction. The linear layer maps the deep representation to a much larger space that represents the vocabulary, and Softmax converts these scores into probabilities.
How to think about it: This is the model making its best guess about what the next word should be, based on everything it knows up to that point, and then calculating how confident it is in each possible next word.
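A minimal sketch of this last step: project the decoder's final hidden vector to one score per vocabulary word, turn the scores into probabilities with softmax, and pick (or sample) the next token. Dimensions are illustrative:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 50_000
to_vocab = nn.Linear(d_model, vocab_size)    # the final linear layer

hidden = torch.randn(1, d_model)             # decoder output for the current position
logits = to_vocab(hidden)                    # one score per word in the vocabulary
probs = torch.softmax(logits, dim=-1)        # scores -> probabilities that sum to 1

next_token_id = torch.argmax(probs, dim=-1)  # greedy choice; real systems often sample instead
print(next_token_id)
```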
Conclusion
This post has covered a lot in terms of understanding how language models function. Hopefully, it has provided a glimpse into how these models learn from vast amounts of text and predict subsequent words in sequences, making interactions with Large Language Models (LLMs) seem almost human-like. In our next discussion, we will look into the potential applications of LLMs, exploring how they can be used not only for chatbot interactions but also for creating agents capable of taking action and automating business processes.