Tech Talk: A Lesson in LLMs
Exploring the processes that make ChatGPT the powerhouse that it is.
By now, the hype around Large Language Models (LLMs) has reached a fever pitch. Everyone is racing to integrate these impressive applications of math and language into their software/services, allowing users to take advantage of their power within the context of their business. Amidst all this craziness, we think it’s important to slow down occasionally and take stock of all the things that make these models so cool! In this series of articles, we are going to get a little more technical than we usually do (just a little, we promise) and dive into the architecture and systems of LLMs to help make this state-of-the-art AI, which often seems like incomprehensible magic, into something a little more tangible. Who knows, maybe you’ll learn something that’ll impress your friends (or your boss).
More than meets the eye.
You probably get the gist of what a Large Language Model (LLM) can do: put in a prompt and get a response. You also probably know a couple of the key talking points about the models themselves: billions of parameters, trained on millions of lines of text, capable of generating comprehensive instruction sets or works of literature, yada-yada. But how much do you really know about LLMs or, more accurately, transformers: the model architecture behind the gargantuan AI services, like ChatGPT, that you have come to know and love and have probably used to make your job just a tiny bit easier (I know I certainly have)?
If you’ve never heard of transformers before, you’ve come to the right place. This article—the first in our slightly more technical series on LLMs—is going to provide a bird’s eye view of the architecture and processes that go into making these tools work, as well as some of the challenges of working with these complex machines. For now, we’re just going to focus on three major questions:
- What is a transformer?
- How does it work?
- How did we teach it to do all this?
As mentioned previously, there will be other articles in this series that will offer a deeper focus on some of the details mentioned here, but for now, we hope that this serves as a good introduction to the models themselves and some of the vocabulary that we will use throughout this series.
What is a Transformer?
To understand what makes a transformer unique, I think it is highly beneficial to view it in comparison to other models that you may have heard about. Many models in classical machine learning and artificial intelligence fall into the category of classification models or regression models. There are nuances to the differences between these models, but they have one thing in common: they are only good at processing singular observations with numeric data. In other words, they process one record-label pair at a time, and for each input there is only one output they can predict once trained: either a number, in the case of a regression model, or a label, in the case of a classification model. All the content in that record must be a number, and all records fed into the model must have a standard set of features of equal length and content. Now, to get you engaged in thinking about the models, I want to ask you: why do you think this would be a major problem for models like the one behind ChatGPT?
It’s because we’re not predicting single observations, but rather a sequence of observations in the form of a text string. For example, let’s say we wanted to predict the undergraduate major for an individual based on a set of data that we have. We would take the data associated with that person, pass it into a classification model, and receive a single output, such as “Marketing” or “Philosophy”, depending on what the model predicts based on the data in that single observation (i.e. the person’s data). Now, what happens if we want to use a model to generate a congratulations message for that person? We still need to know their undergraduate major, but just knowing the major won’t help us because these are completely different problems to solve! For this we will need a model capable of processing, learning from, and generating sequences of data from inputs like: “Write a congratulations letter for a graduate with a degree in data science.”
The models capable of doing so are called sequence-to-sequence models, and they are a special application of generative models. We won’t cover other architectures in this article, but if you are interested, some that you might be familiar with in this domain are RNNs, GANs, and Stable Diffusion models. The sequence-to-sequence model archetype is pretty self-explanatory, but to be explicit: it takes in a sequence of data (most often a natural language string of text) and produces a sequence in return. It is capable of doing this because, even though it requires a standard number of inputs like the other models, it can understand what an empty space is by using special tokens. As an aside, “token” is the general term for each unit of an input sequence passed into the model, but for the time being you can think of tokens as individual words.
Above are two sequences that can be provided to the same transformer architecture. One sequence (top) uses all available input positions for the transformer, while the other (bottom) uses special tokens to ensure consistent length.
The model is capable of taking in both of these token sequences because it is capable of learning the meaning of the blank token, `[BLK]`, in addition to the meanings of the start and stop tokens, `[BEG]` and `[END]`. These special tokens do a lot of the work in understanding the relative position of each token in the input sequence. This is most explicitly relevant to the process of training, but before we get to that, it’s important to develop a deeper understanding of the transformer’s internal systems and how they contribute to the “learning” of the model.
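To make this concrete, here is a minimal Python sketch of that padding idea, using the special-token names from the figure above. (This is purely illustrative; real tokenizers use their own names, such as `[PAD]` or `[SEP]`, and split text into subword pieces rather than whole words.)

```python
# A toy illustration of padding sequences to a fixed length with special tokens.
# [BEG], [END], and [BLK] match the figure above; real tokenizers differ.

MAX_LEN = 8  # every sequence the model sees must have exactly this many tokens

def pad_sequence(words, max_len=MAX_LEN):
    """Wrap a word list in start/stop tokens and fill the rest with blanks."""
    tokens = ["[BEG]"] + words + ["[END]"]
    if len(tokens) > max_len:
        raise ValueError("sequence too long for this toy model")
    return tokens + ["[BLK]"] * (max_len - len(tokens))

print(pad_sequence(["the", "dog", "barked"]))
# ['[BEG]', 'the', 'dog', 'barked', '[END]', '[BLK]', '[BLK]', '[BLK]']
```

Whatever the input, the model always receives the same number of tokens, and it learns during training that `[BLK]` carries no meaning of its own.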
How does it work?
It is impossible to talk about transformers without discussing the underlying mechanism that made them so revolutionary: self-attention. The original paper Attention is All You Need (Vaswani et al., 2017) is a highly technical (and very mathy) exploration of the original architecture, and I commend you if you attempt to read it, but a colloquial exploration of the mechanism should be enough to understand what is happening when a machine is “learning” the language that it is being fed.
Imagine that you have a group of friends and you are trying to plan a weekend getaway. There are a lot of things that you need to consider for each individual: what they like, what they are capable of doing, how much they are willing to contribute to the trip, and so on. It makes sense to look at them individually, but it would be more valuable to look at them interdependently. This means looking not only at each person’s individual traits but also at their dynamics and social relationships with others in the group. For example, suppose you’ve decided to rent a house for everyone to stay in and are deciding on room arrangements. Considering people individually might assign rooms based only on each person’s preferences and needs, but if two of them are married, it makes no sense to assign them independently and have them sleep in separate rooms. The interdependent consideration allows you to optimize much better and leaves everyone more comfortable with the arrangements in the end. This is, in essence, what the self-attention mechanism in a transformer does.
By having a set of values associated with each token in a sequence, we can encode information not only about the content of each token but about its position and relationship to all other tokens in the sequence. This is important because, as you well know, context matters a lot in conversations, especially when dealing with homonyms like “read” and “bark”. Using the self-attention mechanism (and some rather complicated mathematics), it is possible to capture the semantic meanings and associations that we all know as native speakers within the parameters of a model. These parameters, which exist in multiple parts of the transformer architecture including the self-attention mechanism, are sets of numeric values that are used to map the associations from training data and apply them to future inputs. Every time you ask ChatGPT a question, the model uses the parameters inside of it to predict possible tokens that it should return at each point in the response sequence.
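For the curious, here is a bare-bones sketch of the scaled dot-product attention calculation at the heart of the mechanism. The random matrices below are stand-ins for the learned parameters discussed above, and a real transformer stacks many such attention “heads” across many layers:

```python
import numpy as np

# A minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017).
# Each row of x is one token's embedding; the projection matrices are random
# placeholders for parameters a real model would learn during training.

rng = np.random.default_rng(0)
seq_len, d = 4, 8                      # 4 tokens, 8-dimensional embeddings
x = rng.normal(size=(seq_len, d))      # token embeddings for one sequence

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v    # queries, keys, values

scores = Q @ K.T / np.sqrt(d)          # how strongly each token relates to the others
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V                   # each token's new, context-aware representation

print(weights.round(2))                # each row sums to 1: one token's attention
```

The key takeaway is the `weights` matrix: every token gets a full row describing how much it should “pay attention” to every other token, which is exactly the interdependent view from the trip-planning analogy.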
At this point, we have covered the broad strokes of transformers, but one question remains: how do we go from the words we put into the chat box for models like ChatGPT to the sets of learned numbers in the parameters of a model? The answer is through the use of something called embeddings. Embeddings are a tricky thing to conceptualize and will be receiving their own article in this technical series to help you understand them like a pro, but the simplest way to view them is as location values (i.e. coordinates) where similar tokens are placed close together.
Let’s return to the example of your group of friends and the trip you’re planning. Suppose you rent out a private room at a restaurant for the first night to celebrate your trip. You wouldn’t want to seat dissimilar people next to each other, because they would have nothing to talk about or, at worst, start a fight over something they are both passionate about. To avoid this, you come up with a single word that defines each person’s traits, hobbies, or interests and use it to evaluate how well they’d get along. If one person is really into hunting and another is really into hiking, you might put them together, but if one person in attendance is a vegan activist, you might place them on the side of the person into hiking, but as far away from the person into hunting as possible. If you gave each seat a set of numbers to identify its position, this set of numbers would be the embedding for that associated word. An important note is that the process of calculating the embeddings for each word requires an extensive “training” process, but this example should represent the concept adequately for the time being.
How did we teach it to do all this?
Okay, so now that we have an understanding of what goes on inside the model, we can begin to think about how these components work together in the training process so that an LLM can learn to respond to any inputs in a reasonably natural manner. The simplest way to understand the training procedure is as a “guess and check” system, wherein a token in the input sequence is hidden and the model attempts to fill in the missing token using the parameters it has inside of it. If the model guesses correctly, the parameters are solid and do not need to be changed; however, if it guesses incorrectly, the numbers in the parameters are modified slightly to reflect a change needed so it can guess correctly in the future.
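As a drastically simplified sketch, here is that “guess and check” loop in Python. A count table stands in for the model’s parameters, and bumping a count stands in for the (much more mathematical) gradient updates a real model makes:

```python
# A toy "guess and check" training loop. The model's only "parameter" is a
# table counting which word follows which; a wrong guess nudges the table,
# which captures the spirit (not the math) of real parameter updates.

counts = {}  # counts[previous_word][next_word] -> how often reinforced

sentences = [
    ["the", "dog", "barked"],
    ["the", "cat", "meowed"],
    ["the", "dog", "slept"],
]

for _ in range(20):                       # several passes over the training data
    for sentence in sentences:
        for i in range(1, len(sentence)):
            prev, hidden = sentence[i - 1], sentence[i]  # hide one token
            options = counts.get(prev, {})
            guess = max(options, key=options.get) if options else None
            if guess != hidden:           # wrong guess: adjust the "parameters"
                counts.setdefault(prev, {}).setdefault(hidden, 0)
                counts[prev][hidden] += 1

print(counts["the"])                      # what the model learned follows "the"
```

After a few passes, the table reflects the patterns in the data, just as a real model’s parameters come to reflect the patterns in its (vastly larger) training corpus.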
Additionally, fine-tuning is a similar process, except the model has already been trained: the parameter values it modifies at each step were already learned from training on a general dataset of input sequences.
As you may have noticed, this is more akin to a fill-in-the-blank method of prediction and is not the typical question-and-answer application of LLMs that you are familiar with. This is because the responses that you receive from the models are not answers, but rather completions. This means that when you submit a message to ChatGPT, instead of responding to your input prompt, it is providing predicted tokens as though it is filling in the end of a scripted dialogue; in other words, the robot is essentially talking to itself and has no idea that you even exist.
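A tiny sketch makes this “completion” behavior concrete. The lookup table below is a made-up stand-in for billions of learned parameters, but the predict-and-append loop is the same basic idea:

```python
# Why a response is a "completion": the model repeatedly predicts the next
# token and appends it to the running text. There is no separate notion of
# "your message" versus "its reply" inside the loop; it just keeps going
# until it predicts a stop token.

next_token = {
    "congratulations": "on",
    "on": "your",
    "your": "graduation",
    "graduation": "[END]",
}

tokens = ["congratulations"]                  # the user's prompt, already tokenized
while tokens[-1] != "[END]":
    tokens.append(next_token[tokens[-1]])     # predict-and-append, one token at a time

print(" ".join(tokens[:-1]))  # congratulations on your graduation
```

From the model’s point of view, the prompt and the response are one continuous sequence; the chat interface simply shows you the part it appended.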
If you happened to develop an emotional connection with one of these models before reading this article: I’m sorry. Otherwise, this about wraps it up for our first foray into this slightly-more-technical series and I hope that you have enjoyed learning about these revolutionary new models that are rapidly changing the landscape of modern business and workflows.
We hope that now when you encounter the concepts around transformers like embeddings, self-attention, and training/fine-tuning you will be more equipped to participate in discussions around LLMs and understand what’s going on under the hood a bit better when you use them. Additionally, this and other articles in this series will allow you to engage better with our development diaries as we pursue the exciting new endeavor of boodleGPT, an LLM integration that helps you think about, analyze, and act on your customer/donor data.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS ’17). Curran Associates Inc., Red Hook, NY, USA.