At the recent AWS Global Summit in NYC, there was mention of a course called Generative AI with Large Language Models, a collaboration between AWS and DeepLearning.AI, which seemed like a good way to learn more about the workings of LLMs. I recognized Andrew Ng from his hit AI for Everyone course back when I was last working on Kaggle competitions, and decided to go through the course to see how the new wave of Generative AI was similar to the Machine Learning challenges I’d done in the past. It was interesting to see both how the technical process of training an LLM compares to what I covered in my previous Kaggle blogs and the recent history of papers that got us to where we are today. This post is a blog version of my notes and takeaways from the first week of the three-week course, covering the basics of generative AI, transformers, prompt engineering and the LLM project lifecycle.
Generative AI
Generative AI is a subset of machine learning in which models have learned human-like abilities by finding patterns in their training data sets. They use data sets originally generated by humans at an extremely large scale, with today’s top models being trained on trillions of words over the course of weeks or months and billions of parameters being tuned to achieve their current performance. These “foundation” models are usually general-purpose models which a company may wish to train further to create an AI with specific domain expertise. Some of the most popular ones today are GPT, BLOOM, FLAN-T5, LLaMA, PaLM and BERT. Starting with a foundation model can save a lot of time over training a new model from scratch.
Once you have a machine learning model, or LLM, you use it by passing it a prompt that fits within its allowed size (the context window). The model takes the prompt and predicts the words which will follow it, outputting them as a completion. LLMs can be useful for a variety of tasks such as summarizing or asking questions of text, translating languages, or generating code. With additional setup, LLMs can call external APIs to extend their knowledge.
Transformers
One recent invention that’s fueled the Generative AI craze is the Transformer, an architecture which allows for the processing of much larger inputs and contexts than the previously used RNNs. In 2017, researchers from Google and the University of Toronto published the paper Attention is All You Need, which introduced the concept to the public and led to today’s technological revolution. The transformer can be scaled across multi-core GPUs and can process input data in parallel to handle larger inputs than before, and it learns to pay attention to the meaning of the words it processes. To do this, the Transformer learns the relevance of each word to every other word in the input, applying attention weights to those relationships to learn context. These weights are learned during LLM training and used during prediction.
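As a rough illustration of the attention weights described above, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation from the paper. The tiny 2-dimensional vectors and the function names are made up for illustration. A real transformer also learns separate query, key and value projections and runs many attention heads in parallel on a GPU, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d vectors for a three-token input (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` says how much one token attends to every token in the input, which is the "relationship between words" the model learns to weight.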
Transformers have two distinct parts, an encoder and a decoder, which work together to process input and generate text. The encoder encodes input prompts with contextual understanding and produces one vector per input token. The decoder accepts input tokens and generates new output tokens. Different types of models make use of different combinations of encoders and decoders. Encoder-only models such as BERT are useful for tasks like sentiment analysis (was a review positive or negative?). Decoder-only models such as GPT, BLOOM and LLaMA are some of the most commonly used ones today. BART and T5 are examples of models which use both an encoder and a decoder, which makes them good for sequence-to-sequence tasks like translation as well as text generation.
Tokenization
Machine learning models are effectively statistical calculators that work with numbers, not words, so text must be tokenized before being passed to a model. Tokenization converts words into numbers, with each number representing a position in the dictionary of all possible tokens the tokenizer can work with. A token ID can stand for a full word or part of a word, but once you select a tokenizer to train the model with, you must use the same one to generate text. Once tokenized, the input is passed to an embedding layer, a trainable high-dimensional vector space where each token is represented as a vector occupying a unique location. Positional embeddings are added to the token embeddings so that information about word order is not lost.
Once attention weights are applied, the outputs are processed through a feed-forward network. The output is a vector of logits, one raw score for each token in the tokenizer dictionary. These go to a softmax layer where they’re normalized into a probability for each possible token. Lastly, one token is selected, by default the one with the highest probability. The output (including the original input) is run as a new input and this process is repeated until an end-of-sequence token (one of the tokens in the dictionary) is generated.
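The loop described above (tokenize, get logits, softmax, pick a token, feed it back in) can be sketched in a few lines of plain Python. Everything here is a toy assumption: the five-entry vocabulary, the whitespace tokenizer and especially `toy_model`, which stands in for a trained network by simply favouring the next ID in the vocabulary. It only shows the shape of the process, not how a real LLM computes its logits.

```python
import math

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Map whitespace-separated words to token IDs (real tokenizers also split words into sub-words)."""
    return [TOKEN_TO_ID[w] for w in text.lower().split()]

def softmax(logits):
    """Normalize raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(token_ids):
    """Stand-in for a trained LLM: returns one logit per vocabulary entry.
    It favours the token whose ID follows the last input ID, wrapping around to <eos>."""
    next_id = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]

def generate(prompt, max_new_tokens=10):
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(ids))
        next_id = probs.index(max(probs))  # greedy: take the most likely token
        if next_id == TOKEN_TO_ID["<eos>"]:
            break  # stop once the end-of-sequence token is predicted
        ids.append(next_id)  # the output becomes part of the next input
    return " ".join(VOCAB[i] for i in ids)

print(generate("the cat"))  # prints "the cat sat down"
```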
Prompt Engineering
Prompt engineering is the technique of tuning the prompt that’s passed to an LLM to try to get a better response, and it is worth trying before turning attention to updating the model itself. For instance, providing an example of the task in the prompt passed to the model often improves performance. Where that isn’t sufficient, you can try including multiple examples. These approaches are known as one-shot inference and few-shot inference respectively (a prompt with no examples is zero-shot inference), and both are forms of in-context learning. Smaller models may benefit more from in-context learning, while larger models are better at generalizing without examples. Still, there is a limit to the context window, and if 5 or 6 examples aren’t enough, it’s generally a sign the model needs to be trained further.
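To make the difference between zero-, one- and few-shot prompts concrete, here is a small sketch of a prompt builder for a sentiment task. The function name, the instruction text and the `Sentiment:` label are my own choices for illustration, not a format the course or any library requires.

```python
def build_prompt(instruction, examples, query, answer_prefix="Sentiment:"):
    """Assemble a prompt; the number of (text, answer) examples decides
    whether it is zero-shot (0), one-shot (1) or few-shot (2 or more)."""
    parts = []
    for text, answer in examples:
        # Each worked example shows the model the task and its expected answer.
        parts.append(f"{instruction}\n{text}\n{answer_prefix} {answer}")
    # The final block leaves the answer blank for the model to complete.
    parts.append(f"{instruction}\n{query}\n{answer_prefix}")
    return "\n\n".join(parts)

instruction = "Classify this review:"
zero_shot = build_prompt(instruction, [], "The plot was slow and dull.")
few_shot = build_prompt(
    instruction,
    [("I loved this movie!", "Positive"), ("Terrible acting.", "Negative")],
    "The plot was slow and dull.",
)
```

Every added example consumes part of the context window, which is why this approach stops scaling after a handful of examples.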
Inference Parameters and Configuration
When generating text with a model, a variety of configuration parameters can be used to change how the model behaves at inference time. A few of these are:
- Max new tokens – Limit to how many tokens the model generates
- Sampling type:
  - Greedy: always selects the token with the highest probability (the most common default)
  - Random sampling: picks a token at random, weighted by probability; it can be constrained with strategies such as:
    - Top-k: sample only from the k tokens with the highest probabilities
    - Top-p: sample only from the most likely tokens whose combined probability does not exceed p
- Temperature: reshapes the probability distribution before sampling; a higher temperature means higher randomness
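The sampling settings listed above can be sketched in plain Python to show what each one does to the probability distribution. This is an illustrative sketch with made-up function names, not how any particular library implements them. The top-p cutoff follows the "not exceeding p" wording used here, and some libraries define the boundary slightly differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Greedy decoding: index of the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the most likely tokens while their combined probability stays within p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [order[0]], probs[order[0]]  # always keep the top token
    for i in order[1:]:
        if cumulative + probs[i] > p:
            break
        keep.append(i)
        cumulative += probs[i]
    return [probs[i] / cumulative if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng=random):
    """Random sampling: pick a token index weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs)[0]

probs = [0.5, 0.3, 0.15, 0.05]      # hypothetical next-token probabilities
token = sample(top_p_filter(probs, 0.85))  # samples from the first two tokens only
```

Max new tokens is simply the cap on how many times this selection step is repeated.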
Generative AI Project Lifecycle
AWS and DeepLearning.AI suggest a project lifecycle for an LLM-based application.
- Determine scope and define use case
- Decide whether to use an existing base model or train your own
  - Generally, you use a foundation model
- Adapt and align the model via prompt engineering, fine-tuning, etc.
  - Update, evaluate, repeat
- Application integration
  - Optimize and deploy model
  - Update or build app to utilize it
Training our first LLM – Summarizing Dialogue
To round out the first week, the course has a lab in which we use a FLAN-T5 model from Hugging Face to summarize dialogue, using the DialogSum dataset, which consists of more than 10,000 dialogues with manually added summaries and topics. The lab walks us through using PyTorch along with Hugging Face’s datasets and transformers pip packages. The course provides a SageMaker Studio environment in which to run the process of importing the model, dataset and tokenizer, encoding our prompt, and using the model to generate a completion. It serves as a hands-on introduction to using LLMs with Python and demonstrates using in-context learning to improve responses.
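The flow of the lab looks roughly like the sketch below. This is my own reconstruction from memory rather than the course notebook: the `google/flan-t5-base` checkpoint name, the "What was going on?" prompt wording and both function names are assumptions. The `transformers` calls used (`AutoTokenizer`, `AutoModelForSeq2SeqLM`, `generate`, `decode`) are the library's standard ones, and running `summarize` requires the `transformers` and `torch` packages plus a model download.

```python
def build_summary_prompt(dialogue, examples=()):
    """Zero-shot prompt by default; pass (dialogue, summary) pairs for one/few-shot."""
    parts = [f"Dialogue:\n{d}\n\nWhat was going on?\n{s}" for d, s in examples]
    parts.append(f"Dialogue:\n{dialogue}\n\nWhat was going on?\n")
    return "\n\n\n".join(parts)

def summarize(dialogue, examples=(), model_name="google/flan-t5-base", max_new_tokens=50):
    """Encode the prompt, generate a completion and decode it back to text."""
    # Imported lazily so the prompt helper above works without these packages installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # must match the model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(build_summary_prompt(dialogue, examples), return_tensors="pt")
    output_ids = model.generate(inputs["input_ids"], max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage (downloads the model on first run):
# print(summarize("#Person1#: What time is it?\n#Person2#: It's ten to nine."))
```

Passing one or more (dialogue, summary) pairs as `examples` turns the same call into one-shot or few-shot inference.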
The first week of the course provided a good introduction to Generative AI and how it works, as well as a hands-on exercise in using a model via Python, SageMaker and PyTorch. Notably, it was neat to get a better understanding of the transformer itself and to see, via SageMaker and PyTorch, how similar much of the code for training and utilizing models is to the machine learning of a few years back. The next week goes deeper into training, scaling and evaluating the performance of these models, and I’m enjoying learning more about the inner workings of these revolutionary technologies.




