Running LLaMA in a notebook

From the course: LLaMa for Developers

- [Narrator] Let's run Llama in our own cloud environment. You can find the code for this exercise under branch 01_02 on GitHub. For this course, I'm going to be using Google Colab, which is a free Jupyter Notebook environment. Let's go ahead and launch it by hitting Open Colab. In Colab, I can start off by either uploading a file or using the GitHub integration. On the GitHub integration side, I have my personal GitHub here, where we can go ahead and switch to the LinkedIn Learning account. With the GitHub integration, you'll be able to clone either your own repository or any public repository. Let's go ahead and select our Llama course and select our branch, which is 01_02. In this case, I've already downloaded the code, so let's go ahead and upload it. It's 01_02 as a Jupyter Notebook. Let's hit Open, and here we go.

So at the top here, I have some instructions for running this model. While Llama is an open source model, it does have some conditions associated with it. Let's go ahead and read them. To gain access to Llama, you need to request it by filling out your first name, last name, and some personal information. You can then select which models you want access to and read this agreement to make sure you understand what's going on. The general agreement states that you won't do anything harmful with the models, but I'd go through and give it a thorough read. After accepting the conditions, we can go ahead and gain access to the model on Hugging Face. I'm here under the meta-llama organization, looking at Llama-2-7b-chat-hf. You can see it states that this is a gated model, but I've been granted access. On your screen, you may see something like this: Access Llama 2 on Hugging Face. In that case, you'll enable Llama 2 on Hugging Face by gaining access through Meta, which is the form that we saw earlier. When you gain access, you should receive an email looking something like this, letting you know that your request to access the model meta-llama/Llama-2-7b on huggingface.co has been accepted. All right, great. Now let's head back to our repository.

So we've read the license and requested access to Llama 2 on Hugging Face. Now let's go ahead and create our Hugging Face token. To do so, we need to go back to Hugging Face and create an account. In this case, it was pretty easy, just using my Gmail. Afterwards, we can go to Settings, go to Access Tokens, and grab our token here. I've created one called LinkedIn, and I'm going to copy it. Let's head back to our repository and this Colab notebook, scroll down, and add our token. We can see it right here in this access token portion. I'm going to go ahead and paste it.

Okay, let's scroll to the top. Our next step is to install these libraries. We're going to install the Transformers library, bitsandbytes, and Accelerate. Let's go ahead and do that: we're going to click into this box here and hit Shift+Enter. What will happen here is we'll initialize our GPU access, in this case to a T4 GPU, and install all our repository dependencies. All right, let's go to the next portion. In this portion, we're going to import our Python dependencies, transformers and PyTorch. And here we have our model name, which is Llama-2-7b-chat-hf, our access token, and our quantization configuration. We don't need to worry too much about that; we'll discuss it in another video. And with these final two lines, we're going to download the model and the tokenizer.
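As a rough sketch, the install and model-loading cells described above look something like the following. The variable names, the 4-bit settings, and the generation of the quantization config are illustrative assumptions rather than the course's exact code, and the token string is a placeholder for your own Hugging Face access token.

!pip install transformers accelerate bitsandbytes

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
access_token = "hf_..."  # placeholder: your Hugging Face access token

# 4-bit quantization so the 7B model fits on a single small GPU such as a T4
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Download the model weights and the tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
    token=access_token,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)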
This will take a little bit of time. Now we can see here that the model has been downloaded and we have initialized the tokenizer. If we keep scrolling down, we can see we can run our prompt. So I'm going to run this prompt: what is two plus two, answer concisely? Let's hit Shift+Enter. Looking at the previous output, you could see that Llama 2 with 7 billion parameters can be a little bit verbose, so let's hope that this time the answer is actually concise.

All right, here we go, we got a response. In this case, the response was quite verbose, and that's because Llama 2 7B is the weakest of the Llama family. Now, the reason that we had to use a weaker model is that we had to fit it on just one small GPU. We're running a T4, which has 16 gigabytes of video memory. That might seem like a lot, but for large language models, it's a very small amount. If we scroll up, you can also see that we're running the model in four-bit quantized mode, meaning that we're only using four bits per parameter, rather than the standard 32 bits, to hold the model weights. This will also affect the performance. So there we go. That's how we can run our model in a Colab environment. In the next video, we're going to cover how we can run a Llama model in an enterprise environment.
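For reference, the prompting step in the notebook amounts to something like the sketch below. The prompt text comes from the video; the generation settings (such as max_new_tokens) are illustrative assumptions, not the course's exact values.

prompt = "What is two plus two? Answer concisely."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # tokenize and move to the GPU
outputs = model.generate(**inputs, max_new_tokens=64)             # generate a short completion
print(tokenizer.decode(outputs[0], skip_special_tokens=True))     # print the model's answer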
