Hosting an Open Source LLM on Google Colaboratory
In this article, we will go over how to pick an off-the-shelf model from HuggingFace and host it on Google Colaboratory. This article assumes you're familiar with Jupyter Notebook; if not, it's easy to learn here. The final Jupyter Notebook is at the bottom of the article.
Model Selection
First, we will select a model from HuggingFace. Since we will be using llama.cpp to run the model on a GPU, we have to pick a model in GGUF format. The main contributor of these conversions is typically TheBloke, who converts models from their original .bin format to .gguf and quantizes them so they run more efficiently on a consumer-grade GPU. For this article, we will be using the LLaMA 2 7B model.
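If you want to see which quantized GGUF files a repo offers before downloading anything, a quick optional sketch with the huggingface_hub client looks like this (the repo name matches the model we use later; the exact file list on HuggingFace may change over time):
# Optional: list the GGUF files available in the repo before picking one
from huggingface_hub import list_repo_files
files = list_repo_files("TheBloke/Llama-2-7B-GGUF")
gguf_files = [f for f in files if f.endswith(".gguf")]
print("\n".join(gguf_files))  # e.g. llama-2-7b.Q4_K_M.gguf, among other quantizations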
Google Colaboratory Setup
Colab is a hosted Jupyter Notebook service that requires no setup and provides free access to computing resources, including GPUs and TPUs. After creating your .ipynb document, don't forget to go to “Runtime” -> “Change runtime type” and select “T4 GPU”. This gives your notebook access to a GPU.
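To confirm the notebook actually got a GPU, you can run this in a cell (nvidia-smi ships with Colab's GPU runtimes and prints the attached T4):
# Quick sanity check that the runtime has a GPU attached
!nvidia-smi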
First, create a new cell and build the llama.cpp dependency. Since we are working from a Jupyter Notebook, we will use the llama-cpp-python port, built with cuBLAS support so the model can run on the GPU.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
Letting that run will build all the tooling necessary for our code to use GPU acceleration.
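If you want to double-check that the build succeeded before moving on, a quick import in a new cell is enough (the version string you see will vary):
# Verify the llama-cpp-python build imports cleanly
import llama_cpp
print(llama_cpp.__version__)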
Now we install all the code dependencies in another cell:
!pip install fastapi[all] uvicorn python-multipart transformers pydantic tensorflow
Since the environment runs in a temporary session on a Google Colab server, we need Ngrok to tunnel the local address to a public URL if we want to reach the API from outside.
Head over to https://dashboard.ngrok.com/signup to create an account and get your API key.
Let's download and unzip Ngrok on our instance:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip -o ngrok-stable-linux-amd64.zip
We also have to add our API key to the server instance:
!./ngrok authtoken <YOUR_NGROK_API_TOKEN>
Creating the Server
For the server file we will use FastAPI as our HTTP library. The server downloads the model, lets llama-cpp-python load it onto the GPU, and exposes a /generate endpoint that calls our LLaMA model.
%%writefile app.py
from typing import Any

from fastapi import FastAPI
from fastapi import HTTPException
from pydantic import BaseModel
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import tensorflow as tf

# Quantized GGUF model so Llama-2-7B fits on a T4 GPU
GENERATIVE_AI_MODEL_REPO = "TheBloke/Llama-2-7B-GGUF"
GENERATIVE_AI_MODEL_FILE = "llama-2-7b.Q4_K_M.gguf"

# Download the model file from HuggingFace (cached after the first run)
model_path = hf_hub_download(
    repo_id=GENERATIVE_AI_MODEL_REPO,
    filename=GENERATIVE_AI_MODEL_FILE
)

# Load the model and offload layers to the GPU
llama2_model = Llama(
    model_path=model_path,
    n_gpu_layers=64,
    n_ctx=2000
)

# Test an inference
print(llama2_model(prompt="Hello ", max_tokens=1))

app = FastAPI()

# This defines the JSON format expected by the endpoint, change as needed
class TextInput(BaseModel):
    inputs: str
    parameters: dict[str, Any] | None = None

@app.get("/")
def status_gpu_check() -> dict[str, str]:
    gpu_msg = "Available" if tf.test.is_gpu_available() else "Unavailable"
    return {
        "status": "I am ALIVE!",
        "gpu": gpu_msg
    }

@app.post("/generate/")
async def generate_text(data: TextInput) -> dict[str, str]:
    # Log the incoming request (this ends up in server.log)
    print(data)
    try:
        params = data.parameters or {}
        response = llama2_model(prompt=data.inputs, **params)
        model_out = response['choices'][0]['text']
        return {"generated_text": model_out}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
If you would like to use a model other than LLaMA 2, all you need to change is “GENERATIVE_AI_MODEL_REPO” and “GENERATIVE_AI_MODEL_FILE”. “GENERATIVE_AI_MODEL_REPO” is the repo name exactly as it appears on HuggingFace. “GENERATIVE_AI_MODEL_FILE” is the specific quantized file you want to use, listed in the middle of the repo page; depending on your use case, I generally go for the recommended one labeled “medium, balanced quality”. An example swap is shown below.
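For example, to point the same code at a different GGUF model (here a Mistral 7B Instruct quant from TheBloke; the repo and filename below are an illustration, so confirm them on HuggingFace before using them), only these two lines in app.py change:
# Hypothetical swap: a different GGUF repo/file, everything else stays the same
GENERATIVE_AI_MODEL_REPO = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
GENERATIVE_AI_MODEL_FILE = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"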
This next cell starts the server and shows how long it has been waiting (up to a maximum of 10 minutes). The server's log output is written to server.log, which you can find under Colab's “Files” panel on the left-hand side:
# The server will start the model download and will take a while to start up
# ~5 minutes if it's not already downloaded
import subprocess
import time

from ipywidgets import HTML
from IPython.display import display

# Simple elapsed-time display while we wait for the server to respond
t = HTML(
    value="0 Seconds",
    description='Server is Starting Up... Elapsed Time:',
    style={'description_width': 'initial'},
)
display(t)

flag = True
timer = 0

# If the server isn't already running, launch uvicorn in the background,
# redirecting its output to server.log
try:
    subprocess.check_output(['curl', "localhost:8000"])
    flag = False
except:
    get_ipython().system_raw('uvicorn app:app --host 0.0.0.0 --port 8000 > server.log 2>&1 &')

# Poll the server once per second until it responds or 10 minutes pass
while flag and timer < 600:
    try:
        subprocess.check_output(['curl', "localhost:8000"])
    except:
        time.sleep(1)
        timer += 1
        t.value = str(timer) + " Seconds"
    else:
        flag = False

if timer >= 600:
    print("Error: timed out! took more than 10 minutes :(")

subprocess.check_output(['curl', "localhost:8000"])
Testing our API
To test our API, we expose the local server to the public internet using Ngrok and then hit the “/generate” endpoint.
This code starts the tunnel and prints the public API URL:
# This starts Ngrok and creates the public URL
import subprocess
import time
import json

from IPython import get_ipython

# Launch the Ngrok tunnel to port 8000 in the background
get_ipython().system_raw('./ngrok http 8000 &')
time.sleep(1)

# Ask the local Ngrok API for the public URL of the tunnel
curlOut = subprocess.check_output(
    ['curl', "http://localhost:4040/api/tunnels"],
    universal_newlines=True
)
time.sleep(1)
ngrokURL = json.loads(curlOut)['tunnels'][0]['public_url']

# Store the URL so other cells can retrieve it with %store -r
%store ngrokURL
print(ngrokURL)
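Before calling /generate, you can hit the root status endpoint we defined in app.py to make sure the tunnel and the GPU check both work. A minimal sketch for its own cell, assuming the Ngrok cell above has already stored ngrokURL:
# Quick health check against the public Ngrok URL
import requests
%store -r ngrokURL
print(requests.get(ngrokURL).json())  # expect {"status": "I am ALIVE!", "gpu": "Available"}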
Our API Driver code is here:
import requests

# Retrieve the public URL stored by the Ngrok cell
%store -r ngrokURL

# Define the data to send in the POST request
data = {
    "inputs": '''
Write an SQL Statement to find number of rows in a table
''',
    # Parameters are documented at https://abetlen.github.io/llama-cpp-python/#llama_cpp.llama.Llama.create_completion
    # temperature: higher values give more creative responses, lower values more precise ones
    # max_tokens: the maximum number of tokens (roughly "words") allowed to be generated
    "parameters": {"temperature": 0.1, "max_tokens": 200}
}

# Send the POST request to the /generate/ endpoint
response = requests.post(ngrokURL + "/generate/", json=data)

# Check the response
if response.status_code == 200:
    result = response.json()
    print("Generated Text:\n", data["inputs"], result["generated_text"].strip())
else:
    print("Request failed with status code:", response.status_code)
This uses the prompt we provided to call the API we built; we are free to change the input and the parameters.
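If you plan to call the endpoint more than once, a small helper can keep the driver cell tidy. This is a hypothetical convenience wrapper around the same request, assuming requests and ngrokURL are already available from the cells above:
# Hypothetical convenience wrapper around the /generate/ endpoint
def generate(prompt: str, **params) -> str:
    payload = {"inputs": prompt, "parameters": params or None}
    resp = requests.post(ngrokURL + "/generate/", json=payload)
    resp.raise_for_status()
    return resp.json()["generated_text"].strip()

print(generate("Write a haiku about GPUs", temperature=0.7, max_tokens=60))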
Closing Thoughts
This project was super fun to figure out because it's not every day that you can spin up an instance and host a model for your application to call. This article makes it super easy to do so. I have attached the notebook here.