Leverage Hugging Face’s TGI to Create Large Language Model (LLM) Inference APIs – Part 2

Introduction – Multiple LLM APIs

If you haven’t already, go back and read Part 1 of this series. In this guide we look at how you can serve multiple models on the same VM.

As you decide how to serve models as inference endpoints, you have a few options for managing the resources those models need. You could deploy one instance per model, or you could deploy a single instance that serves multiple models. The single instance would typically have multiple GPUs, which lets you pick the GPU each model is served on.

In this part of the series we look at how you can leverage larger VM instances and pick and choose which GPU each model is using.

Hugging Face Text Generation Inference

Tools needed

  • A Massed Compute Virtual Machine
    • Category: Bootstrap
    • Image: LLM API
  • Docker

Preparation

Managing multiple models can be confusing. There are more pieces of the puzzle to track and keep straight. A little preparation work can help make the setup steps run smoothly. Here is what you will need to do:

  1. Find your models of interest. In this example we will use the following:
    1. Hugging Face – Zephyr 7b
    2. Teknium – OpenHermes v2.5 Mistral 7b
  2. Determine how many GPUs you need to serve the various models. The list below is a rough guide for unquantized models, but you may have to experiment per model. In our case we are using a 2xA6000 VM because we have two 7b models to serve and each needs a single A6000.
    1. 7b models – 1 A6000
    2. 13b models – 1 or 2 A6000
    3. 34b models – 2 A6000
    4. 70b models – 4 A6000
    5. Mixture of Experts (MoE) models – 4 A6000
  3. Each model you want to serve as an endpoint will need a unique port. In this case we will assign the following ports:
    1. 8080 – Zephyr 7b
    2. 8081 – OpenHermes 7b
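The preparation steps above can be captured in a small plan that is checked before anything is launched. The following is a minimal sketch, not part of the original setup: the function names `print_plan` and `check_plan` are hypothetical, and the weight estimate assumes 16-bit weights (2 bytes per parameter), which leaves out the memory needed for the KV cache and activations.

```shell
#!/bin/sh
# Deployment plan, one line per model:
#   model id, size in billions of parameters, GPU index, host port.
# The values mirror the example used in this guide.
PLAN='HuggingFaceH4/zephyr-7b-beta 7 0 8080
teknium/OpenHermes-2.5-Mistral-7B 7 1 8081'

# Print each model with a rough 16-bit weight footprint
# (assumption: 2 bytes per parameter, so about 2 GB per billion parameters).
# Treat the number as a lower bound on the GPU memory required.
print_plan() {
  printf '%s\n' "$1" |
    awk '{ printf "%s gpu=%s port=%s weights~%dGB\n", $1, $3, $4, $2 * 2 }'
}

# Succeed only if no GPU index and no host port is assigned twice.
check_plan() {
  dup=$(printf '%s\n' "$1" |
    awk '{ print "gpu " $3; print "port " $4 }' | sort | uniq -d)
  if [ -n "$dup" ]; then
    printf 'duplicate assignment: %s\n' "$dup" >&2
    return 1
  fi
}

print_plan "$PLAN"
check_plan "$PLAN" && echo "plan ok"
```

The duplicate check only fits the one-model-per-GPU layout used here. A model that spans several GPUs would need a different plan format.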

Steps to Get Set Up

  1. Our Virtual Machines come pre-installed with Docker and have a Welcome doc. These steps differ slightly from what is in that welcome doc. Start by navigating to Hugging Face and finding a model of interest. On the model page, copy the model id, which we will use later.
  2. Now for some work in the terminal. Run these commands in order:
    • mkdir data
    • volume=$PWD/data
    • model=HuggingFaceH4/zephyr-7b-beta
  3. Then run the following to spin up a Docker container serving the Zephyr model:
    • docker run --gpus '"device=0"' --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
      • You may need to prefix the docker command with sudo.
  4. Next, let’s prepare and load the OpenHermes model. Open a new tab in the terminal and run the following in order:
    • volume=$PWD/data
    • model=teknium/OpenHermes-2.5-Mistral-7B
  5. Then run the following to spin up a Docker container serving the OpenHermes model:
    • docker run --gpus '"device=1"' --shm-size 1g -p 8081:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
      • You may need to prefix the docker command with sudo.
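The two launch commands differ only in GPU index, host port, and model id, so they can be generated from one helper. This is a sketch under a few assumptions that are not in the steps above: the `run` and `launch` function names are hypothetical, `-d` is added so each container runs in the background instead of needing its own terminal tab, and `--name` gives each container a label chosen here for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints the commands instead of
# running them. Set DRY_RUN=0 to actually start the containers.
DRY_RUN=${DRY_RUN:-1}
IMAGE=ghcr.io/huggingface/text-generation-inference:1.4
VOLUME=${VOLUME:-$PWD/data}

# run CMD...: print the command in dry-run mode, otherwise execute it.
# The printed form is for inspection only; shell quoting is not re-added.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# launch GPU_INDEX HOST_PORT MODEL_ID
launch() {
  # -d and --name are additions for this sketch (background run, readable
  # container name). The remaining flags match the commands in the steps.
  run docker run -d --name "tgi-gpu$1" \
    --gpus "\"device=$1\"" --shm-size 1g \
    -p "$2:80" -v "$VOLUME:/data" \
    "$IMAGE" --model-id "$3"
}

launch 0 8080 HuggingFaceH4/zephyr-7b-beta
launch 1 8081 teknium/OpenHermes-2.5-Mistral-7B
```

With the containers detached, `docker logs -f tgi-gpu0` follows the download and startup output for the first model. If Docker requires elevated privileges on your VM, run the script itself with sudo.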

Some important pieces in these commands:

  • --gpus '"device=0"' – exposes only GPU index 0 (the first GPU in the VM) to the container, so the model loads on that GPU alone.
  • --gpus '"device=1"' – exposes only GPU index 1 (the second GPU in the VM) to the container, so the model loads on that GPU alone.
  • -p – maps a port on the VM to port 80 inside the container (8080:80 and 8081:80 here). API requests go to the VM port.
  • --model-id – the model the container will load. We already set the $model variable above.

There are many more parameters that you can learn about in Hugging Face’s documentation for Text Generation Inference.

Once the model is fully downloaded and ready to use, you can then make API requests against it.
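Because each container has to download and load its model first, a script that sends requests right away will fail. One way to handle this is to poll the server until it answers. The sketch below assumes TGI's `/health` route is reachable on the mapped port. The function names `probe` and `wait_until_ready`, and the default of 60 attempts 10 seconds apart, are choices made for this example.

```shell
#!/bin/sh
# probe BASE_URL: succeed once the server answers its health route.
# -f makes curl return a non-zero status on HTTP errors.
probe() {
  curl -sf -o /dev/null "$1/health"
}

# wait_until_ready BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# Polls until probe succeeds or the attempts run out.
wait_until_ready() {
  attempts=${2:-60}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$1"; then
      echo "ready: $1"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out: $1" >&2
  return 1
}
```

For the two models in this guide you would call `wait_until_ready http://MASSED_COMPUTE_VM_IP:8080` and then `wait_until_ready http://MASSED_COMPUTE_VM_IP:8081` before sending the first prompt.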

Successful API request to a TGI inference endpoint

  1. To make an API request against this model you will need a few variables.

From the Model Card on Hugging Face

  • max_new_tokens
  • temperature
  • top_k
  • top_p

Sometimes these can be hard to find, but most model cards have this information available.

Information from other sources

  • Massed Compute VM IP – when you provision a VM you will be provided an IP address.

Pull it all together

Making a request to Zephyr:

curl -X POST \
MASSED_COMPUTE_VM_IP:8080/generate \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'

If you want to make streaming API requests, use the /generate_stream endpoint:

curl -X POST \
MASSED_COMPUTE_VM_IP:8080/generate_stream \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'

Making a request to OpenHermes:

curl -X POST \
MASSED_COMPUTE_VM_IP:8081/generate \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|im_start|>system\nYou are a friendly chatbot.<|im_end|>\n<|im_start|>user\nWhy is the sky blue?<|im_end|>\n<|im_start|>assistant\n","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'

If you want to make streaming API requests, use the /generate_stream endpoint:

curl -X POST \
  MASSED_COMPUTE_VM_IP:8081/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"<|im_start|>system\nYou are a friendly chatbot.<|im_end|>\n<|im_start|>user\nWhy is the sky blue?<|im_end|>\n<|im_start|>assistant\n","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'

  2. Integrate your project with these new inference endpoints.
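As a starting point for that integration, a thin wrapper can hide which port belongs to which model. This is a minimal sketch: the helper names `endpoint_for`, `build_body`, and `ask` are hypothetical, `HOST` stands in for your VM's IP address, and the sampling parameters are copied from the curl examples. The prompt is expected to be formatted for the target model and JSON-escaped already.

```shell
#!/bin/sh
# HOST is a placeholder for the IP address of your Massed Compute VM.
HOST=${HOST:-MASSED_COMPUTE_VM_IP}

# endpoint_for NAME: map a short model name to the port assigned in this guide.
endpoint_for() {
  case "$1" in
    zephyr)     echo "http://$HOST:8080" ;;
    openhermes) echo "http://$HOST:8081" ;;
    *)          echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# build_body PROMPT: wrap an already JSON-escaped prompt in a request body
# with the same sampling parameters as the curl examples.
build_body() {
  printf '{"inputs":"%s","parameters":{"do_sample":true,"max_new_tokens":256,"repetition_penalty":1.15,"temperature":0.7,"top_k":50,"top_p":0.95}}\n' "$1"
}

# ask NAME PROMPT: POST the prompt to the matching /generate route.
# curl reads the request body from stdin via -d @-.
ask() {
  url=$(endpoint_for "$1") || return 1
  build_body "$2" |
    curl -s -X POST "$url/generate" -H 'Content-Type: application/json' -d @-
}
```

A call then looks like `ask zephyr '<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>'`. Each model expects its own prompt template, so the caller supplies the fully formatted prompt.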

How can I see that my model is loaded?

A great question. There are some commands you can use to review the VM hardware and confirm your model is loaded. Here we use a slightly different command than in Part 1, nvidia-smi, which gives you the same information. It reports each GPU’s memory usage and utilization, along with the processes running on it.

Before a model is loaded, the GPU memory will be largely free.

nvidia-smi showing GPU memory before the models are loaded

After the models are loaded, each GPU will show most of its memory in use by the model it serves.

nvidia-smi showing multiple models loaded across GPUs
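For a check that fits in a script, nvidia-smi can emit the same memory figures as CSV. The sketch below is one way to turn that output into a one-line summary per GPU. The function name `gpu_memory_report` is hypothetical, and it assumes the comma-plus-space separator that nvidia-smi uses for CSV output.

```shell
#!/bin/sh
# gpu_memory_report: read lines of "index, name, used MiB, total MiB" on
# stdin and print the share of memory in use for each GPU.
gpu_memory_report() {
  awk -F', ' '{
    printf "GPU %s (%s): %d/%d MiB (%d%%)\n", $1, $2, $3, $4, ($3 * 100) / $4
  }'
}

# On the VM, feed it live data:
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_memory_report
```

Before the containers start, both GPUs should report a low percentage. Once both models are loaded, GPU 0 and GPU 1 should each report high usage.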

Conclusion

Leveraging Hugging Face’s Text Generation Inference, you can easily deploy full unquantized models in a matter of minutes. Depending on how you prefer to manage resources, you may want to run all your inference endpoints on a single resource. This guide shows how TGI Docker containers let you load multiple models across the GPUs of a larger VM.