Introduction – Multiple LLM APIs
If you haven’t already, go back and read Part 1 of this series. In this guide we take a look at how you can serve multiple models in the same VM.
As you start to decide how you want to serve models as an inference endpoint, you have a few options for managing the resources those models need. You could deploy an instance per model, or you could deploy a single instance that serves multiple models. The single instance would likely have multiple GPUs, which lets you pick which GPU each model is served on.
In this part of the series we look at how you can leverage larger VM instances and pick and choose which GPU each model is using.

Tools needed
- A Massed Compute Virtual Machine
  - Category: Bootstrap
  - Image: LLM API
- Docker
Preparation
Managing multiple models can be confusing. There are more pieces of the puzzle to track and keep straight. A little preparation work can help make the setup steps run smoothly. Here is what you will need to do:
- Find your models of interest. In this example we will use the following:
  - Zephyr 7b (HuggingFaceH4/zephyr-7b-beta)
  - OpenHermes 7b (teknium/OpenHermes-2.5-Mistral-7B)
- Determine how many GPUs you need to serve the various models. The following is a typical starting point, but you may have to experiment per model. In our case we are using a 2xA6000 VM because we want to serve two 7b models and each requires a single A6000.
  - 7b models – 1 A6000
  - 13b models – 1 or 2 A6000
  - 34b models – 2 A6000
  - 70b models – 4 A6000
  - Mixture of Experts (MoE) models – 4 A6000
- Each model you want to serve as an endpoint will need a unique port. In this case we will assign the following ports:
  - 8080 – Zephyr 7b
  - 8081 – OpenHermes 7b
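The preparation above can be written down as a small plan and checked before anything is launched. The following is only a sketch: the one-line-per-model layout and the `check_plan` helper are our own illustration, not part of the VM image, and the check only covers models that use a single GPU each.

```shell
# Deployment plan: one line per model as "model_id host_port gpu_index".
# The model ids, ports, and GPU indices are the ones used in this guide.
plan='HuggingFaceH4/zephyr-7b-beta 8080 0
teknium/OpenHermes-2.5-Mistral-7B 8081 1'

# Succeed only if no host port and no GPU index is assigned twice.
# (Hypothetical helper; assumes one GPU index per line.)
check_plan() {
  local dup_ports dup_gpus
  dup_ports=$(printf '%s\n' "$1" | awk '{print $2}' | sort | uniq -d)
  dup_gpus=$(printf '%s\n' "$1" | awk '{print $3}' | sort | uniq -d)
  [ -z "$dup_ports" ] && [ -z "$dup_gpus" ]
}

if check_plan "$plan"; then
  echo "plan ok"
else
  echo "plan has a duplicate port or GPU index" >&2
fi
```

A model that spans several GPUs would need a different check, since its GPU field would hold more than one index.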
Steps to Get Set Up
- Our Virtual Machines come pre-installed with Docker and include a Welcome doc. The steps here differ slightly from the ones in that doc. Start by navigating to Hugging Face and finding a model of interest. On the model page, copy the model id, which we will use later.
- Now for some work in the terminal. Run these commands in order:
mkdir data
volume=$PWD/data
model=HuggingFaceH4/zephyr-7b-beta
- Then run the following to spin up a Docker container serving the Zephyr model:
docker run --gpus '"device=0"' --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
  - You may have to run sudo in front of the docker command.
- Next, let’s prepare and load the OpenHermes model. Open a new tab in the terminal and run the following in order:
volume=$PWD/data
model=teknium/OpenHermes-2.5-Mistral-7B
- Then run the following to spin up a Docker container serving the OpenHermes model:
docker run --gpus '"device=1"' --shm-size 1g -p 8081:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
  - You may have to run sudo in front of the docker command.
Some important pieces in these commands:
- --gpus '"device=0"' – the container only sees GPU index 0, the first GPU in the VM, so the model is loaded on that GPU alone.
- --gpus '"device=1"' – the container only sees GPU index 1, the second GPU in the VM, so the model is loaded on that GPU alone.
- -p – maps a port on the VM (8080 or 8081 here) to port 80 inside the container. API requests go to the VM port to hit the model.
- --model-id – the model the container will load. We already set the $model variable above.
There are many more parameters that you can learn about in Hugging Face’s documentation for Text Generation Inference.
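Since the two docker commands differ only in model id, port, and GPU index, they can be generated from one template. Below is a minimal sketch, assuming the same image tag and data directory used in this guide; `tgi_command` is a hypothetical helper name, and it only prints the command so you can review it before running it yourself.

```shell
# Print the docker command for one model without running it.
# Arguments: model id, host port, GPU index.
# Assumes the data directory is ./data relative to the current directory
# and that the path contains no spaces.
tgi_command() {
  printf 'docker run --gpus %s --shm-size 1g -p %s:80 -v %s:/data %s --model-id %s\n' \
    "'\"device=$3\"'" "$2" "$PWD/data" \
    "ghcr.io/huggingface/text-generation-inference:1.4" "$1"
}

tgi_command HuggingFaceH4/zephyr-7b-beta 8080 0
tgi_command teknium/OpenHermes-2.5-Mistral-7B 8081 1
```

The two printed lines match the commands from the steps above, with the volume path expanded. Each one still needs its own terminal tab when run as written, because the container stays in the foreground.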
Once the model is fully downloaded and ready to use, you can then make API requests against it.
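Downloading the weights can take a while, so it may help to poll the container instead of guessing when it is ready. The sketch below assumes the container answers on TGI's /health route once the model is loaded; `wait_for_tgi` and its default retry values are our own illustration.

```shell
# Poll a TGI container until it answers, or give up.
# Arguments: host, port, max attempts (default 60),
#            seconds to wait between attempts (default 10).
wait_for_tgi() {
  local host=$1 port=$2 attempts=${3:-60} delay=${4:-10} i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each try.
    if curl -sf --max-time 5 "http://$host:$port/health" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only if another attempt is coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example, run on the VM itself:
# wait_for_tgi localhost 8080 && echo "Zephyr is ready"
# wait_for_tgi localhost 8081 && echo "OpenHermes is ready"
```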

- To make an API request against this model you will need a few variables.
From the Model Card on Hugging Face
- max_new_tokens
- temperature
- top_k
- top_p
Sometimes these can be hard to find, but most model cards have this information available.
Information from other sources
- Massed Compute VM IP – when you provision a VM you will be provided an IP address.
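The variables listed above end up in the JSON body of the request. As a rough sketch, the body can be assembled by a small helper so the values from the model card live in one place; `build_payload` is a hypothetical name, its defaults are the values used in the requests in this guide, and it assumes the prompt has already been JSON-escaped (for example, newlines written as \n).

```shell
# Build the JSON body for /generate or /generate_stream.
# Arguments: prompt (already JSON-escaped), max_new_tokens,
#            temperature, top_k, top_p. All but the prompt are optional.
build_payload() {
  printf '{"inputs":"%s","parameters":{"do_sample":true,"max_new_tokens":%s,"temperature":%s,"top_k":%s,"top_p":%s}}' \
    "$1" "${2:-256}" "${3:-0.7}" "${4:-50}" "${5:-0.95}"
}

# Example (replace MASSED_COMPUTE_VM_IP with your VM's address):
# curl -X POST MASSED_COMPUTE_VM_IP:8080/generate \
#   -H 'Content-Type: application/json' \
#   -d "$(build_payload 'Why is the sky blue?' 128)"
```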
Pull it all together
Making a request to Zephyr:
curl -X POST \
MASSED_COMPUTE_VM_IP:8080/generate \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'
If you want to make streaming API requests, use the /generate_stream endpoint:
curl -X POST \
MASSED_COMPUTE_VM_IP:8080/generate_stream \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'
Making a request to OpenHermes:
curl -X POST \
MASSED_COMPUTE_VM_IP:8081/generate \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'
If you want to make streaming API requests, use the /generate_stream endpoint:
curl -X POST \
MASSED_COMPUTE_VM_IP:8081/generate_stream \
-H 'Content-Type: application/json' \
-d '{"inputs":"<|system|>You are a friendly chatbot.\n<|user|>Why is the sky blue?\n<|assistant|>","parameters":{"do_sample": true, "max_new_tokens": 256, "repetition_penalty": 1.15, "temperature": 0.7, "top_k": 50, "top_p": 0.95, "best_of": 1}}'
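The requests above send the same prompt layout to both models, but each model was trained on its own chat template, and matching it usually gives better answers. The sketch below is based on our reading of the two model cards: Zephyr uses <|system|>, <|user|>, and <|assistant|> markers with </s> closing each turn, while OpenHermes 2.5 uses ChatML. Treat both layouts as assumptions and check the model card before relying on them; the function names are our own.

```shell
# Each function prints a JSON-escaped prompt (newlines as literal \n),
# ready to drop into the "inputs" field of a request body.
# Arguments: system message, user message (both already JSON-escaped).

# Zephyr layout, as we understand it from the model card.
zephyr_prompt() {
  printf '<|system|>\\n%s</s>\\n<|user|>\\n%s</s>\\n<|assistant|>\\n' "$1" "$2"
}

# ChatML layout, which the OpenHermes 2.5 model card describes.
chatml_prompt() {
  printf '<|im_start|>system\\n%s<|im_end|>\\n<|im_start|>user\\n%s<|im_end|>\\n<|im_start|>assistant\\n' "$1" "$2"
}

zephyr_prompt 'You are a friendly chatbot.' 'Why is the sky blue?'
echo
chatml_prompt 'You are a friendly chatbot.' 'Why is the sky blue?'
echo
```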
- Integrate your project with this new inference endpoint.
How can I see that my model is loaded?
A great question. There are commands you can use to inspect the VM hardware and confirm your model is loaded. This part uses a slightly different command than Part 1, but it gives you the same information: nvidia-smi. It reports each GPU's memory use, utilization, and the processes running on it.
Before a model is loaded, the GPU memory will be largely free.

After a model is loaded, the memory of the GPU it was assigned to will be mostly in use.
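If you would rather check this from a script than read the nvidia-smi table by eye, its CSV query mode can be filtered. This is a sketch only: `loaded_gpus` is a hypothetical helper, and the 1000 MiB default threshold is an arbitrary cut-off we picked to separate an idle GPU from one holding a model.

```shell
# Print the index of every GPU using more than a threshold of memory.
# Reads lines like "0, 14532, 49140" (index, used MiB, total MiB) on stdin.
# Argument: threshold in MiB (default 1000, an arbitrary choice).
loaded_gpus() {
  awk -F', *' -v min="${1:-1000}" '$2 + 0 > min { print $1 }'
}

# On the VM:
# nvidia-smi --query-gpu=index,memory.used,memory.total \
#   --format=csv,noheader,nounits | loaded_gpus
```

With both containers from this guide running, the expectation is that indices 0 and 1 are both listed.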

Conclusion
Leveraging Hugging Face’s Text Generation Inference, you can easily deploy full unquantized models in a matter of minutes. Depending on your preferred style to manage resources, you may find yourself wanting to manage all your inference endpoints in a single resource. This guide should help show how leveraging TGI Docker containers allows you to load multiple models across a larger VM.











