Impact of updated NVIDIA drivers on vLLM & HuggingFace TGI

If you are building a service that relies on LLM inference performance, you want to know how to get the most tokens per second. Many factors go into finding the right configuration to maximize tokens-per-second throughput. With the recent release of NVIDIA driver 560 and an updated version of vLLM, we tested how the new driver impacts inference performance for the HuggingFace TGI and vLLM inference engines.

Experimental Setup

To isolate the driver's impact on inference performance, we kept the GPU, model, and settings the same for every run. Here was the setup:

Instance: 1xL40

Model: meta-llama/Llama-3.1-8B

Benchmark Software: Native HuggingFace TGI & vLLM benchmarks

Test: Each configuration was run five times to calculate an average result.
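
Before collecting any numbers, it is worth confirming which driver is actually loaded on the instance. A quick nvidia-smi query does the job; nothing below is specific to our setup:

# Confirm the active driver version and GPU before and after the driver swap
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader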

Commands for Test

This model requires a HuggingFace token. You must first create a HuggingFace account and then request access to the Llama 3.1 8B model.
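
One convenient way to supply the token is to export it once in your shell (the value below is a placeholder) so the commands that follow can reference it, or to log in through the Hugging Face CLI:

# Export the token for the commands below (placeholder value)
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxx
# Or log in through the Hugging Face CLI
pip install -U "huggingface_hub[cli]"
huggingface-cli login --token "$HUGGING_FACE_HUB_TOKEN"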

# TGI Setup
docker run --gpus all --shm-size 10g -e HUGGING_FACE_HUB_TOKEN=TOKEN -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
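
Once the weights finish downloading and the server reports it is ready, a single request against TGI's /generate route is a quick sanity check before benchmarking. The prompt is arbitrary, and port 8080 matches the -p 8080:80 mapping above:

# Sanity-check the TGI server before benchmarking
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'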

To run the TGI benchmark you first have to exec into the running container:

# TGI Benchmark
docker exec -it CONTAINER_ID bash

# Run benchmark
text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3.1-8B-Instruct
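
The prefill and decode tables later in this post cover batch sizes 1 through 32. In the TGI builds we have used, the benchmark tool accepts repeated --batch-size flags if you want to pin those sizes explicitly; treat the flags as an assumption and confirm them with text-generation-benchmark --help on your version:

# Optionally pin the batch sizes reported in the results tables
text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --batch-size 1 --batch-size 2 --batch-size 4 \
    --batch-size 8 --batch-size 16 --batch-size 32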

 

For vLLM we elected to use Python instead of the Docker container because, in our experience, the Python benchmark script is easier to run.

# vLLM Benchmark
# Install venv
sudo apt update && sudo apt install python3.10-venv -y
# make venv for vllm 0.6.0
python3.10 -m venv venv-0.6.0
# activate venv
source venv-0.6.0/bin/activate
# Install vllm 0.6.0
pip install vllm==0.6.0
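
Before starting the server it is worth confirming the venv actually picked up the intended vLLM build; this only assumes the vllm package exposes __version__:

# Confirm the expected vLLM version is active inside the venv
python -c "import vllm; print(vllm.__version__)"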

# Make Endpoint
HUGGING_FACE_HUB_TOKEN=TOKEN python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct --port 8081 --enable-chunked-prefill \
    --disable-log-requests --max-num-batched-tokens 512 --max-model-len 8000
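
With the server up, the OpenAI-compatible routes can be exercised with curl before launching a full benchmark. The prompt below is arbitrary; /v1/models and /v1/completions are the routes served by vllm.entrypoints.openai.api_server:

# List the served model
curl http://127.0.0.1:8081/v1/models
# Send a single completion request to the same endpoint the benchmark will hit
curl http://127.0.0.1:8081/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'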

For the vLLM benchmark you also have to download some sample data.

# Setup for benchmark
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks/
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
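
A quick check that the download completed is to look at the file size and count the entries; this assumes the published file is a single JSON array, which it was when we pulled it:

# Verify the ShareGPT dataset downloaded correctly
ls -lh ShareGPT_V3_unfiltered_cleaned_split.json
python -c "import json; data = json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json')); print(len(data), 'conversations')"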

 

Once the data is downloaded, the testing is straightforward. In our vLLM setup above we used a venv, so make sure you activate it before running the benchmark.

# vLLM Benchmark
python ./benchmark_serving.py --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct --endpoint /v1/completions \
    --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 10.0 --host 127.0.0.1 --port 8081
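
Since each configuration was run five times, a small loop saves some typing. The --save-result and --result-filename flags exist in the benchmark_serving.py shipped with recent vLLM releases, but confirm them with python ./benchmark_serving.py --help on your checkout:

# Run the benchmark five times and keep each result as JSON
for i in 1 2 3 4 5; do
    python ./benchmark_serving.py --model meta-llama/Meta-Llama-3.1-8B-Instruct \
        --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct --endpoint /v1/completions \
        --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
        --request-rate 10.0 --host 127.0.0.1 --port 8081 \
        --save-result --result-filename "run_${i}.json"
done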

Raw Data Results

Since we ran each configuration five times, we summarize the key numbers in the text and include the per-test results in the tables below for you to review.

Drivers 550 HuggingFace TGI v2.2.0

Step            Batch   Test 1 (tok/sec)   Test 2 (tok/sec)   Test 3 (tok/sec)   Test 4 (tok/sec)   Test 5 (tok/sec)
Prefill         1       37.29              37.48              37.04              37.09              37.05
Prefill         2       72.00              72.13              70.73              72.68              71.95
Prefill         4       139.44             141.63             141.41             142.10             143.07
Prefill         8       257.49             255.52             254.80             261.37             257.15
Prefill         16      426.41             421.39             413.90             423.00             422.31
Prefill         32      577.57             573.21             572.76             578.97             579.30
Decode (Total)  1       41.35              41.42              41.29              41.35              41.42
Decode (Total)  2       79.56              79.27              79.41              79.59              79.42
Decode (Total)  4       156.88             156.72             157.11             157.30             157.09
Decode (Total)  8       307.80             306.52             308.11             307.74             307.44
Decode (Total)  16      592.43             593.01             572.42             593.17             590.94
Decode (Total)  32      1141.08            1141.45            1141.56            1144.43            1141.36

 

Drivers 560 HuggingFace TGI v2.2.0

Step            Batch   Test 1 (tok/sec)   Test 2 (tok/sec)   Test 3 (tok/sec)   Test 4 (tok/sec)   Test 5 (tok/sec)
Prefill         1       38.82              38.82              37.26              36.94              37.24
Prefill         2       72.32              72.47              72.92              72.93              73.29
Prefill         4       141.16             141.70             140.80             142.63             141.81
Prefill         8       257.74             256.40             258.38             257.93             256.04
Prefill         16      422.23             420.13             423.19             417.03             428.96
Prefill         32      574.73             576.95             567.79             572.52             577.00
Decode (Total)  1       41.42              41.24              41.43              41.39              41.40
Decode (Total)  2       79.56              79.66              79.83              79.60              79.92
Decode (Total)  4       156.85             157.04             157.06             157.43             156.75
Decode (Total)  8       306.70             307.77             307.40             307.93             307.97
Decode (Total)  16      593.81             592.58             593.54             591.21             595.16
Decode (Total)  32      1141.28            1149.73            1139.41            1137.00            1148.25

To compare with vLLM later, we will look at the average Decode (Total) throughput at batch size 32.

550 Drivers: 1141.976 tokens/sec

560 Drivers: 1143.124 tokens/sec

A 0.1% change, so the driver update alone does little for TGI throughput.
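
For reference, every percentage quoted in this post is just the relative difference between the two driver averages; a one-line awk calculation reproduces the number above:

# Relative change from the 550-driver average to the 560-driver average
awk 'BEGIN { printf "%.2f%%\n", (1143.124 - 1141.976) / 1141.976 * 100 }'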

Drivers 550 vLLM v0.5.5

Metric                            Test 1     Test 2     Test 3     Test 4     Test 5
Output throughput (tok/sec)       1539.11    1529.79    1539.63    1529.50    1547.55
Mean time to first token (ms)     217.55     220.10     211.00     216.23     201.56
Mean time per output token (ms)   84.19      87.19      85.06      85.74      81.83

 

Drivers 560 vLLM v0.5.5

Metric                            Test 1     Test 2     Test 3     Test 4     Test 5
Output throughput (tok/sec)       1571.48    1566.79    1568.16    1571.81    1572.17
Mean time to first token (ms)     179.83     180.90     179.92     172.83     180.04
Mean time per output token (ms)   75.55      75.70      75.54      73.98      75.58

Again, comparing the same vLLM version on different NVIDIA drivers shows a minor increase in performance:

550 Drivers: 1537.116 tokens/sec

560 Drivers: 1570.082 tokens/sec

A 2.14% change.

 

Drivers 550 vLLM v0.6.0

Metric                            Test 1     Test 2     Test 3     Test 4     Test 5
Output throughput (tok/sec)       1636.22    1642.73    1640.97    1641.22    1638.04
Mean time to first token (ms)     213.35     202.93     204.23     204.07     201.70
Mean time per output token (ms)   56.68      55.12      55.47      55.45      55.02

 

Drivers 560 vLLM v0.6.0

Metric                            Test 1     Test 2     Test 3     Test 4     Test 5
Output throughput (tok/sec)       1642.97    1644.65    1642.63    1632.43    1638.45
Mean time to first token (ms)     197.31     197.54     179.92     172.83     180.04
Mean time per output token (ms)   54.29      54.16      54.27      55.77      55.77

Another case of a marginal tokens-per-second increase when changing drivers:

550 Drivers: 1639.836 tokens/sec

560 Drivers: 1640.226 tokens/sec

A 0.02% change.

Conclusion

The quick conclusion is that the NVIDIA driver update has little to no impact on inference performance: a 2.14% gain at best, and well under 1% in the other cases. However, moving from HuggingFace TGI to vLLM and upgrading vLLM versions does impact speed. Going from vLLM v0.5.5 to v0.6.0 yields roughly a 4.5% increase in tokens per second on the 560 drivers and about 6.7% on the 550 drivers.

If you like in-depth articles like this, please reach out with ideas for other research topics you would like to see. If you are in the market to rent GPUs, please consider renting from us. Sign Up and use coupon code `MassedComputeResearch` for 15% off any A6000 rental.