Llama 3.1 Benchmark Across Various GPU Types

Figure: Generated from our Art VM Image using Invoke AI

Previously we benchmarked Llama 3 across various GPU types. We are returning to run the same tests on the new Llama 3.1 LLM. On July 23, 2024, the AI community welcomed the release of the Llama 3.1 405B, 70B, and 8B models, the next generation of the Llama 3 family. Below is a video of Mark Zuckerberg discussing the release of Llama 3.1 and other AI topics.

A note before going further: the structure of this post closely mirrors our Llama 3 70B benchmark. All of the data provided below is for Llama 3.1 70B.

The GPUs Tested

Before diving into the results, let’s briefly overview the GPUs we tested:

  • NVIDIA A30: Professional-grade graphics card designed for data centers and AI applications, offering high-performance computing, advanced memory, and energy efficiency.
  • NVIDIA A6000: Known for its high memory bandwidth and compute capabilities, widely used in professional graphics and AI workloads.
  • NVIDIA L40: Designed for enterprise AI and data analytics, offering balanced performance.
  • NVIDIA A100 SXM4: A data-center flagship GPU, optimized for maximum performance in the SXM4 form factor.
  • NVIDIA H100 PCIe: NVIDIA's Hopper-generation GPU, boasting improved performance and efficiency, tailored for AI workloads.

Benchmarking Methodology

There are many engines and techniques we could have used to judge performance across the various GPUs. We decided to leverage the Hugging Face Text Generation Inference (TGI) engine as the primary way to serve Llama 3.1 70B, for one primary reason: the benchmarking functionality built into TGI is incredibly easy to use.

The benchmark tool provided with TGI allows us to look across batch sizes, prefill, and decode steps. It is a fantastic way to view average, min, and max tokens per second, as well as p50, p90, and p99 results. If you want to learn more about how to conduct benchmarks via TGI, reach out; we would be happy to help.
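To make the percentile metrics concrete, here is a minimal sketch (plain Python, not the TGI tool itself) of how p50/p90/p99 summaries are derived from a set of per-run measurements. The throughput numbers are hypothetical, purely for illustration:

```python
# Minimal sketch of the summary statistics a benchmark run reports.
# The throughput numbers below are hypothetical, for illustration only.

def percentile(values, pct):
    """Nearest-rank percentile of a list of measurements."""
    ordered = sorted(values)
    # Nearest-rank method: ceil(pct/100 * n), converted to a 1-based rank.
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[int(rank) - 1]

def summarize(tokens_per_second):
    """Avg/min/max plus p50, p90, p99 over a list of throughput samples."""
    return {
        "avg": sum(tokens_per_second) / len(tokens_per_second),
        "min": min(tokens_per_second),
        "max": max(tokens_per_second),
        "p50": percentile(tokens_per_second, 50),
        "p90": percentile(tokens_per_second, 90),
        "p99": percentile(tokens_per_second, 99),
    }

# Hypothetical decode throughputs (tokens/s) across benchmark runs.
runs = [41.2, 39.8, 40.5, 38.9, 42.1, 40.0, 39.5, 41.7, 40.3, 37.6]
print(summarize(runs))
```

The p50 (median) tells you typical throughput, while p90 and p99 reveal how badly the slowest runs degrade, which matters for tail latency on a production endpoint.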

Results

A30

Figure: Benchmark on 8xA30

RTX A6000

Figure: Benchmark on 4xA6000

L40

Figure: Benchmark on 4xL40

A100 SXM4

Figure: Benchmark on 2xA100

H100 PCIe

Figure: Benchmark on 2xH100

Conclusion

Hugging Face TGI provides a consistent mechanism to benchmark across multiple GPU types. Based on these results, we could also calculate the most cost-effective GPU for running a Llama 3.1 inference endpoint. Understanding these nuances can help you make informed decisions when deploying Llama 3.1 70B, ensuring you get the best performance and value for your investment.
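One way to make that cost comparison concrete is to combine each configuration's measured throughput with its hourly rental price, yielding a cost per million generated tokens. The sketch below uses made-up throughput and price figures purely to show the arithmetic; substitute your own benchmark results and marketplace rates:

```python
# Sketch: cost per million generated tokens for each GPU configuration.
# Throughputs and hourly prices below are placeholders, not measured results.

def cost_per_million_tokens(tokens_per_second, dollars_per_hour):
    """Dollars to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

# Hypothetical configurations: (name, tokens/s, $/hour).
configs = [
    ("8xA30", 30.0, 4.0),
    ("2xA100 SXM4", 45.0, 5.0),
    ("2xH100 PCIe", 60.0, 7.0),
]

# Rank configurations from cheapest to most expensive per token.
ranked = sorted(configs, key=lambda c: cost_per_million_tokens(c[1], c[2]))
for name, tps, price in ranked:
    print(f"{name}: ${cost_per_million_tokens(tps, price):.2f} per 1M tokens")
```

Note that the fastest GPU is not automatically the cheapest per token; a slower card at a much lower hourly rate can win on cost efficiency.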

If you want to replicate these results, check out our marketplace to rent a GPU Virtual Machine. Sign up, and within a few minutes you can have a working VM to test these results.