With the exciting launch of Meta’s Llama 3 LLM, we were curious which application would be the best for serving Llama 3 as an inference endpoint. The best, in this case, is simply the application with the highest tokens/sec rate. There is a whole range of tools and solutions for creating inference endpoints, so we focused our test on a few that our users use the most:
- Ollama
- Text Generation WebUI
- LM Studio
Considerations
One important factor in this research was making sure we used the same model across all of the tools, or got as close to the same model as possible. Two tools on the list affected this consideration: Ollama and LM Studio.
Out of the box, Ollama serves a 4-bit quantized version of Llama 3 70B. Quantization is a technique that converts the numbers a model stores from a higher precision (like 32-bit floating point) to a lower precision (like 4-bit integers), trading a small amount of accuracy for a large reduction in memory and compute.
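To make the idea concrete, here is a toy sketch of symmetric 4-bit quantization in NumPy. This is only an illustration of the concept; the Q4 formats used by llama.cpp and Ollama use block-wise scales and more careful packing than shown here.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Toy symmetric 4-bit quantization of a block of float32 weights.

    Illustration only -- real GGUF Q4 formats are more sophisticated.
    """
    scale = np.abs(weights).max() / 7.0        # 4-bit signed range is -8..7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct approximate float weights; some precision is lost.
    return q.astype(np.float32) * scale

w = np.random.randn(32).astype(np.float32)     # one 32-value block of weights
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
print("max reconstruction error:", np.abs(w - w_hat).max())
```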
For LM Studio, only GGUF quantized versions were available, so we used the LM Studio Community version when testing both LM Studio and Text Generation WebUI. For Ollama, we used the 70B version it provides out of the box.
Testing Scenario
For this test, we used a single RTX A6000 from our Virtual Machine marketplace. The A6000 has 48GB of vRAM, and the 4-bit quantized models we loaded onto the GPU weigh in at roughly 40-42GB. You could also use an L40, L40S, A6000 Ada, or even an A100 or H100, but we chose to keep it simple with the standard RTX A6000.
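As a rough back-of-the-envelope check (our own estimate, not a vendor figure), that 40-42GB footprint is about what you would expect for a 4-bit 70B model once per-block scale overhead is included, before counting the KV cache and runtime buffers:

```python
# Rough vRAM estimate for a 4-bit quantized 70B model (our own ballpark numbers).
params = 70e9
bits_per_param = 4.5   # ~4 bits per weight plus per-block scale/metadata overhead
weights_gb = params * bits_per_param / 8 / 1e9
print(f"~{weights_gb:.0f} GB for the weights alone")   # ~39 GB before KV cache and buffers
```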
For each tool, we started with a fresh Virtual Machine running the base Ubuntu OS and installed only the tool under test. Once it was installed, we loaded the appropriate model and interacted with it via API to keep the workflow as similar as possible across the options.
For each tool, we made five requests with the same prompt: “Why is the sky blue?” We weren’t judging the quality of the responses; we only recorded the tokens/sec each request generated, then averaged the five results to get a final number for each solution.
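For reference, the Ollama leg of the test looks roughly like the sketch below. The endpoint and the response fields (`eval_count`, `eval_duration`) come from Ollama’s REST API, and the model tag assumes Ollama’s default `llama3:70b`; treat this as a sketch of the approach rather than our exact harness.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
PROMPT = "Why is the sky blue?"
RUNS = 5

speeds = []
for _ in range(RUNS):
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3:70b", "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    tokens_per_sec = resp["eval_count"] / resp["eval_duration"] * 1e9
    speeds.append(tokens_per_sec)
    print(f"{tokens_per_sec:.2f} tokens/s")

print(f"Average: {sum(speeds) / len(speeds):.3f} tokens/s")
```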
Results
Ollama
Individual runs (tokens/sec)
- 14.58 tokens/s
- 14.53 tokens/s
- 14.51 tokens/s
- 14.46 tokens/s
- 14.50 tokens/s
Average – 14.516 tokens/sec
Text Generation WebUI
Individual runs (tokens/sec)
- 9.45 tokens/s
- 11.22 tokens/s
- 7.49 tokens/s
- 9.45 tokens/s
- 7.84 tokens/s
Average – 9.09 tokens/sec
LM Studio
Individual runs (tokens/sec)
- 13.43 tokens/s
- 13.49 tokens/s
- 13.43 tokens/s
- 13.62 tokens/s
- 13.56 tokens/s
Average – 13.506 tokens/sec
Based on these results, Ollama comes out slightly ahead of LM Studio. We should mention that Ollama was by far the easiest to set up and start working with right away. LM Studio wasn’t complicated either; we just had to remember to adjust its settings so the compute was offloaded to the GPU. Text Generation Web UI was the most complex: on our first attempt, we averaged only 5.05 tokens/sec.
Both LM Studio and Text Generation Web UI expose additional settings for tuning model performance. For this test we used the same settings in both tools; they correspond to the maximum values LM Studio allows. Text Generation Web UI could likely be pushed further with these settings, but for the sake of a fair comparison we kept them consistent. The values we used are listed below, along with a sketch of how they map onto llama.cpp’s loader parameters.
Settings:
- n_gpu_layers: 80
- n_batch: 512
- n_ctx: 2048
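These are standard llama.cpp loader parameters, so they translate directly if you want to reproduce the setup outside of either UI. A minimal sketch using the llama-cpp-python bindings follows; the GGUF file path is just a placeholder, and both UIs ship their own llama.cpp builds rather than this library.

```python
from llama_cpp import Llama

# Same loader settings we used in LM Studio and Text Generation Web UI.
# The model path is a placeholder for wherever you downloaded the GGUF file.
llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=80,   # offload all transformer layers to the GPU
    n_batch=512,       # prompt-processing batch size
    n_ctx=2048,        # context window
)

out = llm("Why is the sky blue?", max_tokens=256)
print(out["choices"][0]["text"])
```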
Conclusion
In just a few short hours, we were able to test several products on the market and see which is best for standing up a quick inference endpoint. In the future, we are looking at testing other inference engines such as vLLM and Hugging Face’s Text Generation Inference. Those engines would let us serve the full unquantized version of the model, but they will require more GPU power.
If you want to try to replicate these results, check out our marketplace to rent a GPU Virtual Machine. Use the coupon code MassedComputeResearch for 15% off any A6000 or A5000 rental.