High-throughput serving of Llama 3.1 on an A100 with vLLM or llama.cpp

Hi all,
We’re excited to be part of the Inception program. We’ve been trying to speed up Llama 3.1 serving on an A100 and would love some advice on common serving parameters, quantization options, etc. to get the best response times. Can anyone advise?
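
For context, here is roughly the kind of vLLM launch we had in mind on a single 80 GB A100 (a minimal sketch assuming a recent vLLM release with the vllm serve entrypoint; the flag values are only illustrative starting points, not tuned recommendations):

# Illustrative vLLM launch for Llama 3.1 8B Instruct on one A100
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256
# A pre-quantized AWQ or GPTQ checkpoint could be served the same way by
# pointing at that checkpoint and adding --quantization awq (or gptq).

From what we’ve read, raising --max-num-seqs mainly helps batch throughput, sometimes at the cost of per-request latency, so any guidance on a sensible balance would be appreciated.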

thanks!

Hi @mark368 - welcome to the Inception program, and thanks for your patience while waiting for a reply here. I’ve reached out to the right team to get an answer and we will get back to you ASAP!

Hi @mark368, take a look at the Model Profiles page in the NVIDIA NIM for Large Language Models (LLMs) docs; you can select a NIM profile for your model that targets throughput optimisation.

Run docker run --rm --runtime=nvidia --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles to see the profiles available for your image, then select one appropriately!
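
Once you have picked a profile ID from that output, you can pin the container to it when you start serving. A minimal sketch, assuming the NIM_MODEL_PROFILE environment variable and default port 8000 described in the NIM for LLMs docs (substitute a real profile ID from the list above):

# Start the NIM container pinned to a specific model profile
docker run -it --rm --runtime=nvidia --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile_id_from_list_above> \
  -p 8000:8000 \
  $IMG_NAME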