High-throughput serving of Llama 3.1 on an A100 with vLLM or llama.cpp

Hi all,
We’re excited to be part of the Inception program. We’ve been trying to speed up Llama 3.1 serving on an A100 and would love some advice on common serving parameters, quantization options, etc. to get the best response times. Can anyone advise?
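
For context, here is roughly the kind of vLLM launch we had in mind on a single 80 GB A100 (a minimal sketch assuming a recent vLLM release with the vllm serve entrypoint; the flag values are only illustrative starting points, not tuned recommendations):

# Illustrative vLLM launch for Llama 3.1 8B Instruct on one A100
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256
# A pre-quantized AWQ or GPTQ checkpoint could be served the same way by
# pointing at that checkpoint and adding --quantization awq (or gptq).

From what we’ve read, raising --max-num-seqs mainly helps batch throughput, sometimes at the cost of per-request latency, so any guidance on a sensible balance would be appreciated.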

thanks!

Hi @mark368 - welcome to the Inception program, and thanks for your patience while waiting for a reply here. I’ve reached out to the right team to get an answer and we will get back to you ASAP!

Hi @mark368, take a look at the Model Profiles page in the NVIDIA NIM for Large Language Models (LLMs) docs; you can select a NIM profile for your model that targets throughput optimisation.

Run docker run --rm --runtime=nvidia --gpus=all -e NGC_API_KEY=$NGC_API_KEY $IMG_NAME list-model-profiles to see the profiles available for your image, then select one appropriately!
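
Once you have picked a profile ID from that output, you can pin the container to it when you start serving. A minimal sketch, assuming the NIM_MODEL_PROFILE environment variable and default port 8000 described in the NIM for LLMs docs (substitute a real profile ID from the list above):

# Start the NIM container pinned to a specific model profile
docker run -it --rm --runtime=nvidia --gpus=all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile_id_from_list_above> \
  -p 8000:8000 \
  $IMG_NAME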