Supercharging Llama 3.1 across NVIDIA Platforms

Originally published at: Supercharging Llama 3.1 across NVIDIA Platforms | NVIDIA Technical Blog

Meta’s Llama collection of large language models is the most popular set of foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are building derivative models and integrating them into their applications. With Llama 3.1, Meta is launching a suite of large language models (LLMs) as well as…

I’ve tried running Llama 3.1 with tensor parallelism, but it seems like the functionality is currently broken on Triton + tensorrtllm_backend?

Have you tried following the suggestion in this thread? Disabling TRT overlap and using the same batch size between the Triton config and the engine build?
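
For reference, this is roughly what I mean by keeping the two in sync (a rough sketch; the trtllm-build flags and the template parameter names in tensorrtllm_backend may differ between releases):

```
# Build the engine with an explicit max batch size (example value).
trtllm-build --checkpoint_dir ./llama31_ckpt \
    --output_dir ./llama31_engine \
    --max_batch_size 8

# Fill the Triton model config with the same value so the two stay in sync.
# Template parameter names follow the tensorrtllm_backend repo and may vary by release.
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:8,engine_dir:./llama31_engine,batching_strategy:inflight_fused_batching
```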

I’ve tried using the same batch size, but it made no difference. I’ve updated the thread: Unable to launch triton server with TP · Issue #577 · triton-inference-server/tensorrtllm_backend · GitHub

TRT overlap is already disabled in the latest tensorrt_llm packages, too (I’m building from source to support Llama 3.1).

The step that worked for most people was to disable custom_all_reduce, but that option is no longer available in the latest trtllm-build CLI.
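
For anyone checking their own install, this is how I confirmed what the current CLI still exposes (assuming trtllm-build is on PATH):

```
# Check which all-reduce / overlap related options the installed CLI still exposes.
trtllm-build --help | grep -iE "all_reduce|overlap|reduce_fusion"
```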

@anjshah, do you have any recommendations? Judging by the GitHub issues, this seems to be a problem quite a few people are running into.

Hi @dhruv13 - let me try to reproduce the issue on my end and I’ll get back to you!


Thank you, @anjshah! It probably doesn’t matter if you’re using the latest trtllm package, but if you’re building from source, you could use this commit: GitHub - NVIDIA/TensorRT-LLM at 74b324f6673d1d8a836e05e506dea2234b22ccc8
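
For anyone reproducing this, here is roughly how I pin my source build to that commit (follow the official build-from-source instructions for your platform afterwards):

```
# Clone TensorRT-LLM and pin it to the commit linked above.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout 74b324f6673d1d8a836e05e506dea2234b22ccc8
git submodule update --init --recursive
git lfs install && git lfs pull
```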

Hi @dhruv13 - Can you try with v0.12, as shared in this thread? TensorRT-LLM v0.12.0 was just released today and introduced many build-command changes. The updated steps are here. Please keep us posted with log file details if you are still encountering issues.
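
In case it helps others following along, here is a rough sketch of the v0.12-style flow for a TP build (script names and flags follow examples/llama in the TensorRT-LLM repo, the values are examples, and --reduce_fusion is simply left at its default):

```
# Convert the Hugging Face checkpoint with tensor parallelism (example: TP=4).
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3.1-8B-Instruct \
    --output_dir ./ckpt_tp4 \
    --dtype float16 \
    --tp_size 4

# Build the engine; --reduce_fusion is not passed, so it keeps its default (disabled).
trtllm-build --checkpoint_dir ./ckpt_tp4 \
    --output_dir ./engine_tp4 \
    --gemm_plugin auto \
    --max_batch_size 8
```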

Sure. I was on a 0.13 dev build, but I’ll try the official 0.12 release now and get back here.
Thanks!

I tried the latest Triton server image, but to no avail. I’ve pasted the logs on GitHub (same link as earlier).
Thanks!

Which Triton server image did you use?

It’s nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
Thanks!
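
For context, this is roughly how I launch it (paths are placeholders, and the world size matches the tp_size used at build time):

```
# Start the 24.08 TRT-LLM Triton image with the backend repo and engines mounted.
docker run --rm -it --gpus all --net host --shm-size=2g \
    -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
    -v $(pwd)/engine_tp4:/engines/engine_tp4 \
    nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 bash

# Inside the container: one MPI rank per TP shard (world_size must match tp_size).
python3 /tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 4 \
    --model_repo /tensorrtllm_backend/all_models/inflight_batcher_llm
```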

Can you use the officially released v0.12.0 of TensorRT-LLM rather than the dev version? And can you try without enabling --reduce_fusion? It’s disabled by default, so please follow the steps here.
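
As a quick sanity check (assuming your engine directory’s config.json records plugin settings, as recent releases do), you can confirm that reduce_fusion stayed disabled in the engine you are serving:

```
# Rough check that reduce_fusion stayed at its default in the built engine;
# the exact key and its location in config.json can vary between releases.
grep -i "reduce_fusion" ./engine_tp4/config.json
```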

Hi @anjshah, can you recommend which official Docker image to use for the Triton server with the TensorRT-LLM backend?

Note that, as per this comment from Kris Hung (NVIDIA), the versions I’m using are the officially supported versions for the latest Triton server with the TensorRT-LLM backend.
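
In case it helps with the repro, this is how I confirm which TensorRT-LLM build the container actually ships:

```
# Inside the tritonserver:24.08-trtllm-python-py3 container:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```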