Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Hi @richard.forman - The config files to modify are in the ‘all_models/inflight_batcher_llm’ directory, in the subfolders (preprocessing, postprocessing, tensorrt_llm_bls, ensemble, tensorrt_llm). The fill_template.py script modifies these config files for you. These are the lines of code described in the blog post:

# Set the tokenizer_dir and engine_dir paths
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
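Once these config files are filled in, the Triton server can be launched from the tensorrtllm_backend repo. A minimal sketch, assuming a single-GPU engine and the model repository layout used above (check scripts/launch_triton_server.py --help for your version):

# Launch Triton with the filled-in model repository (assumes a 1-GPU engine)
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm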

Hi @anjshah,

Thanks for clearing that up for me. I didn’t realise that these lines:

HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1

were meant to set environment variables. I thought they were variable definitions that needed to be added to some config file.


Hi,

Do you have any results showing whether the reported Llama 3 performance on evaluation benchmarks is reproduced, at least to some degree, by Llama 3 deployed with TensorRT-LLM?

Hi @anjshah,
Thank you for all of the help. I was able to successfully send the API call and got the proper response.

I do have some performance-related questions, though. I currently have this deployed on an AWS EC2 g5.8xlarge instance. What kind of performance can I expect? How many clients can send requests at a time? For heavy traffic, will we need to set this up as an autoscaling group behind a load balancer?


Hi @richard.forman - The Triton Inference Server backend is already set up to use TensorRT-LLM’s batch manager to handle multiple concurrent requests without affecting latency. See an example of sending multiple simultaneous requests to a deployed model here.
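For a quick way to exercise concurrent handling from the shell, you can also fire several requests at the generate endpoint in parallel. A minimal sketch, assuming the default HTTP port 8000 and the ensemble model’s generate endpoint used in the blog post:

# Send 8 concurrent requests to the deployed ensemble model, then wait for all of them
for i in $(seq 1 8); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}' &
done
wait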

Hi @matatov.n - we’ve published some perf numbers here, and we’ll soon be adding Llama 3 and other newly added models’ perf numbers to this overview page. Meanwhile, if you would like to take a stab at it, please follow these instructions to reproduce the benchmark results using the TRT engine you built for Llama 3.


Thank you for your answer. I am also interested in benchmarking the accuracy of models compiled with TensorRT-LLM on the MMLU task, for example, or on other popular evaluation tasks.

Hi everyone,

I have a question: are token streaming, paged attention, and KV cache available just after compilation with TensorRT-LLM, or do they require Triton Inference Server? What if I just deploy the compiled model to GKE (after dockerization)?
Thanks

Hi @matatov.n - yes, we have an MMLU script (mmlu.py) that you can use for this purpose.
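If it helps, here is a minimal sketch of how that script can be invoked; the flag names are assumptions based on the TensorRT-LLM examples, so verify them with --help for your version:

# Evaluate the built TRT engine on MMLU (flag names assumed; check --help)
python3 examples/mmlu.py --hf_model_dir ${HF_LLAMA_MODEL} --engine_dir ${ENGINE_PATH} --test_trt_llm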

No, you don’t need Triton Inference Server to use these TensorRT-LLM features. You can follow this example with the run script.
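For reference, a minimal sketch of running the compiled engine directly with the example run script; the flag names are assumptions based on the TensorRT-LLM examples, and --streaming prints tokens as they are generated:

# Run the engine without Triton; paged attention and KV cache behaviour are part of the built engine
python3 examples/run.py --engine_dir ${ENGINE_PATH} --tokenizer_dir ${HF_LLAMA_MODEL} --max_output_len 64 --streaming --input_text "What is machine learning?"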


Hi @anjshah ,

Do you have any ready results showing that using TensorRT-LLM maintains the original (i.e., Llama 3) results?
I am afraid running the evaluations ourselves could be very expensive for us.

Yes, we do, and we plan to post them early next week.


Hi,

I am searching for a Dockerfile that, given the rank0.engine and config.json files, would expose an inference server. Any link for reference?

Thanks

Hi @anjshah,

Do you have a post with the evaluations?

Hi @matatov.n - if you mean the latest Llama 3 perf evaluations, yes, we’ve updated those on the perf-overview page of the GitHub docs. If you mean the MMLU evaluations, we have provided the mmlu.py script, which you can use as shown here.