Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Hi @richard.forman - The config files to modify are in the ‘all_models/inflight_batcher_llm’ directory, in the preprocessing, postprocessing, tensorrt_llm_bls, ensemble, and tensorrt_llm subfolders. The fill_template.py script modifies these config files for you. These are the following lines of code, as described in the blog post:

#Set the tokenizer_dir and engine_dir paths
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
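After these substitutions, the next step is to launch the server against that model repo. A minimal sketch, assuming the launch script shipped with tensorrtllm_backend and a single-GPU (world_size 1) Llama 3 8B engine:

# Launch Triton Inference Server with the filled-in model repo (adjust --world_size to match your tensor-parallel size)
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm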

Hi @anjshah,

Thanks for clearing that up for me. I didn’t realise that those lines were meant to set environment variables. I thought they were variable definitions that needed to be added to some config file.

Hi,

Do you have any results showing whether the Llama 3 performance reported on evaluation benchmarks is reproduced, at least to some degree, when Llama 3 is deployed with TensorRT-LLM?

Hi @anjshah,
Thank you for all of the help. I was able to successfully send the API call and got the proper response.

I do have some performance-related questions though. I have this currently deployed on an AWS EC2 g5.8xlarge instance. What kind of performance can I expect? How many clients can send requests at a time? For heavy traffic, will we need to set this up as an autoscaling group behind a load balancer?

Hi @richard.forman - The Triton Inference Server backend is already set up to use TensorRT-LLM’s batch manager to handle multiple concurrent requests without affecting latency. See an example of sending multiple simultaneous requests to the deployed model here.
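For a quick local sanity check of concurrency, something like the sketch below fires several requests at the ensemble model’s generate endpoint in parallel. It assumes Triton’s default HTTP port 8000 and the input names used by the ensemble in this setup (text_input, max_tokens, bad_words, stop_words):

# Send 8 concurrent requests to the deployed ensemble (assumes the HTTP endpoint on localhost:8000)
for i in $(seq 1 8); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}' &
done
wait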

Hi @matatov.n - we’ve published some perf numbers here, and we’ll soon be adding Llama 3 and other newly added models’ perf numbers to this overview page. Meanwhile, if you would like to take a stab at it, please follow these instructions to reproduce the benchmark results using the TRT engine you built for Llama 3.

Thank you for your answer. I am also interested in benchmarking the accuracy of models compiled with TensorRT-LLM on the MMLU task, for example, or on other popular evaluation tasks.

Hi everyone,

I have a question: are token streaming, paged attention, and the KV cache available just after compilation with TensorRT-LLM, or do they require the Triton Inference Server? What if I just deploy the compiled model to GKE (after dockerizing it)?
Thanks

Hi @matatov.n - yes, we have an MMLU script (mmlu.py) that you can use for this purpose.
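As a rough illustration, this is how that evaluation might be invoked against the Llama 3 engine built earlier; the flag names here are assumptions and may differ between TensorRT-LLM versions, so check the script’s --help (the MMLU data set also has to be downloaded separately):

# Hypothetical mmlu.py invocation against the TRT-LLM engine (flag names are assumptions; verify with --help)
python3 examples/mmlu.py --test_trt_llm \
    --engine_dir ${ENGINE_PATH} \
    --hf_model_dir ${HF_LLAMA_MODEL}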

No, you don’t need the Triton Inference Server to use these TensorRT-LLM features. You can follow this example with the run script.
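A minimal sketch of running the compiled engine directly with the example run script, with no Triton component involved; the paths reuse the variables from earlier in this thread and are only illustrative:

# Run the compiled engine with the TensorRT-LLM example runner; --streaming emits tokens as they are generated
python3 examples/run.py \
    --engine_dir ${ENGINE_PATH} \
    --tokenizer_dir ${HF_LLAMA_MODEL} \
    --input_text "What is machine learning?" \
    --max_output_len 64 \
    --streaming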

Hi @anjshah,

Do you have some ready results showing that using TensorRT-LLM maintains the original (i.e. Llama 3) results?
I am afraid running the evaluations ourselves could be very expensive for us.

Yes, we do, and we plan to post them early next week.

Hi,

I am searching for a Dockerfile that, given the rank0.engine and config.json files, would expose an inference server. Any link for reference?

Thanks

Hi @anjshah,

Do you have a post with the evaluations?

Hi @matatov.n - if you mean the latest Llama 3 perf evaluations, yes, we’ve updated those on the perf-overview page of the GitHub docs. If you meant the MMLU evaluations, we have provided the mmlu.py script, which you can use as shown here.

@anjshah I have another question, about the gRPC API. I used your example and config for building a Docker image with Triton and Llama 3, and the /generate endpoint works well. But now I need to switch to gRPC with streaming support. So I created a Go gRPC client and inference request:

	request := proto.ModelInferRequest{
		Id:           uuid.New().String(),
		ModelName:    modelName,
		ModelVersion: "1",
		Inputs: []*proto.ModelInferRequest_InferInputTensor{
			{
				Name:     "text_input",
				Shape:    []int64{-1, -1},
				Datatype: "BYTES",
				Contents: &proto.InferTensorContents{
					BytesContents: [][]byte{[]byte(text)},
				},
			},
			{
				Name:     "max_tokens",
				Shape:    []int64{-1, -1},
				Datatype: "INT32",
				Contents: &proto.InferTensorContents{
					IntContents: []int32{maxTokens},
				},
			},
			{
				Name:     "stop_words",
				Shape:    []int64{-1, -1},
				Datatype: "BYTES",
				Contents: &proto.InferTensorContents{
					BytesContents: [][]byte{[]byte(stopWord)},
				},
			},
			{
				Name:     "bad_words",
				Shape:    []int64{-1, -1},
				Datatype: "BYTES",
				Contents: &proto.InferTensorContents{
					BytesContents: [][]byte{[]byte("")},
				},
			},
		},

		Outputs: []*proto.ModelInferRequest_InferRequestedOutputTensor{
			{
				Name: "text_output",
			},
		},
	}

In the response from Triton I received an error when calling the sync inference endpoint: rpc error: code = InvalidArgument desc = [request id: 98bb094c-7895-4448-926c-4c31d53b9be4] input 'max_tokens' batch size does not match other inputs for 'ensemble'
So what was wrong? I checked that each model has max_batch_size=64 and I have no idea what to change in the config or in the request…

Hi @prots.igor - For your “max_tokens” input there seems to be a typo. Can you check after replacing “INT32” with “BYTES”, as in the other inputs?

I used checkpoints from this repo: TheFloat16/Llama3-70b-Instruct-TRTLLM at main.
I have three Tesla V100-PCIE-32GB GPUs. I built the engines from those checkpoints on TRT-LLM 0.8.0 in the Docker container, just as described in the tutorial.
Building the engines went fine, but launching the server ends with an error.

[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
I0705 18:26:12.335485 693 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Parameter paged_state cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'paged_state' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Parameter paged_state cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'paged_state' not found
E0705 18:26:12.335788 694 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'lora_config' not found
E0705 18:26:12.335837 694 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'lora_config' not found
I0705 18:26:12.335849 694 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
I0705 18:26:12.335951 694 server.cc:607]

Hi @shacotustra - can you please uninstall v0.8.0 and follow the same steps, but replace 0.8.0 with 0.9.0? And use Triton Inference Server v24.4.
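A minimal sketch of that upgrade, assuming a pip-based TensorRT-LLM install (wheels come from NVIDIA’s PyPI index) and that “v24.4” refers to the 24.04 NGC Triton container with the TRT-LLM backend:

# Replace TensorRT-LLM 0.8.0 with 0.9.0 (the image tag below is an assumption for the "v24.4" Triton release)
pip3 uninstall -y tensorrt_llm
pip3 install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3

The engines would then need to be rebuilt with the 0.9.0 trtllm-build before relaunching the server, since TensorRT-LLM engines are tied to the version they were built with.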

When I try to build, an error occurs:

[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/77/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/77/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/77/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/77/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/77/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/78/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/78/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[TensorRT-LLM][WARNING] Note that alibi or sliding window attention are not supported for FMHA on Volta
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/78/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/78/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/79/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/79/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[TensorRT-LLM][WARNING] Note that alibi or sliding window attention are not supported for FMHA on Volta
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/79/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/79/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[07/05/2024-18:48:04] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[07/05/2024-18:48:04] [TRT] [W] Unused Input: position_ids
[07/05/2024-18:48:04] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[07/05/2024-18:48:04] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1837] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
[07/05/2024-18:48:04] [TRT-LLM] [E] Engine building failed, please check the error log.
[07/05/2024-18:48:04] [TRT] [I] Serialized 59 bytes of code generator cache.
[07/05/2024-18:48:04] [TRT] [I] Serialized 0 timing cache entries
[07/05/2024-18:48:04] [TRT-LLM] [I] Timing cache serialized to model.cache
[07/05/2024-18:48:04] [TRT-LLM] [I] Serializing engine to /TensorRT-LLM/TensorRT-LLM/Llama3-70B-engines/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 298, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 566, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
root@207d17ed1e43:/TensorRT-LLM/TensorRT-LLM#