Supercharging Llama 3.1 across NVIDIA Platforms

jwitsoe · July 23, 2024, 3:15pm

Originally published at: Supercharging Llama 3.1 across NVIDIA Platforms | NVIDIA Technical Blog

Meta’s Llama collection of large language models are the most popular foundation models in the open-source community today, supporting a variety of use cases. Millions of developers worldwide are building derivative models, and are integrating these into their applications. With Llama 3.1, Meta is launching a suite of large language models (LLMs) as well as…

dhruv13 · August 28, 2024, 11:12pm

I’ve tried running llama 3.1 with tensor parallelism, but it seems like the functionality is now broken on triton + tensorrtllm_backend?

github.com/triton-inference-server/tensorrtllm_backend

Unable to launch triton server with TP

opened 06:07PM - 19 Aug 24 UTC

dhruvmullick

bug

### System Info Built tensorrtllm_backend from source using dockerfile/Docker…file.trt_llm_backend tensorrt_llm 0.13.0.dev2024081300 tritonserver 2.48.0 triton image: 24.07 Cuda 12.5 ### Who can help? @Tracin @kaiyux @schetlur-nv ### Information - [X] The official example scripts - [ ] My own modified scripts ### Tasks - [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...) - [ ] My own task or dataset (give details below) ### Reproduction I've built a TRTLLM engine for meta llama 3 8B and I'm seeing the triton server get stuck while spawning if using tensor parallelism > 1. Things work if I don't use tp while building the engine and spawning it. Build the Engine: ``` python3 quantize.py --model_dir meta_llama_3_8B_instruct_fp16 \ --dtype float16 \ --qformat int4_awq \ --awq_block_size 128 \ --output_dir /tmp/trt_checkpoint \ --batch_size 8 \ --calib_size 32 \ --tp_size 2 CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir /tmp/trt_checkpoint \ --gemm_plugin float16 \ --gpt_attention_plugin float16 \ --kv_cache_type=paged \ --remove_input_padding enable \ --context_fmha enable \ --use_paged_context_fmha enable \ --max_seq_len 8000 \ --max_num_tokens 4096 \ --max_batch_size 8 \ --output_dir trt_model \ --log_level verbose \ --multiple_profiles enable \ --workers 2 ``` Command used to launch the server: ``` python3 launch_triton_server.py --model_repo=triton_model_repo_copy --world_size 2 --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt --log --log-file /tmp/logs.txt --force ``` ### Expected behavior The server should spawn and start serving requests on localhost. ### actual behavior I see the logs on the console: ``` I0819 17:56:19.460307 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864" I0819 17:56:19.460335 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864" I0819 17:56:19.752444 16867 model_lifecycle.cc:472] "loading: meta_llama_3_8B_instruct_trt:1" I0819 17:56:19.918688 16867 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm" I0819 17:56:19.918732 16867 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19" I0819 17:56:19.918737 16867 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19" I0819 17:56:19.918742 16867 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}" [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha). [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false [TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false I0819 17:56:19.933735 16867 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_8B_instruct_trt (version 1)" [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha). [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false [TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty [TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Initialized MPI [TensorRT-LLM][INFO] Initialized MPI [TensorRT-LLM][INFO] Refreshed the MPI local session [TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1 [TensorRT-LLM][INFO] Refreshed the MPI local session [TensorRT-LLM][INFO] Rank 1 is using GPU 1 [TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0 [TensorRT-LLM][INFO] Rank 0 is using GPU 0 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8 [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1 [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000 [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096 [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999 = maxSequenceLen - 1 since chunked context is enabled [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8 [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1 [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000 [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0 [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen. [TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096 [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999 = maxSequenceLen - 1 since chunked context is enabled [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen. [TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None [TensorRT-LLM][INFO] Loaded engine size: 2771 MiB [TensorRT-LLM][INFO] Loaded engine size: 2771 MiB [TensorRT-LLM][INFO] Detecting local TP group for rank 1 [TensorRT-LLM][INFO] Detecting local TP group for rank 0 [TensorRT-LLM][INFO] TP group is intra-node for rank 0 [TensorRT-LLM][INFO] TP group is intra-node for rank 1 [TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory. [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB) [TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles [TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder. [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB [TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder. [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210 [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210 [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true [TensorRT-LLM][INFO] Max KV cache pages per sequence: 125 [TensorRT-LLM][INFO] Max KV cache pages per sequence: 125 [TensorRT-LLM][INFO] Number of tokens per block: 64. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440). [TensorRT-LLM][INFO] Number of tokens per block: 64. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440). [TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms) [TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms) [TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms) [TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms) [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha). [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false [TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty [TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API. ``` In the /tmp/logs.txt file, I see the last output: ``` I0819 17:57:02.475316 16866 backend_model_instance.cc:783] "Starting backend thread for meta_llama_3_8B_instruct_trt_0_0 at nice 0 on device 0..." I0819 17:57:02.475672 16866 backend_model.cc:675] "Created model instance named 'meta_llama_3_8B_instruct_trt_0_0' with device id '0'" ``` And nothing after this. ### additional notes NA

anjshah · August 29, 2024, 12:37am

Have you tried following the suggestion in this thread? Disabling trt overlap and using same batch size between Triton and engine build?

dhruv13 · August 29, 2024, 4:41pm

I’ve tried same batch size, but there was no benefit. Updated thread Unable to launch triton server with TP · Issue #577 · triton-inference-server/tensorrtllm_backend · GitHub

trt overlap is already disabled too in the latest tensorrt_llm packages (I’m building from source to support llama3.1)

The step that worked for most people was to disable custom_all_reduce, but that’s not possible in the latest trtllm-build cli.

@anjshah, do you have recommendations? This seems like a problem common to quite a few people, judging by the Issues.

anjshah · August 29, 2024, 7:10pm

Hi @dhruv13 - let me try to repro the issue on my end and will get back!

dhruv13 · August 29, 2024, 10:50pm

Thank you @anjshah ! It probably doesn’t matter if you’re using the latest trtllm package, but if you’re building from source you could use the commit: GitHub - NVIDIA/TensorRT-LLM at 74b324f6673d1d8a836e05e506dea2234b22ccc8

anjshah · August 29, 2024, 11:09pm

Hi @dhruv13 - Can you try with v0.12 as shared in this thread? TensorRT-LLM v0.12.0 was just released today and introduced many build command changes. Updated steps are here. Please keep us posted with log file details if you are still encountering issues

dhruv13 · August 30, 2024, 1:04am

Sure. I was on a 0.13 dev build, but I’ll try the official 0.12 release now and get back here.
Thanks!

dhruv13 · August 30, 2024, 4:21pm

I tried the latest triton server image, but to no avail. I’ve pasted the logs in Github (same link as earlier)
Thanks!

anjshah · August 30, 2024, 5:26pm

which triton server image did you use?

dhruv13 · August 30, 2024, 5:46pm

It’s nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
Thanks!

anjshah · August 30, 2024, 9:37pm

Can you use the officially released v0.12.0 of TensorRT-LLM, not the dev version? And can you try without enabling --reduce_fusion? It’s disabled by default, so please try as per the steps here

dhruv13 · September 3, 2024, 3:03pm

Hi @anjshah, can you recommend which official docker image to use for triton server with trtllm backend?

Note that as per this comment from Kris Hung (Nvidia), the versions I’m using are the officially supported versions in the latest triton server with TRTLLM backend.

anjshah · September 17, 2024, 6:54pm

Hi @dhruv13 - have you tried using nvcr.io/nvidia/tritonserver:24.07-py3-min?

Topic		Replies	Views
Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server Technical Blog	62	4064	August 28, 2024
NVIDIA TensorRT-LLM 및 NVIDIA Triton Inference Server로 Meta Llama 3 성능 강화 Technical Blog - South Korea	1	320	May 3, 2024
Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available Technical Blog	8	1878	January 25, 2024
NVIDIA 플랫폼 전반에서 Llama 3.1 강화하기 Technical Blog - South Korea llama	1	51	August 2, 2024
Tune and Deploy LoRA LLMs with NVIDIA TensorRT-LLM Technical Blog	3	573	April 18, 2024
Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron Technical Blog	3	1025	March 31, 2023
Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs Technical Blog llama	2	94	September 17, 2024
Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding Technical Blog llama	3	231	February 3, 2025
Advancing the Accuracy-Efficiency Frontier with Llama-3.1-Nemotron-51B Technical Blog llama	3	109	October 24, 2024
Recommend Compute for running a TensorRT-LLM using LLama2 13B & 70B model TensorRT	2	1096	November 15, 2023

Supercharging Llama 3.1 across NVIDIA Platforms

Related topics