### System Info
Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
tensorrt_llm 0.13.0.dev2024081300
tritonserver 2.48.0
Triton image: 24.07
CUDA 12.5
### Who can help?
@Tracin @kaiyux @schetlur-nv
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
I've built a TensorRT-LLM engine for Meta Llama 3 8B, and the Triton server gets stuck during startup when tensor parallelism > 1 is used. Everything works if I build and launch the engine without tensor parallelism.
Build the engine:
```
python3 quantize.py --model_dir meta_llama_3_8B_instruct_fp16 \
--dtype float16 \
--qformat int4_awq \
--awq_block_size 128 \
--output_dir /tmp/trt_checkpoint \
--batch_size 8 \
--calib_size 32 \
--tp_size 2
CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir /tmp/trt_checkpoint \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--kv_cache_type=paged \
--remove_input_padding enable \
--context_fmha enable \
--use_paged_context_fmha enable \
--max_seq_len 8000 \
--max_num_tokens 4096 \
--max_batch_size 8 \
--output_dir trt_model \
--log_level verbose \
--multiple_profiles enable \
--workers 2
```
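To check whether the hang is specific to the Triton backend, the TP=2 engine can also be exercised standalone under MPI with TensorRT-LLM's example runner. This is only a sketch: `run.py` lives under the TensorRT-LLM `examples` directory, and the exact flags may differ between versions.
```
# Sketch: run the TP=2 engine outside Triton to isolate the hang.
# run.py is the generic runner from the TensorRT-LLM examples folder;
# the paths below match the build commands above.
mpirun -n 2 --allow-run-as-root \
    python3 examples/run.py \
        --engine_dir trt_model \
        --tokenizer_dir meta_llama_3_8B_instruct_fp16 \
        --max_output_len 64 \
        --input_text "Hello, my name is"
```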
Command used to launch the server:
```
python3 launch_triton_server.py --model_repo=triton_model_repo_copy \
    --world_size 2 \
    --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt \
    --log \
    --log-file /tmp/logs.txt \
    --force
```
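Since the hang only shows up with tensor parallelism > 1, NCCL initialization is a plausible suspect. Re-launching with NCCL debug logging enabled can show whether the two ranks stall while setting up the TP communicator; this is a sketch using the standard `NCCL_DEBUG`/`NCCL_DEBUG_SUBSYS` environment variables, which the launcher's `mpirun` should forward to the ranks.
```
# Sketch: enable NCCL debug output to see whether the ranks hang
# inside communicator setup rather than in the Triton backend itself.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT \
python3 launch_triton_server.py --model_repo=triton_model_repo_copy \
    --world_size 2 \
    --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt \
    --log --log-file /tmp/logs.txt --force
```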
### Expected behavior
The server should start up and begin serving requests on localhost.
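Readiness can be verified with Triton's standard health endpoint (a sketch, assuming the default HTTP port 8000):
```
# Sketch: Triton returns HTTP 200 here once all models are loaded.
curl -sf localhost:8000/v2/health/ready && echo "ready" || echo "not ready"
```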
### Actual behavior
I see the following logs on the console:
```
I0819 17:56:19.460307 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0819 17:56:19.460335 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0819 17:56:19.752444 16867 model_lifecycle.cc:472] "loading: meta_llama_3_8B_instruct_trt:1"
I0819 17:56:19.918688 16867 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0819 17:56:19.918732 16867 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0819 17:56:19.918737 16867 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0819 17:56:19.918742 16867 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
I0819 17:56:19.933735 16867 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_8B_instruct_trt (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2771 MiB
[TensorRT-LLM][INFO] Loaded engine size: 2771 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 125
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 125
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
```
In /tmp/logs.txt, the last lines written are:
```
I0819 17:57:02.475316 16866 backend_model_instance.cc:783] "Starting backend thread for meta_llama_3_8B_instruct_trt_0_0 at nice 0 on device 0..."
I0819 17:57:02.475672 16866 backend_model.cc:675] "Created model instance named 'meta_llama_3_8B_instruct_trt_0_0' with device id '0'"
```
And nothing after this.
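To locate where the ranks are blocked, the thread stacks of the hung `tritonserver` processes can be dumped with gdb. A sketch: the PIDs (16866/16867 in the logs above) are from my run and will differ.
```
# Sketch: dump all thread backtraces of one hung rank to see whether
# it is blocked in an MPI/NCCL call. Replace 16867 with the actual PID.
gdb -p 16867 -batch -ex "thread apply all bt"
```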
### Additional notes
NA