MPI error after loading TensorRT engines on Triton

Description

I am trying to start a Triton Inference Server using the TensorRT-LLM backend on an Ubuntu machine with two A10 GPUs. I followed the instructions on GitHub to convert the Hugging Face checkpoint for the gemma-2-9b-it model and generate the engines. The engines appear to have been generated correctly, since running a test against them returns a response. However, when I launch the Triton server and load the engines, I get the following error:


MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[macchina-triton:00141] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[macchina-triton:00141] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[macchina-triton:00141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Environment

TensorRT Version: 10.6.0
GPU Type: A10
Nvidia Driver Version: 535.216.03
CUDA Version: 12.6
CUDNN Version: 9.5.0
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.5.1
Baremetal or Container (if container which image + tag): tritonserver:24.11-trtllm-python-py3

Relevant Files

Model: google/gemma-2-9b-it on Hugging Face
Instructions followed: examples/gemma/README.md in the NVIDIA/TensorRT-LLM GitHub repository

Steps To Reproduce

Convert checkpoint:

  1. Clone the GitHub repo: https://github.com/NVIDIA/TensorRT-LLM.git and navigate to the examples/gemma directory.
  2. Install the dependencies from requirements.txt.
  3. Run the following command:
python3 ./convert_checkpoint.py \
    --ckpt-type hf \
    --model-dir ${CKPT_PATH} \
    --dtype bfloat16 \
    --world-size 2 \
    --output-model-dir ${UNIFIED_CKPT_PATH} 
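
CKPT_PATH and UNIFIED_CKPT_PATH above are placeholders for local paths; purely for illustration, they could be set along these lines (not the actual directories used):

CKPT_PATH=/workspace/models/gemma-2-9b-it                      # Hugging Face checkpoint directory
UNIFIED_CKPT_PATH=/workspace/trt_ckpts/gemma-2-9b-it/bf16/tp2  # converted TensorRT-LLM checkpoint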

Obtain Engine:

The engine is generated with the following command:

trtllm-build --checkpoint_dir ${CKP_DIR} \
    --output_dir ${OUT_DIR} \
    --gemm_plugin auto
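
At this point the engines can be sanity-checked outside Triton; a test along these lines confirms they respond (a sketch with placeholder paths, assuming examples/run.py from the TensorRT-LLM repo; mpirun -n 2 matches the world size the engines were built with):

mpirun -n 2 --allow-run-as-root \
    python3 ../run.py \
        --engine_dir ${OUT_DIR} \
        --tokenizer_dir ${CKPT_PATH} \
        --max_output_len 64 \
        --input_text "What is the capital of Italy?"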

Engine settings

In the config.json for the engines, set the following parameters:

 "opt_batch_size"=2,
 "max_batch_size"=1024,
 "gpus_per_node" = 2

Start Triton container

I cloned the triton-inference-server/tensorrtllm_backend repository (the Triton TensorRT-LLM backend) to obtain the structure of the model repository. TRTLLM_DIR refers to the directory of the cloned repository.

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
  -v ${TRTLLM_DIR}:/tensorrtllm_backend \
  -v ${ENGINE_DIR}:/engines \
  -v ${MODEL_DIR}:/models \
  nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
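
Inside the container, a quick check confirms that both A10s are visible before going any further:

# Run inside the container before setting up the model repository
nvidia-smi -L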

Set model repo

Set up the model repository with the following commands:

mkdir /triton_model_repo
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_model_repo/

Set the following environment variables:

ENGINE_DIR=/engines 
TOKENIZER_DIR=/models 
MODEL_FOLDER=/triton_model_repo 
TRITON_MAX_BATCH_SIZE=4 
INSTANCE_COUNT=1 
MAX_QUEUE_DELAY_MS=0 
MAX_QUEUE_SIZE=0 
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py 
DECOUPLED_MODE=false 

Run the fill_template.py script to fill in the template configuration files:

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
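
As a quick check that the templates were filled completely, any leftover ${...} placeholders can be listed; unfilled ones surface later as warnings such as the 'add_special_tokens' / 'skip_special_tokens' messages in the log below:

# Lists any placeholders that fill_template.py did not substitute
grep -n '\${' ${MODEL_FOLDER}/*/config.pbtxt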

Start inference server

Finally, start the inference server with the following command:

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=2 --model_repo=${MODEL_FOLDER}  
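
For context, launch_triton_server.py with --world_size=2 wraps both server ranks in a single mpirun invocation, roughly along these lines (a simplified sketch, not the script's exact command; rank 0 serves the endpoints while the extra rank only loads the tensorrt_llm model, which matches the two PIDs, 145 and 146, in the log below):

# Simplified sketch of what the launcher runs (flags approximate)
mpirun --allow-run-as-root \
    -n 1 tritonserver --model-repository=${MODEL_FOLDER} --disable-auto-complete-config : \
    -n 1 tritonserver --model-repository=${MODEL_FOLDER} --disable-auto-complete-config \
        --model-control-mode=explicit --load-model=tensorrt_llm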

Log

I1218 16:10:56.313260 145 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7c179a000000' with size 268435456"
I1218 16:10:56.314134 146 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x77576e000000' with size 268435456"
I1218 16:10:56.328184 145 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1218 16:10:56.328198 145 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I1218 16:10:56.329718 146 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1218 16:10:56.329732 146 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
W1218 16:10:56.524914 145 server.cc:249] "failed to enable peer access for some device pairs"
I1218 16:10:56.526947 145 model_lifecycle.cc:473] "loading: postprocessing:1"
I1218 16:10:56.526990 145 model_lifecycle.cc:473] "loading: preprocessing:1"
I1218 16:10:56.527037 145 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I1218 16:10:56.527065 145 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
W1218 16:10:56.527874 146 server.cc:249] "failed to enable peer access for some device pairs"
I1218 16:10:56.529130 146 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I1218 16:10:56.611439 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1218 16:10:56.611445 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1218 16:10:56.656910 146 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1218 16:10:56.656938 146 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1218 16:10:56.656942 146 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1218 16:10:56.656945 146 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1218 16:10:56.660622 145 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1218 16:10:56.660649 145 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1218 16:10:56.660653 145 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1218 16:10:56.660658 145 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I1218 16:10:56.663443 146 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I1218 16:10:56.678529 145 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
I1218 16:10:56.678614 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 42
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 42
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1218 16:10:56.983431 145 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I1218 16:10:59.283743 145 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
I1218 16:10:59.293534 145 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 9710 MiB
[TensorRT-LLM][INFO] Loaded engine size: 9710 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[macchina-triton:00141] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[macchina-triton:00141] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[macchina-triton:00141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The error "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1" means that one of the MPI ranks (here, the second tritonserver process) hit a fatal error and asked Open MPI to terminate the whole job. Here are some potential causes and solutions for your situation:

Possible Causes:

  1. Incompatible MPI Library Versions: The MPI/PMIx libraries in the container may not match what the TensorRT-LLM backend or the launch script expects.
  2. GPU Driver Issues: There could be compatibility issues between the GPU driver and the TensorRT/TensorRT-LLM versions in use.
  3. Resource Allocation Problems: Insufficient GPU memory or conflicts with other processes on the machine may cause a rank to abort right after the engines are loaded.
  4. Network Communication Issues: Problems in communication between the two ranks in the MPI communicator may cause the error, even on a single node.

Possible Solutions:

  1. Update MPI Library: Check that the MPI library in the container matches what Triton and TensorRT-LLM were built against, and update it if needed (see the checks sketched after this list).
  2. Update GPU Drivers: Ensure that your GPU driver is up to date and compatible with the TensorRT version you are using.
  3. Check Resource Allocation: Monitor resource usage on the machine to confirm that both A10s have enough free memory for the engine plus the KV cache allocated by the Triton server.
  4. Verify Network Settings: Ensure that the container's network settings (here, --net host) allow communication between the MPI ranks.
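
As a concrete starting point for the first three items, the following checks can be run inside the same 24.11 container (standard tools that ship with Open MPI and the NVIDIA driver):

# Open MPI / PMIx versions bundled in the container
mpirun --version
ompi_info | grep -i "open mpi:"
# Driver version, free GPU memory, and interconnect topology of the two A10s
nvidia-smi
nvidia-smi topo -m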

By investigating these possible causes, you can narrow down what triggers the "MPI_ABORT" error and get the Triton inference server to start successfully.

For additional help, I would recommend reaching out to the Triton Inference Server forum.

Thanks