Description
I am trying to start a Triton Inference Server with the TensorRT-LLM backend on an Ubuntu machine with 2 A10 GPUs. I followed the instructions on GitHub to convert the Hugging Face checkpoint for the gemma-2-9b-it model and to generate the engines. These appear to have been generated correctly, because running a test on them returns a response (a sketch of that test is included after the error below). However, when I launch the Triton server and load the engines, I get the following error:
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
[macchina-triton:00141] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[macchina-triton:00141] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[macchina-triton:00141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
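For reference, the test that does return a response is a direct run of the engines outside Triton. This is only a sketch of that check: it assumes the standard examples/run.py script from the TensorRT-LLM repo, and the paths and prompt are placeholders.
# Hedged sketch of the standalone engine test (not the Triton launch).
# Two MPI ranks to match the world size the engines were built with;
# ${ENGINE_DIR} and ${CKPT_PATH} are placeholders for my local paths.
mpirun -n 2 --allow-run-as-root \
  python3 TensorRT-LLM/examples/run.py \
    --engine_dir ${ENGINE_DIR} \
    --tokenizer_dir ${CKPT_PATH} \
    --max_output_len 64 \
    --input_text "What is the capital of Italy?"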
Environment
TensorRT Version: 10.6.0
GPU Type: A10
Nvidia Driver Version: 535.216.03
CUDA Version: 12.6
CUDNN Version: 9.5.0
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.5.1
Baremetal or Container (if container which image + tag): tritonserver:24.11-trtllm-python-py3
Relevant Files
Model: google/gemma-2-9b-it (https://huggingface.co/google/gemma-2-9b-it)
Instructions followed: TensorRT-LLM/examples/gemma/README.md (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/gemma/README.md)
Steps To Reproduce
Convert checkpoint:
- Clone the GitHub repo https://github.com/NVIDIA/TensorRT-LLM.git and navigate to the examples/gemma directory.
- Install requirements.txt.
- Run the following command:
python3 ./convert_checkpoint.py \
--ckpt-type hf \
--model-dir ${CKPT_PATH} \
--dtype bfloat16 \
--world-size 2 \
--output-model-dir ${UNIFIED_CKPT_PATH}
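As a quick sanity check (my assumption about the converter's output layout, not something from the README), the converted checkpoint directory should contain one weight file per rank plus a config.json, since the conversion was done with --world-size 2:
ls ${UNIFIED_CKPT_PATH}
# expected (assumption): config.json  rank0.safetensors  rank1.safetensors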
Obtain Engine:
The engine is generated with the following command:
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--output_dir ${OUT_DIR} \
--gemm_plugin auto
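Because the checkpoint was split for a world size of 2, the build should produce one engine per rank (again an assumption about the usual trtllm-build output layout):
ls ${OUT_DIR}
# expected (assumption): config.json  rank0.engine  rank1.engine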
Engine settings
In the config.json for the engines, set the following parameters:
"opt_batch_size": 2,
"max_batch_size": 1024,
"gpus_per_node": 2
Start Triton container
I cloned the triton-inference-server/tensorrtllm_backend repository (https://github.com/triton-inference-server/tensorrtllm_backend) to obtain the structure of the model repository. TRTLLM_DIR refers to the directory of the cloned repository.
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v ${TRTLLM_DIR}:/tensorrtllm_backend \
-v ${ENGINE_DIR}:/engines \
-v ${MODEL_DIR}:/models \
nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
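Inside the container, it is worth confirming that both GPUs are visible before setting up the model repository (a generic check, not part of the backend instructions):
# Both A10s should be listed; if not, the --gpus all flag or the driver setup is the first suspect.
nvidia-smi -L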
Set model repo
Set up the model repository with the following commands:
mkdir /triton_model_repo
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_model_repo/
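After the copy, the model repository should contain the five model directories shipped with inflight_batcher_llm:
ls /triton_model_repo
# ensemble  postprocessing  preprocessing  tensorrt_llm  tensorrt_llm_bls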
Set the following environment variables:
ENGINE_DIR=/engines
TOKENIZER_DIR=/models
MODEL_FOLDER=/triton_model_repo
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=false
Execute the following Python commands to fill the template files:
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
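As a quick check that the templates were actually filled (a generic grep, not from the official instructions), no ${...} placeholders should remain in the generated config.pbtxt files; note that the log below still reports ${add_special_tokens} and ${skip_special_tokens} as not set correctly:
# Any hit here is a template parameter that fill_template.py did not substitute.
grep -rn '\${' ${MODEL_FOLDER} --include=config.pbtxt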
Start inference server
Finally, start the inference server with the following command:
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=2 --model_repo=${MODEL_FOLDER}
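When the server does start successfully, readiness can be confirmed through Triton's standard HTTP health endpoint (here the crash below happens first):
# Returns 200 once all models are loaded and the server is ready.
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready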
Log
I1218 16:10:56.313260 145 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7c179a000000' with size 268435456"
I1218 16:10:56.314134 146 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x77576e000000' with size 268435456"
I1218 16:10:56.328184 145 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1218 16:10:56.328198 145 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I1218 16:10:56.329718 146 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1218 16:10:56.329732 146 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
W1218 16:10:56.524914 145 server.cc:249] "failed to enable peer access for some device pairs"
I1218 16:10:56.526947 145 model_lifecycle.cc:473] "loading: postprocessing:1"
I1218 16:10:56.526990 145 model_lifecycle.cc:473] "loading: preprocessing:1"
I1218 16:10:56.527037 145 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I1218 16:10:56.527065 145 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
W1218 16:10:56.527874 146 server.cc:249] "failed to enable peer access for some device pairs"
I1218 16:10:56.529130 146 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I1218 16:10:56.611439 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1218 16:10:56.611445 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1218 16:10:56.656910 146 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1218 16:10:56.656938 146 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1218 16:10:56.656942 146 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1218 16:10:56.656945 146 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1218 16:10:56.660622 145 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1218 16:10:56.660649 145 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1218 16:10:56.660653 145 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1218 16:10:56.660658 145 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I1218 16:10:56.663443 146 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I1218 16:10:56.678529 145 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
I1218 16:10:56.678614 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 42
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 42
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1218 16:10:56.983431 145 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I1218 16:10:59.283743 145 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
I1218 16:10:59.293534 145 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 9710 MiB
[TensorRT-LLM][INFO] Loaded engine size: 9710 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[macchina-triton:00141] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[macchina-triton:00141] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[macchina-triton:00141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages