Description
I am trying to start a Triton Inference Server with the TensorRT-LLM backend on an Ubuntu machine with 2 A10 GPUs. I followed the instructions on GitHub to convert the Hugging Face checkpoint for the gemma-2-9b-it model and to generate the engines. These appear to have been generated correctly, because running a test on them returns a response (a sketch of that test is included after the error below). However, when I launch the Triton server and load the engines, I get the following error:
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
[macchina-triton:00141] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[macchina-triton:00141] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[macchina-triton:00141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
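For reference, the test that does return a response is a direct run of the engines outside Triton. This is only a sketch of that check: it assumes the standard examples/run.py script from the TensorRT-LLM repo, and the paths and prompt are placeholders.
# Hedged sketch of the standalone engine test (not the Triton launch).
# Two MPI ranks to match the world size the engines were built with;
# ${ENGINE_DIR} and ${CKPT_PATH} are placeholders for my local paths.
mpirun -n 2 --allow-run-as-root \
  python3 TensorRT-LLM/examples/run.py \
    --engine_dir ${ENGINE_DIR} \
    --tokenizer_dir ${CKPT_PATH} \
    --max_output_len 64 \
    --input_text "What is the capital of Italy?"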
Environment
TensorRT Version: 10.6.0
GPU Type: A10
Nvidia Driver Version: 535.216.03
CUDA Version: 12.6
CUDNN Version: 9.5.0
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.5.1
Baremetal or Container (if container which image + tag): tritonserver:24.11-trtllm-python-py3
Relevant Files
Model: google/gemma-2-9b-it (https://huggingface.co/google/gemma-2-9b-it)
Instructions followed: TensorRT-LLM/examples/gemma/README.md (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/gemma/README.md)
Steps To Reproduce
Convert checkpoint:
- Clone the GitHub repo https://github.com/NVIDIA/TensorRT-LLM.git and navigate to the examples/gemma directory.
- Install requirements.txt.
- Run the following command:
python3 ./convert_checkpoint.py \
--ckpt-type hf \
--model-dir ${CKPT_PATH} \
--dtype bfloat16 \
--world-size 2 \
--output-model-dir ${UNIFIED_CKPT_PATH}
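As a quick sanity check (my assumption about the converter's output layout, not something from the README), the converted checkpoint directory should contain one weight file per rank plus a config.json, since the conversion was done with --world-size 2:
ls ${UNIFIED_CKPT_PATH}
# expected (assumption): config.json  rank0.safetensors  rank1.safetensors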
Obtain Engine:
The engine is generated with the following command:
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--output_dir ${OUT_DIR} \
--gemm_plugin auto
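Because the checkpoint was split for a world size of 2, the build should produce one engine per rank (again an assumption about the usual trtllm-build output layout):
ls ${OUT_DIR}
# expected (assumption): config.json  rank0.engine  rank1.engine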
Engine settings
In the config.json for the engines, set the following parameters:
"opt_batch_size": 2,
"max_batch_size": 1024,
"gpus_per_node": 2
Start Triton container
I cloned the triton-inference-server/tensorrtllm_backend repository (https://github.com/triton-inference-server/tensorrtllm_backend) to obtain the structure of the model repository. TRTLLM_DIR refers to the directory of the cloned repository.
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
-v ${TRTLLM_DIR}:/tensorrtllm_backend \
-v ${ENGINE_DIR}:/engines \
-v ${MODEL_DIR}:/models \
nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3
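Inside the container, it is worth confirming that both GPUs are visible before setting up the model repository (a generic check, not part of the backend instructions):
# Both A10s should be listed; if not, the --gpus all flag or the driver setup is the first suspect.
nvidia-smi -L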
Set model repo
Set up the model repository with the following commands:
mkdir /triton_model_repo
cp -r /tensorrtllm_backend/all_models/inflight_batcher_llm/* /triton_model_repo/
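After the copy, the model repository should contain the five model directories shipped with inflight_batcher_llm:
ls /triton_model_repo
# ensemble  postprocessing  preprocessing  tensorrt_llm  tensorrt_llm_bls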
Set the following environment variables:
ENGINE_DIR=/engines
TOKENIZER_DIR=/models
MODEL_FOLDER=/triton_model_repo
TRITON_MAX_BATCH_SIZE=4
INSTANCE_COUNT=1
MAX_QUEUE_DELAY_MS=0
MAX_QUEUE_SIZE=0
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=false
Execute the following Python commands to fill the template files:
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT},max_queue_size:${MAX_QUEUE_SIZE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
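As a quick check that the templates were actually filled (a generic grep, not from the official instructions), no ${...} placeholders should remain in the generated config.pbtxt files; note that the log below still reports ${add_special_tokens} and ${skip_special_tokens} as not set correctly:
# Any hit here is a template parameter that fill_template.py did not substitute.
grep -rn '\${' ${MODEL_FOLDER} --include=config.pbtxt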
Start inference server
Finally, start the inference server with the following command:
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=2 --model_repo=${MODEL_FOLDER}
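When the server does start successfully, readiness can be confirmed through Triton's standard HTTP health endpoint (here the crash below happens first):
# Returns 200 once all models are loaded and the server is ready.
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready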
Log
I1218 16:10:56.313260 145 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7c179a000000' with size 268435456"
I1218 16:10:56.314134 146 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x77576e000000' with size 268435456"
I1218 16:10:56.328184 145 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1218 16:10:56.328198 145 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I1218 16:10:56.329718 146 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I1218 16:10:56.329732 146 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
W1218 16:10:56.524914 145 server.cc:249] "failed to enable peer access for some device pairs"
I1218 16:10:56.526947 145 model_lifecycle.cc:473] "loading: postprocessing:1"
I1218 16:10:56.526990 145 model_lifecycle.cc:473] "loading: preprocessing:1"
I1218 16:10:56.527037 145 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I1218 16:10:56.527065 145 model_lifecycle.cc:473] "loading: tensorrt_llm_bls:1"
W1218 16:10:56.527874 146 server.cc:249] "failed to enable peer access for some device pairs"
I1218 16:10:56.529130 146 model_lifecycle.cc:473] "loading: tensorrt_llm:1"
I1218 16:10:56.611439 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I1218 16:10:56.611445 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I1218 16:10:56.656910 146 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1218 16:10:56.656938 146 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1218 16:10:56.656942 146 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1218 16:10:56.656945 146 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
I1218 16:10:56.660622 145 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I1218 16:10:56.660649 145 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I1218 16:10:56.660653 145 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I1218 16:10:56.660658 145 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I1218 16:10:56.663443 146 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] participant_ids is not specified, will be automatically set
I1218 16:10:56.678529 145 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: tensorrt_llm (version 1)"
I1218 16:10:56.678614 145 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] cross_kv_cache_fraction is not specified, error if it's encoder-decoder model, otherwise ok
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multi_block_mode is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_context_fmha_fp32_acc is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_mode is not specified, will be set to false
[TensorRT-LLM][WARNING] cuda_graph_cache_size is not specified, will be set to 0
[TensorRT-LLM][INFO] speculative_decoding_fast_logits is not specified, will be set to false
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] recv_poll_period_ms is not set, will use busy loop
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.16.0.dev2024120300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 42
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8192
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8192) * 42
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8191 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
I1218 16:10:56.983431 145 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'"
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
I1218 16:10:59.283743 145 model_lifecycle.cc:849] "successfully loaded 'preprocessing'"
I1218 16:10:59.293534 145 model_lifecycle.cc:849] "successfully loaded 'postprocessing'"
[TensorRT-LLM][INFO] Loaded engine size: 9710 MiB
[TensorRT-LLM][INFO] Loaded engine size: 9710 MiB
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[macchina-triton:00141] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[macchina-triton:00141] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[macchina-triton:00141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages