Error Code 1: Serialization (Serialization assertion safeVersionRead== kSAFE_SERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 239)

Description

I'm using an Amazon EC2 g5.8xlarge instance, following the document below, and trying to deploy the Meta-Llama-3-8B-Instruct model with Triton Inference Server.

I also referred to the compatibility matrix below:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/introduction/compatibility.html
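
Per that matrix, Triton release 24.06 lines up with TensorRT-LLM 0.10.0. For context, I'm running everything inside the matching NGC container; the exact image tag below is my assumption based on the usual naming scheme:

docker run --rm -it --gpus all --net host --shm-size 2g -v $(pwd):/workspace nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3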

I’m using
Triton release version: 24.06
Python version: 3.10.12
TensorRT-LLM version: 0.10.0
CUDA version: 12.4.0.041
CUDA Driver version: 550.54.14

I used the 0.10.0 branch of the triton-inference-server/tensorrtllm_backend GitHub repository (the Triton TensorRT-LLM backend), and version 0.10.0 of the tensorrt_llm pip package.
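
For reference, I pinned the wheel roughly like this (the NVIDIA extra index is where the tensorrt_llm wheels are hosted, but treat the exact command as a sketch of what I ran):

pip3 install tensorrt_llm==0.10.0 --extra-index-url https://pypi.nvidia.com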

Upon executing the last command of the document:

python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
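
(As far as I understand, this script just wraps tritonserver in mpirun, one rank per GPU, so with --world_size 1 it should be roughly equivalent to:

mpirun --allow-run-as-root -n 1 tritonserver --model-repository=tensorrtllm_backend/all_models/inflight_batcher_llm )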

I'm getting the error below:
I0206 15:23:04.598706 721 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fc780000000' with size 268435456"
I0206 15:23:04.599117 721 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0206 15:23:04.602520 721 model_lifecycle.cc:472] "loading: postprocessing:1"
I0206 15:23:04.602579 721 model_lifecycle.cc:472] "loading: preprocessing:1"
I0206 15:23:04.602641 721 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0206 15:23:04.602685 721 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
I0206 15:23:04.679505 721 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0206 15:23:04.679543 721 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
I0206 15:23:04.728032 721 python_be.cc:1912] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][INFO] Engine version 0.10.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I0206 15:23:04.971970 721 model_lifecycle.cc:838] "successfully loaded 'tensorrt_llm_bls'"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0206 15:23:06.683378 721 model_lifecycle.cc:838] "successfully loaded 'postprocessing'"
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I0206 15:23:06.690691 721 model_lifecycle.cc:838] "successfully loaded 'preprocessing'"
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed mMaxSequenceLen. Therefore, it has been adjusted to match the value of mMaxSequenceLen.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 1
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 15324 MiB
[TensorRT-LLM][ERROR] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::42] Error Code 1: Serialization (Serialization assertion safeVersionRead== kSAFE_SERIALIZATION_VERSION failed.Version tag does not match. Note: Current Version: 0, Serialized Engine Version: 239)

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
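
From what I can tell, this serialization assertion means the TensorRT runtime that is loading the engine is a different version from the TensorRT that serialized it, so the version tags don't match. To compare the two sides, I checked the versions both in the environment where the engine was built and inside the container that serves it:

python3 -c "import tensorrt; print(tensorrt.__version__)"
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

If those differ, I assume the fix is to rebuild the engine with trtllm-build inside the same 24.06 container that runs Triton (the checkpoint and output paths below are placeholders):

trtllm-build --checkpoint_dir ./llama3_8b_ckpt --output_dir ./engines/llama3_8b --gemm_plugin float16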

Environment

Triton release version: 24.06
Python version: 3.10.12
TensorRT-LLM version: 0.10.0
CUDA version: 12.4.0.041
CUDA Driver version: 550.54.14
Operating System + Version:
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2026-06-30"

Hi @sahil.s.jain,
This is a Triton issue, and I would recommend you raise it on the respective forum.

Thanks