Please provide the following information when requesting support.
Hardware - GPU: T4
Hardware - CPU: 4
Operating System: Ubuntu
Riva Version: 2.19
Hello. I'm trying to get whisper-large to work in Riva 2.19 on an NVIDIA T4 GPU, but riva_init.sh fails at model deployment time with the error shown below.
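For context, this is roughly how the deployment is being driven (a minimal sketch of the standard Riva quickstart flow; the whisper model string is my assumption about the option name, not copied verbatim from my config.sh):

# relevant quickstart config.sh settings
service_enabled_asr=true
asr_acoustic_model=("whisper_large_v3_multi")  # assumed option name for whisper-large-v3 multilingual

# then, from the quickstart directory:
bash riva_init.sh  # downloads the RMIR and deploys it -- this is the step that fails

The riva_init.sh run produces the following log: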
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
2025-07-11 09:08:21,786 [INFO] Writing Riva model repository to '/data/models'...
2025-07-11 09:08:21,786 [INFO] The riva model repo target directory is /data/models
2025-07-11 09:08:56,601 [INFO] Using obey-precision pass with fp16 TRT
2025-07-11 09:08:56,602 [INFO] Extract_binaries for nn -> /data/models/riva-trt-whisper-large-v3-multi-asr-offline-am-streaming-offline/1
2025-07-11 09:08:56,602 [INFO] extracting {'ckpt': ('trtllm.whisper', 'model_weights.ckpt'), 'tokenizer_model': ('trtllm.whisper', 'multilingual.tiktoken')} -> /data/models/riva-trt-whisper-large-v3-multi-asr-offline-am-streaming-offline/1
2025-07-11 09:10:29,170 [WARNING] [TRT-LLM] [W] Option --gather_generation_logits is deprecated, a build flag is not required anymore. Use --output_generation_logits at runtime instead.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set bert_attention_plugin to float16.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set gemm_plugin to None.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set nccl_plugin to auto.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set lora_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set moe_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set context_fmha to True.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set remove_input_padding to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set reduce_fusion to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set user_buffer to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set tokens_per_block to 32.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set use_paged_context_fmha to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set multiple_profiles to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set paged_state to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set streamingllm to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set use_fused_mlp to True.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set pp_reduce_scatter to False.
2025-07-11 09:10:29,308 [WARNING] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
2025-07-11 09:10:29,308 [WARNING] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
2025-07-11 09:10:29,308 [WARNING] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
2025-07-11 09:10:29,311 [INFO] [TRT-LLM] [I] Compute capability: (7, 5)
2025-07-11 09:10:29,318 [INFO] [TRT-LLM] [I] SM count: 40
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] SM clock: 1590 MHz
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] int4 TFLOPS: 260
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] int8 TFLOPS: 130
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] fp8 TFLOPS: 0
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] float16 TFLOPS: 65
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] bfloat16 TFLOPS: 0
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] float32 TFLOPS: 8
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] Total Memory: 15 GiB
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] Memory clock: 5001 MHz
2025-07-11 09:10:29,322 [INFO] [TRT-LLM] [I] Memory bus width: 256
2025-07-11 09:10:29,322 [INFO] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
2025-07-11 09:10:29,323 [INFO] [TRT-LLM] [I] PCIe speed: 2500 Mbps
2025-07-11 09:10:29,323 [INFO] [TRT-LLM] [I] PCIe link width: 8
2025-07-11 09:10:29,323 [INFO] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
2025-07-11 09:10:29,572 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,573 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,573 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,573 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,574 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
2025-07-11 09:10:29,696 [INFO] [TRT-LLM] [I] Set dtype to float16.
2025-07-11 09:10:29,709 [INFO] [TRT-LLM] [I] Set paged_state to False.
2025-07-11 09:10:29,709 [WARNING] [TRT-LLM] [W] max_seq_len 3000 is larger than max_position_embeddings 1500 * rotary scaling 1, the model accuracy might be affected
2025-07-11 09:10:29,710 [WARNING] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[07/11/2025-09:10:29] [TRT] [I] [MemUsageChange] Init CUDA: CPU -15, GPU +0, now: CPU 3268, GPU 103 (MiB)
[07/11/2025-09:10:35] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1141, GPU +194, now: CPU 4518, GPU 297 (MiB)
2025-07-11 09:10:35,449 [INFO] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
2025-07-11 09:10:35,449 [INFO] [TRT-LLM] [I] Set nccl_plugin to None.
[07/11/2025-09:10:35] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/conv1/conv1d_L3471/CONVOLUTION_0: IConvolutionLayer `input` and `kernel` must be of same type. `input` type is Float but `kernel` is of type Half.)
2025-07-11 09:10:35,609 [ERROR] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/cli/deploy.py", line 103, in deploy_from_rmir
    generator.serialize_to_disk(
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/triton.py", line 497, in serialize_to_disk
    module.serialize_to_disk(repo_dir, rmir, config_only, verbose, overwrite, gpus)
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/triton.py", line 342, in serialize_to_disk
    self.update_binary(version_dir, rmir, verbose)
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/asr.py", line 186, in update_binary
    whisper_build(
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/trtllm_whisper.py", line 670, in run_build
    generate_engine(build_args)
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/trtllm_whisper.py", line 314, in generate_engine
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir, workers, args.log_level, model_cls, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 419, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 384, in build_and_save
    engine = build_model(build_config,
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 377, in build_model
    return build(model, build_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/builder.py", line 1260, in build
    model(**inputs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1971, in forward
    x = self.conv1(input_features)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/layers/conv.py", line 212, in forward
    return conv1d(input, self.weight.value,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/functional.py", line 3471, in conv1d
    output_2d = _create_tensor(layer.get_output(0), layer)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/functional.py", line 614, in _create_tensor
    assert trt_tensor.shape.__len__(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: tensor WhisperEncoder/conv1/conv1d_L3471/CONVOLUTION_0_output_0 has an invalid shape
The same script works fine on L4 GPUs. On the T4, the build log reports compute capability (7, 5) with bfloat16 TFLOPS: 0, and the failing layer is the encoder's conv1d, whose input is Float while its kernel is Half, so this looks like a dtype mismatch specific to Turing. What is the root cause, and how can I fix it?
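For comparison, this is how I checked the two machines (the nvidia-smi compute_cap query field requires a reasonably recent driver; the T4 is Turing, compute capability 7.5 with no bfloat16 support, while the L4 is Ada Lovelace, 8.9, which supports bfloat16):

# print GPU name and compute capability on each host
nvidia-smi --query-gpu=name,compute_cap --format=csv
# the T4 host reports 7.5; the L4 host reports 8.9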