Please provide the following information when requesting support.
Hardware - GPU: T4
Hardware - CPU: 4
Operating System: Ubuntu
Riva Version: 2.19
Hello. I'm trying to get whisper-large to work in Riva 2.19 on an NVIDIA T4 GPU, but riva_init.sh fails at model deployment time with the error shown below.
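For context, this is roughly how the deployment is being driven (a minimal sketch of the standard Riva quickstart flow; the whisper model string is my assumption about the option name, not copied verbatim from my config.sh):

# relevant quickstart config.sh settings
service_enabled_asr=true
asr_acoustic_model=("whisper_large_v3_multi")  # assumed option name for whisper-large-v3 multilingual

# then, from the quickstart directory:
bash riva_init.sh  # downloads the RMIR and deploys it -- this is the step that fails

The riva_init.sh run produces the following log: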
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
2025-07-11 09:08:21,786 [INFO] Writing Riva model repository to '/data/models'...
2025-07-11 09:08:21,786 [INFO] The riva model repo target directory is /data/models
2025-07-11 09:08:56,601 [INFO] Using obey-precision pass with fp16 TRT
2025-07-11 09:08:56,602 [INFO] Extract_binaries for nn -> /data/models/riva-trt-whisper-large-v3-multi-asr-offline-am-streaming-offline/1
2025-07-11 09:08:56,602 [INFO] extracting {'ckpt': ('trtllm.whisper', 'model_weights.ckpt'), 'tokenizer_model': ('trtllm.whisper', 'multilingual.tiktoken')} -> /data/models/riva-trt-whisper-large-v3-multi-asr-offline-am-streaming-offline/1
2025-07-11 09:10:29,170 [WARNING] [TRT-LLM] [W] Option --gather_generation_logits is deprecated, a build flag is not required anymore. Use --output_generation_logits at runtime instead.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set bert_attention_plugin to float16.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set gemm_plugin to None.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set nccl_plugin to auto.
2025-07-11 09:10:29,180 [INFO] [TRT-LLM] [I] Set lora_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set moe_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set context_fmha to True.
2025-07-11 09:10:29,181 [INFO] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set remove_input_padding to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set reduce_fusion to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set user_buffer to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set tokens_per_block to 32.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set use_paged_context_fmha to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set multiple_profiles to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set paged_state to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set streamingllm to False.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set use_fused_mlp to True.
2025-07-11 09:10:29,182 [INFO] [TRT-LLM] [I] Set pp_reduce_scatter to False.
2025-07-11 09:10:29,308 [WARNING] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
2025-07-11 09:10:29,308 [WARNING] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
2025-07-11 09:10:29,308 [WARNING] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
2025-07-11 09:10:29,311 [INFO] [TRT-LLM] [I] Compute capability: (7, 5)
2025-07-11 09:10:29,318 [INFO] [TRT-LLM] [I] SM count: 40
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] SM clock: 1590 MHz
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] int4 TFLOPS: 260
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] int8 TFLOPS: 130
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] fp8 TFLOPS: 0
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] float16 TFLOPS: 65
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] bfloat16 TFLOPS: 0
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] float32 TFLOPS: 8
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] Total Memory: 15 GiB
2025-07-11 09:10:29,319 [INFO] [TRT-LLM] [I] Memory clock: 5001 MHz
2025-07-11 09:10:29,322 [INFO] [TRT-LLM] [I] Memory bus width: 256
2025-07-11 09:10:29,322 [INFO] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
2025-07-11 09:10:29,323 [INFO] [TRT-LLM] [I] PCIe speed: 2500 Mbps
2025-07-11 09:10:29,323 [INFO] [TRT-LLM] [I] PCIe link width: 8
2025-07-11 09:10:29,323 [INFO] [TRT-LLM] [I] PCIe bandwidth: 2 GB/s
2025-07-11 09:10:29,572 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,573 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,573 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,573 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.FLOAT but set to DataType.HALF
2025-07-11 09:10:29,574 [WARNING] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
2025-07-11 09:10:29,696 [INFO] [TRT-LLM] [I] Set dtype to float16.
2025-07-11 09:10:29,709 [INFO] [TRT-LLM] [I] Set paged_state to False.
2025-07-11 09:10:29,709 [WARNING] [TRT-LLM] [W] max_seq_len 3000 is larger than max_position_embeddings 1500 * rotary scaling 1, the model accuracy might be affected
2025-07-11 09:10:29,710 [WARNING] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[07/11/2025-09:10:29] [TRT] [I] [MemUsageChange] Init CUDA: CPU -15, GPU +0, now: CPU 3268, GPU 103 (MiB)
[07/11/2025-09:10:35] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1141, GPU +194, now: CPU 4518, GPU 297 (MiB)
2025-07-11 09:10:35,449 [INFO] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
2025-07-11 09:10:35,449 [INFO] [TRT-LLM] [I] Set nccl_plugin to None.
[07/11/2025-09:10:35] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/conv1/conv1d_L3471/CONVOLUTION_0: IConvolutionLayer `input` and `kernel` must be of same type. `input` type is Float but `kernel` is of type Half.)
2025-07-11 09:10:35,609 [ERROR] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/cli/deploy.py", line 103, in deploy_from_rmir
    generator.serialize_to_disk(
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/triton.py", line 497, in serialize_to_disk
    module.serialize_to_disk(repo_dir, rmir, config_only, verbose, overwrite, gpus)
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/triton.py", line 342, in serialize_to_disk
    self.update_binary(version_dir, rmir, verbose)
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/asr.py", line 186, in update_binary
    whisper_build(
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/trtllm_whisper.py", line 670, in run_build
    generate_engine(build_args)
  File "/usr/local/lib/python3.12/dist-packages/servicemaker/triton/trtllm_whisper.py", line 314, in generate_engine
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir, workers, args.log_level, model_cls, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 419, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 384, in build_and_save
    engine = build_model(build_config,
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/build.py", line 377, in build_model
    return build(model, build_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/builder.py", line 1260, in build
    model(**inputs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1971, in forward
    x = self.conv1(input_features)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/layers/conv.py", line 212, in forward
    return conv1d(input, self.weight.value,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/functional.py", line 3471, in conv1d
    output_2d = _create_tensor(layer.get_output(0), layer)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/functional.py", line 614, in _create_tensor
    assert trt_tensor.shape.__len__(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: tensor WhisperEncoder/conv1/conv1d_L3471/CONVOLUTION_0_output_0 has an invalid shape
The same script works fine on L4 GPUs. On the T4, the build log reports compute capability (7, 5) with bfloat16 TFLOPS: 0, and the failing layer is the encoder's conv1d, whose input is Float while its kernel is Half, so this looks like a dtype mismatch specific to Turing. What is the root cause, and how can I fix it?
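For comparison, this is how I checked the two machines (the nvidia-smi compute_cap query field requires a reasonably recent driver; the T4 is Turing, compute capability 7.5 with no bfloat16 support, while the L4 is Ada Lovelace, 8.9, which supports bfloat16):

# print GPU name and compute capability on each host
nvidia-smi --query-gpu=name,compute_cap --format=csv
# the T4 host reports 7.5; the L4 host reports 8.9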