TensorRT produces all-zero output for Qwen3-Embedding-0.6B

Description

I downloaded Qwen/Qwen3-Embedding-0.6B from HuggingFace and converted it to a dynamic-shape ONNX model (see the attachment named test_qwen3_embedding.py). Then I used trtexec to convert the ONNX model to a TensorRT engine with the following command:

/usr/src/tensorrt/bin/trtexec --onnx=qwen3_embedding_0.6b.onnx --minShapes=input_ids:1x1,attention_mask:1x1 --optShapes=input_ids:1x1024,attention_mask:1x1024 --maxShapes=input_ids:1x4096,attention_mask:1x4096 --fp16 --saveEngine=qwen3_embedding_0.6b.engine

However, when I run the engine with the TensorRT C++ API and with Triton Server, both output all zeros.

Environment

I tested on an A100 in the NGC container nvcr.io/nvidia/tritonserver:23.07-py3.

TensorRT Version: 8.6.1
GPU Type: dGPU
Nvidia Driver Version: 535.183.01
CUDA Version: 12.1
CUDNN Version:
Operating System + Version: 22.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Steps To Reproduce

  • Pull and run the Docker image nvcr.io/nvidia/tritonserver:23.07-py3.
  • Download Qwen/Qwen3-Embedding-0.6B from HuggingFace.
  • Convert it to an ONNX model with test_qwen3_embedding.txt (4.2 KB), which is a Python script.
  • Convert the ONNX model to a TensorRT engine with the command line above.
  • Serve it with tritonserver.
  • Send an HTTP or gRPC request with tritonclient.

The outputs will be all zeros.
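When checking engine outputs against a reference such as the ONNX model, a small comparison helper makes the failure mode explicit. The following is a minimal sketch in NumPy; the function name `compare_outputs` and the tolerance are assumptions for illustration, not part of the attached script.

```python
import numpy as np


def compare_outputs(reference, candidate, atol=1e-3):
    """Summarize how a candidate output differs from a reference output.

    Hypothetical helper: both arguments are arrays of the same shape,
    e.g. ONNX Runtime output (reference) and TensorRT output (candidate).
    """
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)

    ref_flat = reference.ravel()
    cand_flat = candidate.ravel()
    denom = np.linalg.norm(ref_flat) * np.linalg.norm(cand_flat)
    # Cosine similarity is undefined for a zero (or NaN) vector; report 0.0.
    cosine = float(ref_flat @ cand_flat / denom) if denom > 0 else 0.0

    return {
        "all_zero": bool(not np.any(candidate)),
        "has_nan_or_inf": bool(not np.all(np.isfinite(candidate))),
        "max_abs_diff": float(np.max(np.abs(reference - candidate))),
        "cosine": cosine,
        "close": bool(np.allclose(reference, candidate, atol=atol)),
    }


# Toy data standing in for real model outputs.
ref = np.array([[1.0, 2.0], [3.0, 4.0]])
report_zero = compare_outputs(ref, np.zeros_like(ref))
report_same = compare_outputs(ref, ref.copy())
print(report_zero)
print(report_same)
```

An all-zero engine output shows up as `all_zero=True`, while a merely imprecise engine shows up as a large `max_abs_diff` with a cosine still close to 1.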

I inspected the weight and activation distributions of the original model and found that Qwen3RMSNorm may overflow in FP16, so I built the TensorRT engine without --fp16. The ONNX model contains plenty of Cast nodes to preserve precision; does TensorRT remove them? If so, do I have to build an FP32 engine? However, even though the FP32 engine produces non-zero outputs, they still don't match the outputs of the ONNX model.
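One plausible way an FP16 overflow in RMSNorm can turn into all-zero output is sketched below in NumPy. This is an illustration under assumptions, not a trace of what TensorRT actually executes: the activation values are made up, and `eps=1e-6` is assumed. If the squared activations are computed in FP16, any value above roughly 256 squares to infinity, the mean becomes infinite, and dividing by its square root gives zero for every element.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm evaluated entirely in the dtype of x (no upcast to FP32)."""
    variance = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight


# Illustrative activations: one outlier whose square exceeds the FP16 max (65504).
x32 = np.array([[300.0, 1.0, -2.0, 0.5]], dtype=np.float32)
w32 = np.ones(4, dtype=np.float32)

# Suppress the expected overflow warning from squaring 300 in FP16.
with np.errstate(over="ignore"):
    out_fp16 = rms_norm(x32.astype(np.float16), w32.astype(np.float16))
out_fp32 = rms_norm(x32, w32)

print("fp16:", out_fp16)
print("fp32:", out_fp32)
```

In this toy case the FP16 path returns only zeros, while the FP32 path returns finite, non-zero values.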

I pulled the latest tritonserver image, nvcr.io/nvidia/tritonserver:25.05-py3, which ships the latest TensorRT, v10.0.10, and it works: I now get correct outputs. However, my target platform is Drive Orin, for which only the TensorRT 8.6.1 deb package has been released. My questions are:

  1. What are the differences between v10.0.10 and v8.6.1?
  2. Why doesn't --fp16 work well?

After confirming with an expert, I believe there are bugs in how TensorRT 8.6 handles RMSNorm, affecting both precision and efficiency, even with the FP32 data type. So I implemented an FP16 RMSNorm plugin and replaced all the related ops with a single RMSNorm node. Converting with the --fp16 option then works well.
However, this still doesn't explain why --fp16 doesn't work well in TensorRT 10.0.
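For anyone writing a similar plugin, a NumPy reference is useful for validating its output. The sketch below is not the plugin described above (which is not shown in this thread); it assumes FP16 inputs and outputs with FP32 accumulation inside, and `eps=1e-6` is an assumed value.

```python
import numpy as np


def rms_norm_reference(x_fp16, weight_fp16, eps=1e-6):
    """Reference RMSNorm: FP16 in, FP32 math inside, FP16 out (a sketch)."""
    x = x_fp16.astype(np.float32)
    # Accumulate the mean of squares in FP32 so large activations do not overflow.
    variance = np.mean(x * x, axis=-1, keepdims=True)
    normed = x / np.sqrt(variance + eps)
    return (normed * weight_fp16.astype(np.float32)).astype(np.float16)


# Illustrative input with an outlier that would overflow if squared in FP16.
x = np.array([[300.0, 1.0, -2.0, 0.5]], dtype=np.float16)
w = np.ones(4, dtype=np.float16)
ref_out = rms_norm_reference(x, w)
print(ref_out)
```

The normalized result is small in magnitude, so casting it back to FP16 at the end is safe even though the intermediate sum of squares is far outside the FP16 range.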

Hi @RicardoLu, I hit the same problem. My TRT model runs inference but returns wrong output. It differs from both the raw model and the ONNX model (the ONNX model works well).

My environment:

  • Docker image: nvcr.io/nvidia/deepstream:6.3-triton-multiarch
  • TensorRT: 10.3.0
  • CUDA: 12.6
  • Python: 3.10.12
  • Polygraphy: 0.49.24
  • Transformers: 4.53.1

Here are my steps:

1. I export the ONNX model using optimum-cli:

optimum-cli export onnx --model Qwen/Qwen3-Embedding-0.6B --task feature-extraction --opset 19 \
    models/qwen3_embedding_0.6b_onnx

2. Use trtexec to build the TRT engine (once with --fp16 and once with --best):

trtexec --onnx=models/qwen3_embedding_0.6b_onnx/model.onnx \
    --minShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:1x1024 \
    --optShapes=input_ids:4x1024,attention_mask:4x1024,position_ids:4x1024 \
    --maxShapes=input_ids:8x1024,attention_mask:8x1024,position_ids:8x1024 \
    --fp16 \
    --saveEngine=qwen3_embedding_0.6b.engine
trtexec --onnx=models/qwen3_embedding_0.6b_onnx/model.onnx \
    --minShapes=input_ids:1x1024,attention_mask:1x1024,position_ids:1x1024 \
    --optShapes=input_ids:4x1024,attention_mask:4x1024,position_ids:4x1024 \
    --maxShapes=input_ids:8x1024,attention_mask:8x1024,position_ids:8x1024 \
    --best \
    --saveEngine=model_repository/qwen3_embedding_0.6b/1/qwen3_embedding_0.6b.engine

3. Use a Python script to check the TRT model:


from transformers import AutoTokenizer, AutoModel
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
import time
import numpy as np
import torch
import sys
import os



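def run_tensorrt_polygraphy_model(texts, engine_path):
    """Run the TensorRT engine on `texts` with Polygraphy and print output statistics.

    Note: pool_embeddings and calculate_similarity are helpers not shown in this post.
    """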
    print("\n" + "=" * 50)
    print("RUNNING TENSORRT MODEL ")
    print("=" * 50)
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    try:
        # Load TensorRT engine using Polygraphy
        with open(engine_path, 'rb') as f:
            engine_bytes = f.read()
        engine = EngineFromBytes(engine_bytes)
        
        print(f"TensorRT engine loaded from {engine_path}")
        
        # Create TensorRT runner
        with TrtRunner(engine) as runner:
            # Tokenize inputs
            start_time = time.time()
            inputs = tokenizer(
                texts,
                padding='max_length',
                max_length=1024,
                truncation=True,
                return_tensors="np"
            )
            tokenization_time = time.time() - start_time
            print(f"Tokenization time: {tokenization_time:.4f}s")
            
            # Print input info
            print("Input shapes:")
            for key, value in inputs.items():
                print(f"  {key}: {value.shape}")
            
            # Prepare inputs
            input_ids = inputs['input_ids'].astype(np.int64)
            attention_mask = inputs['attention_mask'].astype(np.int64)
            position_ids = np.arange(1024, dtype=np.int64)[np.newaxis, :].repeat(len(texts), axis=0)
            
            print(f"  position_ids: {position_ids.shape}")
            
            # Prepare input dictionary for TensorRT
            input_dict = {
                'input_ids': input_ids,
                'attention_mask': attention_mask,
                'position_ids': position_ids
            }
            
            # Run inference
            start_time = time.time()
            outputs = runner.infer(input_dict)
            inference_time = time.time() - start_time
            print(f"Inference time: {inference_time:.4f}s")
            
            # Get output (assuming first output is last_hidden_state)
            output_names = list(outputs.keys())
            print(f"Available outputs: {output_names}")
            
            # Get the main output (should be last_hidden_state)
            last_hidden_state = None
            for output_name in output_names:
                if 'last_hidden_state' in output_name.lower() or len(output_names) == 1:
                    last_hidden_state = outputs[output_name]
                    break
            
            if last_hidden_state is None:
                # Take the first output if no clear match
                last_hidden_state = outputs[output_names[0]]
                print(f"Using output: {output_names[0]}")
            
            print(f"TensorRT output shape: {last_hidden_state.shape}")
            print(f"TensorRT output range: [{last_hidden_state.min():.6f}, {last_hidden_state.max():.6f}]")
            print(f"TensorRT output mean: {last_hidden_state.mean():.6f}")
            print(f"TensorRT output std: {last_hidden_state.std():.6f}")
            
            # Pool embeddings
            embeddings_list = pool_embeddings(last_hidden_state, attention_mask)
            embeddings = np.stack(embeddings_list)
            print(f"Embeddings shape after pooling: {embeddings.shape}")
            
            # Calculate similarity
            similarity_scores, normalized_embeddings = calculate_similarity(embeddings)
            
            # Calculate norms
            norms = np.linalg.norm(normalized_embeddings, axis=1)
            print(f"Embedding norms: {norms.tolist()}")
            
            print(f"Similarity scores:")
            print(f"  TensorRT scores: {similarity_scores}")
            
            return {
                'embeddings': normalized_embeddings,
                'raw_outputs': last_hidden_state,
                'inputs': {k: v for k, v in inputs.items()},
                'similarity_scores': similarity_scores,
                'inference_time': inference_time,
                'output_stats': {
                    'min': float(last_hidden_state.min()),
                    'max': float(last_hidden_state.max()),
                    'mean': float(last_hidden_state.mean()),
                    'std': float(last_hidden_state.std())
                }
            }
        
    except Exception as e:
        print(f"TensorRT Polygraphy model failed: {e}")
        import traceback
        traceback.print_exc()
        return None

def main():
    texts = [    
        "What is the capital of China?",
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies toward each other."
    ]
    engine_path = "/qwen3_embedding_0.6b/1/qwen3_embedding_0.6b.engine"
    tensorrt_result = run_tensorrt_polygraphy_model(texts, engine_path)
    
if __name__ == "__main__":
    main() 
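The script above calls pool_embeddings and calculate_similarity, which are not included in the post. A minimal sketch of what they might look like follows. Both bodies are assumptions: last-token pooling is assumed because it is the pooling commonly used with this model family, and the similarity function assumes the first text is the query and the remaining texts are documents.

```python
import numpy as np


def pool_embeddings(last_hidden_state, attention_mask):
    """Last-token pooling: one vector per sequence (hypothetical helper)."""
    pooled = []
    for hidden, mask in zip(last_hidden_state, attention_mask):
        real = np.flatnonzero(mask)
        # Index of the last non-padding token; works for left or right padding.
        last = real[-1] if real.size else len(mask) - 1
        pooled.append(hidden[last])
    return pooled


def calculate_similarity(embeddings, eps=1e-12):
    """L2-normalize rows, then score row 0 (query) against the other rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero when an embedding is all zeros.
    normalized = embeddings / np.maximum(norms, eps)
    scores = normalized[:1] @ normalized[1:].T
    return scores, normalized


# Toy check with a (batch=2, seq=3, hidden=2) tensor.
hidden = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
mask = np.array([[0, 1, 1], [1, 1, 0]])
toy_pooled = pool_embeddings(hidden, mask)
toy_scores, toy_normalized = calculate_similarity(
    np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
)
print(toy_pooled, toy_scores)
```

With these definitions, an all-zero engine output produces embeddings whose norms stay at 0 rather than NaN, so the "Embedding norms" line printed by the script makes the failure easy to spot.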

Hi @nabang1010. Try building an FP32 model with TRT 10+; it should work fine. BTW, TRT 10+ supports BF16, so you can also try that, but as far as I know, BF16 is less efficient than FP16.

@RicardoLu @nabang1010 Can we use TensorRT-LLM to deploy this model?

I tried, but it was too hard. In the end, I used vLLM.

@nabang1010 In vLLM, were you able to deploy the quantized model?
With TensorRT FP16, I am facing the issue of every dimension of the embedding being 0.

Sorry, my target platform is Drive Orin, which doesn't support TRT-LLM.

Hi @nabang1010, after a long search, I found that options like --fp16, --int8, and --bf16 are deprecated and superseded by strong typing. So you can just pass --stronglyTyped instead of --fp16 when building the TensorRT engine; it should work fine with TensorRT 10+.
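As a sketch of that suggestion, the snippet below assembles the trtexec command from earlier in the thread with --stronglyTyped in place of --fp16. The helper name `build_trtexec_command` is hypothetical; only flags that already appear in this thread are used. Note that with strong typing TensorRT takes tensor precisions from the ONNX graph itself, so the resulting precision depends on how the model was exported.

```python
import shlex


def build_trtexec_command(onnx_path, engine_path, shapes,
                          strongly_typed=True, trtexec="trtexec"):
    """Assemble a trtexec argument list (hypothetical helper).

    shapes maps each input name to a (min, opt, max) tuple of shape
    strings such as "1x1024".
    """
    args = [trtexec, f"--onnx={onnx_path}"]
    for flag, idx in (("--minShapes", 0), ("--optShapes", 1), ("--maxShapes", 2)):
        spec = ",".join(f"{name}:{dims[idx]}" for name, dims in shapes.items())
        args.append(f"{flag}={spec}")
    # Strong typing replaces the precision flag rather than being added to it.
    args.append("--stronglyTyped" if strongly_typed else "--fp16")
    args.append(f"--saveEngine={engine_path}")
    return args


shapes = {
    "input_ids": ("1x1", "1x1024", "1x4096"),
    "attention_mask": ("1x1", "1x1024", "1x4096"),
}
command = build_trtexec_command(
    "qwen3_embedding_0.6b.onnx", "qwen3_embedding_0.6b.engine", shapes
)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where trtexec is on the PATH.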