TensorRT Engine Outputs Identical Values for All Keypoints in ViT-based Model

Hardware Platform: GPU
DeepStream Version: deepstream-app version 7.0.0
DeepStreamSDK 7.0.0
JetPack Version: N/A (GPU Platform)
TensorRT Version: 8.6.1 (Python bindings); unknown for the trtexec binary
NVIDIA GPU Driver Version: 565.77
Issue Type: Bug - TensorRT engine produces identical outputs for all keypoints

Problem Description

I have a Vision Transformer (ViT) based keypoint detection model that works correctly in PyTorch and ONNX Runtime, but after conversion to a TensorRT engine, all 7 keypoints produce identical coordinate values instead of the expected 7 distinct keypoint locations.

Model Details

  • Architecture: DualHeadViTPose (ViT backbone with dual heads for coordinates and visibility)
  • Input: 1x3x512x512 (FP32/FP16)
  • Output: 1x7x2 (coordinates only, expecting 7 different keypoint locations)
  • ONNX Opset: 14
  • PyTorch Version: 2.4.1

Hardware Setup

  • GPU: NVIDIA GeForce RTX 3080
  • GPU Memory: 10240 MB
  • Driver: 565.77
  • CUDA: 12.1 (PyTorch)
  • OS: Ubuntu 22.04.5 LTS

TensorRT Build Command Used

trtexec --onnx=keypoint_coords_only.onnx \
        --saveEngine=keypoint_model.engine \
        --fp16 \
        --workspace=4096 \
        --hardwareCompatibilityLevel=ampere+ \
        --minShapes=input:1x3x512x512 \
        --optShapes=input:1x3x512x512 \
        --maxShapes=input:4x3x512x512 \
        --inputIOFormats=fp16:chw \
        --outputIOFormats=fp16:chw \
        --builderOptimizationLevel=5

Build Warnings Observed

  • 108 weights affected by subnormal FP16 values
  • 48 weights below FP16 minimum subnormal value
  • “Running layernorm after self-attention in FP16 may cause overflow”
  • External tactic sources disabled due to hardware compatibility mode
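The FP16 subnormal and overflow warnings above are easy to reproduce in isolation. The following is just a numpy sketch of the FP16 numeric range; the weight values are illustrative, not taken from the actual model:

```python
import numpy as np

fp16 = np.finfo(np.float16)
print(fp16.smallest_normal)     # ~6.10e-05: weights below this become subnormal
print(fp16.smallest_subnormal)  # ~5.96e-08: values below ~half this flush to zero
print(fp16.max)                 # 65504: intermediates above this overflow to inf

# Illustrative weights (not from the real model): the tiniest one is lost in FP16
w = np.array([1e-3, 1e-5, 1e-8], dtype=np.float32)
print(w.astype(np.float16))     # last entry flushes to exactly 0.0

# A layernorm/attention intermediate past 65504 overflows, matching the
# "layernorm after self-attention in FP16 may cause overflow" warning
print(np.float16(70000.0))      # inf
```

So the "weights below FP16 minimum subnormal value" warning means those 48 weights are silently zeroed in the FP16 engine.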

Build Results

  • Build Time: 276.42 seconds
  • Engine Size: 171 MiB
  • Performance: 4.045ms mean latency, 252 QPS throughput
  • Status: Build completes successfully with warnings

Issue Details

Expected Behavior (PyTorch/ONNX):

# 7 different keypoint coordinates
[[x1, y1], [x2, y2], [x3, y3], [x4, y4], [x5, y5], [x6, y6], [x7, y7]]
# Example: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], ...]

Actual Behavior (TensorRT):

# All keypoints have identical coordinates
[[x, y], [x, y], [x, y], [x, y], [x, y], [x, y], [x, y]]
# Example: [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5], ...]

Code for Reproduction

# TensorRT inference code (inside our DeepStream probe;
# get_tensor_meta_layer / tensor_meta_layer_to_numpy are our own helpers)
output_layer = get_tensor_meta_layer(tensor_meta, "output")
output_array = tensor_meta_layer_to_numpy(output_layer)
print(output_array.shape)  # (1, 7, 2) - correct shape
print(output_array[0])     # All 7 keypoints have identical [x, y] values
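To make the failure check reproducible outside DeepStream, here is a small numpy sketch of the collapse test; the arrays below are illustrative values, not actual model outputs:

```python
import numpy as np

def keypoints_collapsed(out, tol=1e-6):
    """True if every keypoint equals the first keypoint in each batch item."""
    # out has shape (batch, num_keypoints, 2)
    return bool(np.all(np.abs(out - out[:, :1, :]) < tol))

# What PyTorch/ONNX Runtime return: 7 distinct keypoints (illustrative values)
diverse = np.array([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8],
                     [0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]])
# What the TensorRT engine returns: one coordinate repeated 7 times
collapsed = np.full((1, 7, 2), 0.5)

print(keypoints_collapsed(diverse))    # False
print(keypoints_collapsed(collapsed))  # True
```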

Detailed Investigation

  1. Shape Verification: Output tensor has correct shape (1, 7, 2)
  2. Value Analysis: All 7 keypoints output identical coordinate pairs
  3. Input Validation: Same input produces diverse keypoints in PyTorch/ONNX
  4. Model Architecture: Uses learnable keypoint queries and cross-attention
  5. ONNX Verification: ONNX model produces expected diverse keypoint outputs
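One hypothesis consistent with point 4 and the FP16 build warnings: if the cross-attention logits of the learnable keypoint queries differ by less than FP16 resolution, the softmax rows become identical and every query attends to the same features, collapsing all keypoints to one location. A speculative numpy sketch with made-up logits (not the model's actual values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Two keypoint queries whose attention logits differ only at the 1e-4 level
# (hypothetical values; FP16 resolution near 5.0 is ~0.004)
logits = np.array([[5.0001, 4.9998, 5.0003],
                   [5.0004, 5.0002, 4.9997]], dtype=np.float32)

attn_fp32 = softmax(logits)
attn_fp16 = softmax(logits.astype(np.float16).astype(np.float32))

print(np.allclose(attn_fp32[0], attn_fp32[1]))  # False: queries stay distinct
print(np.allclose(attn_fp16[0], attn_fp16[1]))  # True: both rows collapse
```

If this is what is happening, rebuilding without --fp16 (or pinning the attention/layernorm layers to FP32) should restore distinct keypoints.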

I'm moving this to the TensorRT forum for better support, thanks.
