TensorRT generating many small kernels from same ONNX on A6000 vs RTX5000


When deploying an open-source ONNX model of YOLOv4 (trained on COCO) on an RTX A6000 (Ampere), Nsight shows TensorRT launching many tiny kernels with low SM utilization compared to what we see on the Quadro RTX 5000. Our theory is that the TensorRT optimizer is failing to perform appropriate kernel fusion.


GPU Type: RTX A6000 (Ampere) vs Quadro RTX 5000 (Turing)

The architectural differences between Ampere (RTX A6000) and Turing (Quadro RTX 5000) affect how TensorRT selects and fuses kernels during engine build. TensorRT engines are tuned per-architecture and must be rebuilt on each target GPU, so a different kernel breakdown between the two cards is expected.

Make sure you are running the most recent version of TensorRT. Newer releases frequently improve kernel generation and layer fusion, and architecture-specific tactics for Ampere have been added over successive versions.
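To see whether fusion is actually happening, you can rebuild the engine on the A6000 with verbose logging and inspect the build log. A minimal sketch, assuming the TensorRT Python bindings and `trtexec` are installed; `yolov4.onnx` is a placeholder path for your model:

```shell
# Print the installed TensorRT version (assumes the Python bindings are present)
python3 -c "import tensorrt; print(tensorrt.__version__)"

# Rebuild the engine directly on the A6000 with verbose logging;
# the verbose build log reports which layers TensorRT merged into single kernels.
trtexec --onnx=yolov4.onnx --fp16 --verbose 2>&1 | tee build.log

# Search the log for fusion-related messages to compare against the RTX 5000 build
grep -i "fus" build.log
```

Comparing the fusion messages from the two GPUs' build logs should show whether the A6000 build is genuinely skipping fusions or simply choosing a different (but still fused) set of tactics.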

Please refer to the TensorRT developer guide for more information.