Structured 2:4 sparsity does not improve inference performance on Jetson Orin

Description

Running trtexec with --int8 and --sparsity=enable (or force) on a YOLOv7 ONNX model that was 2:4 pruned gives the same latency as without sparsity, ~22 ms. The Jetson Orin specs list 137 TOPS dense vs. 275 TOPS sparse, i.e. roughly 2x, so it is not clear how to actually get a speedup from the sparse model?

Environment

TensorRT Version: 8.5.2
GPU Type: Tegra
Nvidia Driver Version: JetPack 5.1.1
CUDA Version: 11.4
CUDNN Version: 8.6.0
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.8
PyTorch Version (if applicable): 2.0.0
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

Steps To Reproduce

git clone https://github.com/WongKinYiu/yolov7
cd yolov7
wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6.pt
gedit export.py

Between L47 and L48 of export.py, insert:

# Zero every other weight element along the last dimension, so any four
# consecutive elements contain two zeros (crude pruning for a latency test).
for name, module in model.named_modules():
    if hasattr(module, 'weight') and module.weight is not None:
        with torch.no_grad():
            if len(module.weight.shape) == 2:
                module.weight[:, ::2] = 0
            elif len(module.weight.shape) == 3:
                module.weight[:, :, ::2] = 0
            elif len(module.weight.shape) == 4:
                module.weight[:, :, :, ::2] = 0
            elif len(module.weight.shape) == 5:
                module.weight[:, :, :, :, ::2] = 0
    if hasattr(module, 'bias') and module.bias is not None:
        with torch.no_grad():
            module.bias[::2] = 0
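As a sanity check (an extra step, not needed for the repro): if I read the TensorRT docs correctly, the 2:4 pattern for convolutions is required along the input-channel (C) dimension, not the last (kernel-width) dimension that the snippet above zeroes, so it may be worth verifying the pattern per layer before exporting. A minimal sketch:

import torch

def is_2to4(w, dim):
    # Move the checked dimension last, then inspect consecutive groups of 4.
    w = w.detach().movedim(dim, -1)
    n = w.shape[-1] - w.shape[-1] % 4   # ignore a ragged tail (e.g. C not divisible by 4)
    if n == 0:
        return True
    groups = w[..., :n].reshape(*w.shape[:-1], -1, 4)
    return bool(((groups == 0).sum(dim=-1) >= 2).all())

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        # Conv2d weights are KCRS, Linear weights are (out, in): dim 1 is the input dim.
        print(name, is_2to4(module.weight, dim=1))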

python export.py --weights yolov7-w6.pt --grid --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 1280 1280 --max-wh 1280

Set the power mode to 50 W (e.g., via nvpmodel), then lock the clocks:
sudo jetson_clocks

trtexec --onnx=yolov7-w6.onnx --saveEngine=yolov7-w6.trt --int8
Latency: mean = 22 ms
trtexec --onnx=yolov7-w6.onnx --saveEngine=yolov7-w6-0.trt --int8 --sparsity=force
Latency: mean = 22 ms
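To see whether sparse tensor-core kernels are actually selected, one option (not part of the runs above) is to rebuild with --verbose and filter the build log; TensorRT 8.x should report which layers were eligible for sparse math and which ones actually picked a sparse implementation:

trtexec --onnx=yolov7-w6.onnx --int8 --sparsity=force --verbose 2>&1 | grep -i sparsity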

Hi,

This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we will move this post to the Jetson-related forum.

Thanks!

Hi,

Not sure these links are directly related; there are also a couple of other posts about dGPUs and TensorRT 2:4 sparsity. Specifically, this topic is about the difference between TensorRT without and with sparsity, which does not seem to make any difference in latency. Based on the Jetson Orin and other dGPU specs, 2:4 sparsity should improve dense TOPS by 100%, but this is not evident from the trtexec runs?

Thanks.

Hi @AakankshaS,

As mentioned, this looks like a TensorRT issue, not just a Jetson one, so this post is better moved back to TensorRT - NVIDIA Developer Forums.
Executing on an A5000 with TRT 8.5.0 exhibits similar behavior: the difference between the dense and 2:4-sparse files is minimal, and forcing sparsity on the sparse file makes no difference. Is there a contemporary object detection example that demonstrates the TRT 2:4 advantage? Thanks.

Dense file:
trtexec --onnx=yolov7-w6.onnx --saveEngine=yolov7-w6.trt --int8
Latency: mean = 4.73 ms

Sparse file:
trtexec --onnx=yolov7-w6-0.onnx --saveEngine=yolov7-w6-0.trt --int8
Latency: mean = 4.44 ms

Sparse file, sparsity forced:
trtexec --onnx=yolov7-w6-0.onnx --saveEngine=yolov7-w6-00.trt --int8 --sparsity=force
Latency: mean = 4.41 ms

Hi,
Can this be moved back to TensorRT - NVIDIA Developer Forums, since it is not specific to Jetson Orin, so it can get further assistance? Thanks.

Is there a TensorRT sample that demonstrates the claimed 2x acceleration of 2:4 sparsity over dense? Thanks.

Using apex (ASP) 2:4 structured sparsity during training, the latency stays the same. Is there any object detection model that demonstrates a 2x acceleration with TensorRT?
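For context, the standard apex ASP recipe is roughly the following (a sketch with a toy model standing in for the real network; the actual training setup differs):

import torch
from apex.contrib.sparsity import ASP

# Toy stand-in for the trained dense model (sketch only).
model = torch.nn.Sequential(torch.nn.Conv2d(64, 64, 3, padding=1)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Prunes eligible weights to 2:4 and installs masks so that subsequent
# fine-tuning steps keep the pattern.
ASP.prune_trained_model(model, optimizer)

# ... fine-tune, then re-export to ONNX and rebuild the engine as above ...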