But after downloading the dense and sparse models with:
ngc registry model download-version nvidia/resnext101_32x8d_sparse_onnx:1
ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1
Surprisingly, the two downloaded ONNX models have exactly the same file size:
354782502 Dec 20 17:57 resnext101_32x8d_pyt_torchvision_sparse.onnx
354782502 Dec 20 18:00 resnext101_32x8d_pyt_torchvision_dense.onnx
I expected the sparse model to be smaller, but both files are exactly the same size. Could this be the reason people reported no performance difference?
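One way to confirm the sparse weights really are there is to check whether each weight tensor follows the 2:4 pattern (at least two zeros in every group of four consecutive values). The sketch below shows the check on synthetic numpy arrays; in practice you would extract each initializer from the ONNX file with `onnx.numpy_helper.to_array` and run the same function over it. The array names here are illustrative, not taken from the actual model.

```python
import numpy as np

def is_24_sparse(w: np.ndarray) -> bool:
    """True if every group of 4 consecutive values contains at least 2 zeros."""
    flat = w.reshape(-1)
    pad = (-flat.size) % 4
    if pad:  # pad the tail so it divides evenly into groups of 4
        flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    groups = flat.reshape(-1, 4)
    return bool(((groups == 0).sum(axis=1) >= 2).all())

rng = np.random.default_rng(0)
dense = rng.standard_normal((8, 16)).astype(np.float32)
sparse = dense.copy()
view = sparse.reshape(-1, 4)
view[:, :2] = 0.0  # zero out 2 of every 4 weights -> a 2:4 pattern

print(is_24_sparse(dense), is_24_sparse(sparse))  # False True
```

If the "sparse" ONNX file fails this check on its convolution weights, TensorRT has nothing to exploit and identical performance would be expected.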
Have you tried running inference with TensorRT (e.g., trtexec)?
The sparse model likely stores its zero-valued weights explicitly, in the same data layout as the dense model, so the file sizes come out identical.
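This matches how 2:4 structured sparsity is typically serialized: the pruned weights are still written out as ordinary zero-valued floats, so the blob does not shrink. A minimal numpy sketch of the effect:

```python
import numpy as np

# A dense weight tensor and a 2:4-pruned copy of it.
dense = np.random.randn(64, 64).astype(np.float32)
sparse = dense.copy()
groups = sparse.reshape(-1, 4)
drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest magnitudes per group
np.put_along_axis(groups, drop, 0.0, axis=1)      # prune them to zero

# The zeros are serialized explicitly, so both blobs are the same size.
print(len(dense.tobytes()), len(sparse.tobytes()))  # 16384 16384
```

The size advantage of 2:4 sparsity only appears once the runtime repacks the weights into a compressed format (values plus metadata indices), which TensorRT does at engine-build time rather than in the ONNX file.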
A follow-up question here: for the resnext101 model at least, we observed about a 25% performance improvement on Jetson AGX Orin when the model runs on the GPU. What about DLA? Is there any performance improvement with sparsity enabled when the model runs on DLA?
Although DLA can increase inference throughput, it supports only a limited set of layer types.
If a model has to fall back to the GPU frequently, the data-transfer overhead may slow down overall performance.
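To measure this yourself, a trtexec run targeting the DLA can be compared against a plain GPU run. The invocation below is a sketch, assuming the sparse ONNX file from above is in the current directory; adjust the path and DLA core index for your setup.

```shell
# Hypothetical benchmark run; requires a Jetson with TensorRT installed.
# --useDLACore selects the DLA engine, --allowGPUFallback lets unsupported
# layers run on the GPU, and --sparsity=enable turns on 2:4 sparse kernels.
trtexec --onnx=resnext101_32x8d_pyt_torchvision_sparse.onnx \
        --useDLACore=0 --allowGPUFallback --fp16 --sparsity=enable
```

The trtexec log reports which layers ran on the DLA versus the GPU; a large number of fallback layers is a sign that transfer overhead, not compute, dominates the runtime.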