I have recently been trying N:M pruning methods to see what speedup they give for models such as Torchvision ResNet50 and ViT, using the TensorRT Python library. My current setup is:
GPU - A100/H100
Batch Size - 128/256
Pruning Method - ASP
Precision - INT8/FP16, with ImageNet-1k as calibration data
With this setup, the speedups I get for sparse vs. dense TensorRT engines (Sparse_TRT/Dense_TRT) are:
For ResNet50
FINAL COMPARISON (mean latency, TRT unless --skip-tensorrt)
Method     Mean (ms)   p99 (ms)   Throughput   Speedup vs dense
Dense      36.787      37.120     6959.0       1.000x
ASP 2:4    36.520      36.790     7009.8       1.007x
For ViT-B/16
FINAL COMPARISON (mean latency, TRT unless --skip-tensorrt)
Method     Mean (ms)   p99 (ms)   Throughput   Speedup vs dense
Dense      66.534      67.475     3847.6       1.000x
ASP 2:4    55.680      56.360     4597.7       1.195x
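As a sanity check on the tables above, the speedup and throughput columns can be recomputed from the mean latencies. This is only a rough sketch: the batch size of 256 is an assumption (of the two batch sizes listed, it is the one that reproduces the reported throughput if throughput is in images per second), and the helper names are hypothetical.

```python
# Assumption: batch size 256, inferred from the throughput column.
BATCH = 256

# Mean latencies in ms (dense, ASP 2:4), copied from the tables above.
latencies_ms = {
    "ResNet50": (36.787, 36.520),
    "ViT-B/16": (66.534, 55.680),
}


def speedup(dense_ms, sparse_ms):
    """Speedup of the sparse engine relative to the dense one."""
    return dense_ms / sparse_ms


def throughput(batch, latency_ms):
    """Images per second for one batch processed in latency_ms."""
    return batch / (latency_ms / 1000.0)


for model, (dense_ms, sparse_ms) in latencies_ms.items():
    print(
        f"{model}: speedup {speedup(dense_ms, sparse_ms):.3f}x, "
        f"dense {throughput(BATCH, dense_ms):.1f} img/s, "
        f"sparse {throughput(BATCH, sparse_ms):.1f} img/s"
    )
```

The recomputed values agree with the tables to within rounding of the latencies, so the reported speedup is simply the ratio of mean latencies.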
I am not sure whether these numbers are expected or whether the speedup should be higher. I have also tried plain N:M sparsity just to see whether there is any latency gain, and the numbers are almost the same.
Please help me understand whether these results make sense.
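For context, the 2:4 pattern keeps at most 2 non-zero weights in every group of 4 consecutive weights. The following is a minimal pure-Python sketch of a magnitude-based N:M mask and a pattern check on a flat list of weights. It is not ASP's actual implementation, and the function names `nm_mask` and `is_nm_sparse` are hypothetical.

```python
def nm_mask(row, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude values in each group of m."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |w| in this group (ties broken by position).
        keep = sorted(range(m), key=lambda i: -abs(group[i]))[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def is_nm_sparse(row, n=2, m=4):
    """Check that every group of m values has at most n non-zeros."""
    return all(
        sum(1 for w in row[s:s + m] if w != 0) <= n
        for s in range(0, len(row), m)
    )


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
mask = nm_mask(weights)
pruned = [w * k for w, k in zip(weights, mask)]

print(mask)
print(is_nm_sparse(weights), is_nm_sparse(pruned))
```

A check like `is_nm_sparse` could be run over each layer's weights before export to confirm that the pruned model really carries the pattern.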
Regards