Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

Originally published at: Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Developer Blog

TensorRT is an SDK for high-performance deep learning inference, and TensorRT 8.0 introduces support for sparsity that uses the sparse Tensor Cores on NVIDIA Ampere GPUs. It can accelerate networks by skipping the computation of zeros in the GEMM operations of neural networks. You get a performance gain compared to dense networks just by following the steps in this post.

Hi,

I tested the resnext101_32x8d_sparse_onnx_v1 and resnext101_32x8d_dense_onnx_v1 models with trtexec as the blog describes.
The sparse result:
[07/28/2021-08:45:44] [I] Throughput: 501.618 qps
[07/28/2021-08:45:44] [I] Latency: min = 1.81982 ms, max = 3.1843 ms, mean = 2.05464 ms, median = 2.02905 ms, percentile(99%) = 2.29761 ms

The dense result:
[07/28/2021-08:48:35] [I] Throughput: 498.367 qps
[07/28/2021-08:48:35] [I] Latency: min = 1.78882 ms, max = 2.14081 ms, mean = 2.06509 ms, median = 2.06612 ms, percentile(99%) = 2.10205 ms

I also used nsys to profile the kernel trace, but I fail to see any difference between the kernels used by the sparse model and those used by the dense model…

I am running the experiment on a 3090 with the nvcr.io/nvidia/pytorch:21.07-py3 Docker image. Any idea?

Thx

Hello @leiwen, thanks for your comment. There was a mistake in the code snippet: the ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1 command downloads the dense model, NOT the sparse model. You can change this to ngc registry model download-version nvidia/resnext101_32x8d_sparse_onnx:1 and then run the trtexec command with sparsity enabled when exporting the ONNX model to a TRT engine. Please let me know if that works, thanks!


Hi @asawarkar ,

I re-downloaded the ONNX files, and here are the md5sums of the two files:
c962aeafd8a7000f3c72bbfcd2165572 resnext101_32x8d_sparse_onnx_v1/resnext101_32x8d_pyt_torchvision_sparse.onnx
49beb2920f6f6e42eb20b874a30eab98 resnext101_32x8d_dense_onnx_v1/resnext101_32x8d_pyt_torchvision_dense.onnx

But I still cannot see any performance improvement for the sparse ONNX model.

When TRT builds the sparse one, it prints the message below:
[08/07/2021-09:14:02] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_221 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237
[08/07/2021-09:14:02] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_224 + Relu_225, Conv_231 + Relu_232

I assumed that after enabling structured sparsity, it would gain at least a 2x speedup over the non-sparse kernels.
But in the nsys profile, no big improvement is seen.
Could you list the performance you expect resnext101_32x8d_pyt_torchvision_sparse.onnx to reach on a 3090 platform with sparsity turned on versus off? And does the 2x speedup assumption hold for this case?

Thx,
Lei

Hi @leiwen

The assumption of double the performance due to structured sparsity is incorrect. We don't have numbers for the 3090, but on A100 the end-to-end performance gain for ResNeXt101 32x8d should be in the range of 1% to 8% in INT8. If FP16 is used, the sparse-vs-dense perf gap is larger.

I think the performance comparison should take a single kernel as the example. In my previous experience, when switching from FP16 to INT8, a convolution of the same shape would be accelerated to up to twice its original speed.

As the article also mentions, on Ampere dense INT8 delivers 624 TOPS while sparse delivers 1248 TOPS, so I think that if the kernel is implemented correctly, its performance should also show a 2x speedup?

Thx
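To see why a 2x peak-math rate need not translate into a 2x network speedup, here is a minimal Amdahl's-law sketch. The time fractions below are hypothetical, for illustration only; only the portion of runtime actually spent in sparse-eligible math can be accelerated:

```python
# Amdahl's law: if only `sparse_fraction` of end-to-end runtime benefits
# from the 2x sparse math rate, the overall speedup is much smaller.
def end_to_end_speedup(sparse_fraction, kernel_speedup=2.0):
    """Overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - sparse_fraction) + sparse_fraction / kernel_speedup)

# Hypothetical fractions of time spent in sparse-eligible math:
for frac in (0.1, 0.3, 0.5):
    print(f"{frac:.0%} sparse-eligible -> {end_to_end_speedup(frac):.2f}x end to end")
```

Even at 50% of runtime in sparse-eligible math, the end-to-end gain is only about 1.33x; memory-bound layers and layers where TRT picks a dense tactic shrink it further.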

@asawarkar I am planning to test sparsity gains for BERT models and want to know whether a bert-large ONNX model is also available in the NGC registry, like the ResNeXt model below:
ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1

Hi,

As described in the slides, I used the following script to prune the weights of a pretrained resnet50 model, but it has been more than 24 hours and the pruning process still hasn't completed. Could someone help with this?

Script used:
import torch
import torchvision
import torch.optim as optim
from torchvision import datasets, transforms, models

try:
    from apex.contrib.sparsity import ASP
except ImportError:
    raise RuntimeError("Failed to import ASP. Please install Apex from https://github.com/NVIDIA/apex.")

device = torch.device('cuda')
print('Is cuda available: ' + str(torch.cuda.is_available()))

model = models.resnet50(pretrained=True)  # Define model structure
model.load_state_dict(torch.load('/home/ubuntu/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth'))
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # Define optimizer

ASP.prune_trained_model(model, optimizer)
torch.save(model.state_dict(), 'pruned_resnetmodel.pth')  # checkpoint has weights and masks
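One way to check whether a prune actually completed is to verify the saved weights follow the 2:4 pattern. This is a simplified, self-contained sketch (grouping along a flat row is an assumption for illustration; ASP applies the 2:4 pattern along the input-channel dimension of each pruned layer):

```python
def is_2_to_4_sparse(row, group=4, max_nonzero=2):
    """Check that every contiguous group of 4 values contains at most
    2 nonzeros -- the 2:4 structured-sparsity pattern ASP enforces."""
    assert len(row) % group == 0, "length must be divisible by the group size"
    return all(
        sum(1 for v in row[i:i + group] if v != 0) <= max_nonzero
        for i in range(0, len(row), group)
    )

# Toy example: a 2:4-pruned row passes, a fully dense row fails.
pruned = [0.5, 0.0, -0.3, 0.0, 0.0, 1.2, 0.0, 0.7]
dense = [0.5, 0.1, -0.3, 0.2, 0.4, 1.2, 0.6, 0.7]
print(is_2_to_4_sparse(pruned))  # True
print(is_2_to_4_sparse(dense))   # False
```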

I have also attached the log file.
resnet50_asp_pruning_log.txt (386.3 KB)

Hi, I have run inference with the resnext101_32x8d dense and sparse models in FP16 as described in the post above, and I don't see any improvement in inference for the sparse model. Could someone look into it?

Dense model:
trt generation: trtuser@fde6b05b597a:/workspace/TensorRT/build/out$ trtexec --onnx=resnext101_32x8d_pyt_torchvision_dense.onnx --saveEngine=resnext101_dense_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16
Batch size: 32000
Precision: fp16
Processing time for 1 loop:4.3s

Sparse model:
trt generation: trtuser@fde6b05b597a:/workspace/TensorRT/build/out$ trtexec --onnx=resnext101_32x8d_pyt_torchvision_sparse.onnx --saveEngine=resnext101_sparse_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16 --sparsity=enable
Batch size: 32000
Precision: fp16
Processing time for 1 loop:4.2s

Hi, I'm using TensorRT to test ResNeXt101, but I run into the following problem:

[07/19/2022-09:37:20] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_221 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237, MatMul_240

[07/19/2022-09:37:20] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[07/19/2022-09:37:20] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3117, GPU 1787 (MiB)
[07/19/2022-09:37:20] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3117, GPU 1797 (MiB)
[07/19/2022-09:37:20] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +8, GPU +339, now: CPU 8, GPU 339 (MiB)
[07/19/2022-09:37:20] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[07/19/2022-09:37:20] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.

All of these layers are listed as eligible for sparse math, but none appear under "TRT inference plan picked sparse implementation for layers". Could you help me check why this happens?

I've solved this problem by adding the enable-FP16 flag.
However, I've noticed that the sparsity module only seems to support convolution operations and barely any linear (fully connected) layers. Is that the case, or do we need another flag to enable linear-layer acceleration?