Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

Originally published at: Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Developer Blog

TensorRT is an SDK for high-performance deep learning inference, and TensorRT 8.0 introduces support for sparsity that uses the sparse Tensor Cores on NVIDIA Ampere GPUs. It can accelerate networks by skipping the computation of zeros in the GEMM operations of neural networks. You get a performance gain compared to dense networks just by following the steps in this post.

Hi,

I tested the resnext101_32x8d_sparse_onnx_v1 and resnext101_32x8d_dense_onnx_v1 models with trtexec as the blog describes.
The sparse result:
[07/28/2021-08:45:44] [I] Throughput: 501.618 qps
[07/28/2021-08:45:44] [I] Latency: min = 1.81982 ms, max = 3.1843 ms, mean = 2.05464 ms, median = 2.02905 ms, percentile(99%) = 2.29761 ms

The dense result:
[07/28/2021-08:48:35] [I] Throughput: 498.367 qps
[07/28/2021-08:48:35] [I] Latency: min = 1.78882 ms, max = 2.14081 ms, mean = 2.06509 ms, median = 2.06612 ms, percentile(99%) = 2.10205 ms

I also used nsys to profile the kernel trace, but I fail to see any difference between the kernels used by the sparse model and those used by the dense model…

I am running the experiment on a 3090 with the nvcr.io/nvidia/pytorch:21.07-py3 Docker image. Any idea?

Thx

Hello @leiwen, thanks for your comment. There was a mistake in the code snippet: the ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1 command downloads the dense model, NOT the sparse model. You can change this to ngc registry model download-version nvidia/resnext101_32x8d_sparse_onnx:1 and then run the trtexec command with sparsity enabled when exporting the ONNX model to a TRT engine. Please let me know if that works, thanks!


Hi @asawarkar ,

I re-downloaded the ONNX files, and here are the md5sums of the two files:
c962aeafd8a7000f3c72bbfcd2165572 resnext101_32x8d_sparse_onnx_v1/resnext101_32x8d_pyt_torchvision_sparse.onnx
49beb2920f6f6e42eb20b874a30eab98 resnext101_32x8d_dense_onnx_v1/resnext101_32x8d_pyt_torchvision_dense.onnx

But I still cannot see any performance improvement for the sparse ONNX model.

When TRT builds the sparse one, it prints the message below:
[08/07/2021-09:14:02] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_221 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237
[08/07/2021-09:14:02] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_224 + Relu_225, Conv_231 + Relu_232

I assumed that after enabling structured sparsity, it would gain at least a 2x speedup over the non-sparse kernels.
But in the nsys profile, no big improvement is seen.
Could you list the performance you expect resnext101_32x8d_pyt_torchvision_sparse.onnx to reach on a 3090 platform with sparsity turned on versus off? And does the 2x speedup assumption hold for this case?

Thx,
Lei

Hi @leiwen

The assumption of double the performance due to structured sparsity is incorrect. We don't have numbers for the 3090, but on A100 the end-to-end performance gain for ResNeXt101 32x8d should be in the range of 1% to 8% in INT8. If FP16 is used, the sparse-vs-dense perf gap is larger.

I think the performance comparison should take a single kernel as the example. In my previous experience, when switching from FP16 to INT8, a convolution of the same shape would be accelerated to up to twice its original speed.

As the article also mentions, on Ampere dense INT8 delivers 624 TOPS while sparse delivers 1248 TOPS, so I think that if the kernel is implemented correctly, its performance should also show a 2x speedup?

Thx
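To see why a 2x peak-math rate need not translate into a 2x network speedup, here is a minimal Amdahl's-law sketch. The time fractions below are hypothetical, for illustration only; only the portion of runtime actually spent in sparse-eligible math can be accelerated:

```python
# Amdahl's law: if only `sparse_fraction` of end-to-end runtime benefits
# from the 2x sparse math rate, the overall speedup is much smaller.
def end_to_end_speedup(sparse_fraction, kernel_speedup=2.0):
    """Overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - sparse_fraction) + sparse_fraction / kernel_speedup)

# Hypothetical fractions of time spent in sparse-eligible math:
for frac in (0.1, 0.3, 0.5):
    print(f"{frac:.0%} sparse-eligible -> {end_to_end_speedup(frac):.2f}x end to end")
```

Even at 50% of runtime in sparse-eligible math, the end-to-end gain is only about 1.33x; memory-bound layers and layers where TRT picks a dense tactic shrink it further.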

@asawarkar I am planning to test sparsity gains for BERT models and want to know whether a bert-large ONNX model is also available in the NGC registry, like the ResNeXt model below:
ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1

Hi,

As described in the slides, I used the following script to prune the weights of a pretrained resnet50 model, but it has been more than 24 hours and the pruning process still hasn't completed. Could someone help with this?

Script used:
import torch
import torchvision
import torch.optim as optim
from torchvision import datasets, transforms, models

try:
    from apex.contrib.sparsity import ASP
except ImportError:
    raise RuntimeError("Failed to import ASP. Please install Apex from https://github.com/NVIDIA/apex.")

device = torch.device('cuda')
print('Is cuda available: ' + str(torch.cuda.is_available()))

model = models.resnet50(pretrained=True)  # Define model structure
model.load_state_dict(torch.load('/home/ubuntu/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth'))
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # Define optimizer

ASP.prune_trained_model(model, optimizer)
torch.save(model.state_dict(), 'pruned_resnetmodel.pth')  # checkpoint has weights and masks
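One way to check whether a prune actually completed is to verify the saved weights follow the 2:4 pattern. This is a simplified, self-contained sketch (grouping along a flat row is an assumption for illustration; ASP applies the 2:4 pattern along the input-channel dimension of each pruned layer):

```python
def is_2_to_4_sparse(row, group=4, max_nonzero=2):
    """Check that every contiguous group of 4 values contains at most
    2 nonzeros -- the 2:4 structured-sparsity pattern ASP enforces."""
    assert len(row) % group == 0, "length must be divisible by the group size"
    return all(
        sum(1 for v in row[i:i + group] if v != 0) <= max_nonzero
        for i in range(0, len(row), group)
    )

# Toy example: a 2:4-pruned row passes, a fully dense row fails.
pruned = [0.5, 0.0, -0.3, 0.0, 0.0, 1.2, 0.0, 0.7]
dense = [0.5, 0.1, -0.3, 0.2, 0.4, 1.2, 0.6, 0.7]
print(is_2_to_4_sparse(pruned))  # True
print(is_2_to_4_sparse(dense))   # False
```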

I have also attached the log file.
resnet50_asp_pruning_log.txt (386.3 KB)

Hi, I have run inference with the resnext101_32x8d dense and sparse models in FP16 as described in the post above, and I don't see any improvement in inference for the sparse model. Could someone look into it?

Dense model:
trt generation: trtuser@fde6b05b597a:/workspace/TensorRT/build/out$ trtexec --onnx=resnext101_32x8d_pyt_torchvision_dense.onnx --saveEngine=resnext101_dense_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16
Batch size: 32000
Precision: fp16
Processing time for 1 loop:4.3s

Sparse model:
trt generation: trtuser@fde6b05b597a:/workspace/TensorRT/build/out$ trtexec --onnx=resnext101_32x8d_pyt_torchvision_sparse.onnx --saveEngine=resnext101_sparse_engine_pytorch.trt --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16 --sparsity=enable
Batch size: 32000
Precision: fp16
Processing time for 1 loop:4.2s

Hi, I'm using TensorRT to test ResNeXt101, but I run into the following problem:

[07/19/2022-09:37:20] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_221 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237, MatMul_240

[07/19/2022-09:37:20] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[07/19/2022-09:37:20] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3117, GPU 1787 (MiB)
[07/19/2022-09:37:20] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3117, GPU 1797 (MiB)
[07/19/2022-09:37:20] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +8, GPU +339, now: CPU 8, GPU 339 (MiB)
[07/19/2022-09:37:20] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[07/19/2022-09:37:20] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.

All of these layers are listed as eligible for sparse math, but none appear under "TRT inference plan picked sparse implementation for layers". Could you help me check why this happens?

I've solved this problem by adding the enable-FP16 flag.
However, I've noticed that the sparsity module only seems to support convolution operations and barely any linear (fully connected) layers. Is that the case, or do we need another flag to enable linear-layer acceleration?