Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

Originally published at: Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Developer Blog

TensorRT is an SDK for high-performance deep learning inference, and TensorRT 8.0 introduces support for sparsity that uses the sparse Tensor Cores on NVIDIA Ampere GPUs. It can accelerate networks by skipping the computation of zeros in the GEMM operations of neural networks. You get a performance gain compared to dense networks just by following the steps in this post.
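For example, with trtexec the feature is controlled by a single flag. A minimal sketch (model.onnx is a placeholder name; the --sparsity flag assumes TensorRT 8.0's trtexec):

# Build an engine, letting TensorRT pick sparse tactics for layers whose
# weights already follow the 2:4 structured-sparsity pattern
trtexec --onnx=model.onnx --saveEngine=model.engine --sparsity=enable --fp16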

Hi,

I tested the resnext101_32x8d_sparse_onnx_v1 and resnext101_32x8d_dense_onnx_v1 models with trtexec as the blog describes.
The sparse result is:
[07/28/2021-08:45:44] [I] Throughput: 501.618 qps
[07/28/2021-08:45:44] [I] Latency: min = 1.81982 ms, max = 3.1843 ms, mean = 2.05464 ms, median = 2.02905 ms, percentile(99%) = 2.29761 ms

The dense result is:
[07/28/2021-08:48:35] [I] Throughput: 498.367 qps
[07/28/2021-08:48:35] [I] Latency: min = 1.78882 ms, max = 2.14081 ms, mean = 2.06509 ms, median = 2.06612 ms, percentile(99%) = 2.10205 ms

I also used nsys to profile the kernel trace, but I fail to see any difference between the kernels used by the sparse model and those used by the dense model…

I am running the experiment on a 3090 with the nvcr.io/nvidia/pytorch:21.07-py3 Docker image. Any idea?

Thx

Hello @leiwen, thanks for your comment. There was a mistake in the code snippet. The ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1 command downloads the dense model, NOT the sparse model. You can change this to ngc registry model download-version nvidia/resnext101_32x8d_sparse_onnx:1 and then run the trtexec command with sparsity enabled to convert the ONNX model to a TRT engine. Please let me know if that works, thanks!
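For reference, here is a sketch of the corrected sequence, using the directory layout from your post (the --sparsity flag assumes TensorRT 8.0's trtexec; the engine file name is arbitrary):

# Download the sparse model, NOT the dense one
ngc registry model download-version nvidia/resnext101_32x8d_sparse_onnx:1

# Build the TRT engine with sparse tactics enabled
trtexec --onnx=resnext101_32x8d_sparse_onnx_v1/resnext101_32x8d_pyt_torchvision_sparse.onnx \
    --saveEngine=resnext101_sparse.engine --sparsity=enable --fp16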


Hi @asawarkar,

I redownloaded the ONNX files, and here are the md5sums of the two files:
c962aeafd8a7000f3c72bbfcd2165572 resnext101_32x8d_sparse_onnx_v1/resnext101_32x8d_pyt_torchvision_sparse.onnx
49beb2920f6f6e42eb20b874a30eab98 resnext101_32x8d_dense_onnx_v1/resnext101_32x8d_pyt_torchvision_dense.onnx

But I still cannot see any performance improvement for the sparse ONNX model.

When TRT builds the sparse one, it prints the message below:
[08/07/2021-09:14:02] [I] [TRT] (Sparsity) Layers eligible for sparse math: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_221 + Add_222 + Relu_223, Conv_224 + Relu_225, Conv_228 + Add_229 + Relu_230, Conv_231 + Relu_232, Conv_235 + Add_236 + Relu_237
[08/07/2021-09:14:02] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: Conv_3 + Relu_4, Conv_7, Conv_8 + Add_9 + Relu_10, Conv_11 + Relu_12, Conv_15 + Add_16 + Relu_17, Conv_18 + Relu_19, Conv_22 + Add_23 + Relu_24, Conv_25 + Relu_26, Conv_29, Conv_30 + Add_31 + Relu_32, Conv_33 + Relu_34, Conv_37 + Add_38 + Relu_39, Conv_40 + Relu_41, Conv_44 + Add_45 + Relu_46, Conv_47 + Relu_48, Conv_51 + Add_52 + Relu_53, Conv_54 + Relu_55, Conv_58, Conv_59 + Add_60 + Relu_61, Conv_62 + Relu_63, Conv_66 + Add_67 + Relu_68, Conv_69 + Relu_70, Conv_73 + Add_74 + Relu_75, Conv_76 + Relu_77, Conv_80 + Add_81 + Relu_82, Conv_83 + Relu_84, Conv_87 + Add_88 + Relu_89, Conv_90 + Relu_91, Conv_94 + Add_95 + Relu_96, Conv_97 + Relu_98, Conv_101 + Add_102 + Relu_103, Conv_104 + Relu_105, Conv_108 + Add_109 + Relu_110, Conv_111 + Relu_112, Conv_115 + Add_116 + Relu_117, Conv_118 + Relu_119, Conv_122 + Add_123 + Relu_124, Conv_125 + Relu_126, Conv_129 + Add_130 + Relu_131, Conv_132 + Relu_133, Conv_136 + Add_137 + Relu_138, Conv_139 + Relu_140, Conv_143 + Add_144 + Relu_145, Conv_146 + Relu_147, Conv_150 + Add_151 + Relu_152, Conv_153 + Relu_154, Conv_157 + Add_158 + Relu_159, Conv_160 + Relu_161, Conv_164 + Add_165 + Relu_166, Conv_167 + Relu_168, Conv_171 + Add_172 + Relu_173, Conv_174 + Relu_175, Conv_178 + Add_179 + Relu_180, Conv_181 + Relu_182, Conv_185 + Add_186 + Relu_187, Conv_188 + Relu_189, Conv_192 + Add_193 + Relu_194, Conv_195 + Relu_196, Conv_199 + Add_200 + Relu_201, Conv_202 + Relu_203, Conv_206 + Add_207 + Relu_208, Conv_209 + Relu_210, Conv_213 + Add_214 + Relu_215, Conv_216 + Relu_217, Conv_220, Conv_224 + Relu_225, Conv_231 + Relu_232

I assumed that after enabling structured sparsity, it would gain at least a 2x speedup over the non-sparse kernels?
But in the nsys profile, no big improvement is seen.
Could you list the performance you would expect resnext101_32x8d_pyt_torchvision_sparse.onnx to reach on the 3090 platform with sparsity turned on or off? And does the 2x speedup assumption hold for this case?

Thx,
Lei

Hi @leiwen

The assumption of a 2x performance gain from structured sparsity is incorrect. We don't have numbers for the 3090, but on the A100 the end-to-end performance gain for ResNeXt101 32x8d in INT8 should be in the range of 1% to 8%. If FP16 is used, the sparse vs. dense perf gap is larger.
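If you want to measure the gap on the 3090 yourself, one option is to build both models in FP16 and compare the Throughput lines that trtexec reports. A sketch, using the file names from this thread (the --sparsity flag assumes TensorRT 8.0's trtexec):

# Dense baseline: keep every layer on dense tactics
trtexec --onnx=resnext101_32x8d_dense_onnx_v1/resnext101_32x8d_pyt_torchvision_dense.onnx \
    --fp16 --sparsity=disable

# Sparse model: allow sparse tactics where the weights are 2:4-sparse
trtexec --onnx=resnext101_32x8d_sparse_onnx_v1/resnext101_32x8d_pyt_torchvision_sparse.onnx \
    --fp16 --sparsity=enable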

I think the performance comparison should take a single kernel as the example. In my previous experience, when switching from FP16 to INT8, a convolution of the same shape would be accelerated to up to twice its original speed.

As the article also mentions, on Ampere dense INT8 peaks at 624 TOPS while sparse INT8 peaks at 1248 TOPS, so if the kernel is implemented correctly, shouldn't its performance also double? For example, a compute-bound INT8 convolution that takes 1 ms at 624 TOPS should in theory take about 0.5 ms at 1248 TOPS.

Thx