Description
My deep learning model is converted from a PyTorch model, pruned with NVIDIA's ASP (Automatic SParsity), and saved as an FP16 ONNX model. This model is then converted with trtexec using the sparsity option "force" and built with FP16 precision. However, when benchmarking on an A40 GPU, no latency/throughput improvement is observed. I thought this could be due to a small batch size, but the same holds for every batch size between B1 and B32 (above B32 I get an out-of-memory error). More detailed info is given below.
How can I still get sparse-matrix acceleration? Officially it offers up to a 2x increase in compute throughput, but my understanding is that for GEMMs and deep learning models the real-world speedup is closer to 1.3-1.5x. Am I doing something wrong in my steps to generate a sparse TRT engine, should I change the dimensions in my model, or is this a bug in TensorRT?
I have pruned a model with a YOLOv8 backbone and produced four variants: ONNX FP16 non-sparse, ONNX FP16 sparse, TensorRT engine FP16 non-sparse, and TensorRT engine FP16 sparse. I checked with ASP.is_sparsity_enabled() that sparsity is enabled for the model, and when counting the weights, 50% are zeros. The model is then saved as an ONNX model with:
torch.onnx.export(model, torch.randn((batch_size, 3, 1920, 1080), device="cuda").half(), onnx_model_path, opset_version=17, do_constant_folding=True)
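Beyond counting zeros, I also sanity-check that each Conv weight actually follows the 2:4 pattern, which is what the sparse Tensor Cores need. This is only a minimal sketch in plain PyTorch; the follows_2to4 helper is my own illustration (not part of ASP), and it assumes the 2:4 grouping runs along the input-channel axis:

import torch
import torch.nn as nn

def follows_2to4(weight: torch.Tensor) -> bool:
    # Conv2d weights have shape (out_ch, in_ch, kH, kW); the 2:4 pattern is
    # checked along the input-channel axis for every (out_ch, kH, kW) position.
    in_ch = weight.shape[1]
    if in_ch % 4 != 0:
        return False  # input channels not divisible by 4 -> not eligible
    w = weight.detach().permute(0, 2, 3, 1).reshape(-1, in_ch)
    groups = w.reshape(w.shape[0], -1, 4)
    nonzeros_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzeros_per_group <= 2).all())

# `model` is the pruned YOLOv8 backbone from above
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        print(name, "2:4 OK" if follows_2to4(module.weight) else "NOT 2:4")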
This ONNX model is then converted to a TensorRT engine with trtexec as follows:
trtexec --onnx=onnx_model_path --saveEngine=export_engine_path --fp16 --explicitBatch --profilingVerbosity=detailed --sparsity=force
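For reference, I believe this is roughly the equivalent build through the TensorRT Python API (a sketch only; paths are placeholders). Note that the SPARSE_WEIGHTS builder flag corresponds to --sparsity=enable; as far as I understand, the "force" behaviour (trtexec rewriting the weights into a 2:4 pattern before building) has no single builder-flag equivalent:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_sparse_fp16.onnx", "rb") as f:    # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)              # FP16 precision
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)    # allow sparse tactics

engine_bytes = builder.build_serialized_network(network, config)
with open("model_sparse_fp16.engine", "wb") as f:  # placeholder path
    f.write(engine_bytes)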
This is the result for batch size 1. trtexec reports that most of the layers are eligible for sparse math:
[08/06/2024-17:28:35] [I] [TRT] (Sparsity) Layers eligible for sparse math: /0/conv/Conv, /1/conv/Conv + PWN(PWN(/1/act/Sigmoid), /1/act/Mul), /2/cv1/conv/Conv + PWN(PWN(/2/cv1/act/Sigmoid), /2/cv1/act/Mul), /2/m.0/cv1/conv/Conv + PWN(PWN(/2/m.0/cv1/act/Sigmoid), /2/m.0/cv1/act/Mul), /2/m.1/cv1/conv/Conv + PWN(PWN(/2/m.1/cv1/act/Sigmoid), /2/m.1/cv1/act/Mul), /2/cv2/conv/Conv + PWN(PWN(/2/cv2/act/Sigmoid), /2/cv2/act/Mul), /3/conv/Conv + PWN(PWN(/3/act/Sigmoid), /3/act/Mul), /4/cv1/conv/Conv + PWN(PWN(/4/cv1/act/Sigmoid), /4/cv1/act/Mul), /4/m.0/cv1/conv/Conv + PWN(PWN(/4/m.0/cv1/act/Sigmoid), /4/m.0/cv1/act/Mul), /4/m.1/cv1/conv/Conv + PWN(PWN(/4/m.1/cv1/act/Sigmoid), /4/m.1/cv1/act/Mul), /4/m.2/cv1/conv/Conv + PWN(PWN(/4/m.2/cv1/act/Sigmoid), /4/m.2/cv1/act/Mul), /4/m.3/cv1/conv/Conv + PWN(PWN(/4/m.3/cv1/act/Sigmoid), /4/m.3/cv1/act/Mul), /4/cv2/conv/Conv + PWN(PWN(/4/cv2/act/Sigmoid), /4/cv2/act/Mul), /5/conv/Conv + PWN(PWN(/5/act/Sigmoid), /5/act/Mul), /6/cv1/conv/Conv + PWN(PWN(/6/cv1/act/Sigmoid), /6/cv1/act/Mul), /6/m.0/cv1/conv/Conv + PWN(PWN(/6/m.0/cv1/act/Sigmoid), /6/m.0/cv1/act/Mul), /6/m.1/cv1/conv/Conv + PWN(PWN(/6/m.1/cv1/act/Sigmoid), /6/m.1/cv1/act/Mul), /6/m.2/cv1/conv/Conv + PWN(PWN(/6/m.2/cv1/act/Sigmoid), /6/m.2/cv1/act/Mul), /6/m.3/cv1/conv/Conv + PWN(PWN(/6/m.3/cv1/act/Sigmoid), /6/m.3/cv1/act/Mul), /6/cv2/conv/Conv + PWN(PWN(/6/cv2/act/Sigmoid), /6/cv2/act/Mul), /7/conv/Conv + PWN(PWN(/7/act/Sigmoid), /7/act/Mul), /8/cv1/conv/Conv + PWN(PWN(/8/cv1/act/Sigmoid), /8/cv1/act/Mul), /8/m.0/cv1/conv/Conv + PWN(PWN(/8/m.0/cv1/act/Sigmoid), /8/m.0/cv1/act/Mul), /8/m.1/cv1/conv/Conv + PWN(PWN(/8/m.1/cv1/act/Sigmoid), /8/m.1/cv1/act/Mul), /8/cv2/conv/Conv + PWN(PWN(/8/cv2/act/Sigmoid), /8/cv2/act/Mul), /9/cv1/conv/Conv + PWN(PWN(/9/cv1/act/Sigmoid), /9/cv1/act/Mul), /9/cv2/conv/Conv + PWN(PWN(/9/cv2/act/Sigmoid), /9/cv2/act/Mul)
And the following layers are picked for sparse implementation:
[08/06/2024-16:58:36] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: /5/conv/Conv + PWN(PWN(/5/act/Sigmoid), /5/act/Mul), /6/m.0/cv1/conv/Conv + PWN(PWN(/6/m.0/cv1/act/Sigmoid), /6/m.0/cv1/act/Mul), /6/m.1/cv1/conv/Conv + PWN(PWN(/6/m.1/cv1/act/Sigmoid), /6/m.1/cv1/act/Mul), /6/m.2/cv1/conv/Conv + PWN(PWN(/6/m.2/cv1/act/Sigmoid), /6/m.2/cv1/act/Mul), /6/m.3/cv1/conv/Conv + PWN(PWN(/6/m.3/cv1/act/Sigmoid), /6/m.3/cv1/act/Mul), /7/conv/Conv + PWN(PWN(/7/act/Sigmoid), /7/act/Mul), /8/cv2/conv/Conv + PWN(PWN(/8/cv2/act/Sigmoid), /8/cv2/act/Mul), /9/cv2/conv/Conv + PWN(PWN(/9/cv2/act/Sigmoid), /9/cv2/act/Mul)
For B32 the same layers are eligible for sparse math, but fewer are selected:
[08/06/2024-17:28:35] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: /7/conv/Conv + PWN(PWN(/7/act/Sigmoid), /7/act/Mul), /8/m.0/cv1/conv/Conv + PWN(PWN(/8/m.0/cv1/act/Sigmoid), /8/m.0/cv1/act/Mul), /8/m.1/cv1/conv/Conv + PWN(PWN(/8/m.1/cv1/act/Sigmoid), /8/m.1/cv1/act/Mul)
Analyzing the engine with TREX, I can confirm from the chosen tactics that these layers are indeed sparse when compared to the other layers.
For all batch sizes between B1 and B32 no speedup is observed, even though multiple layers are selected for sparse implementation, which is strange! What is the reason for this? Furthermore, why are more layers selected at the lower batch size?
With TREX I also explored and compared the layers 1:1, non-sparse vs. sparse, but there is only a marginal speedup between the non-sparse and sparse layers, as shown below.
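To double-check the TREX comparison, I also diff the per-layer timings that trtexec can dump with --exportProfile for the non-sparse and sparse engines. A minimal sketch is below; it assumes the layer entries in that JSON carry "name" and "averageMs" fields, and the file names are placeholders:

import json

def load_profile(path):
    # Load a per-layer profile dumped by `trtexec --exportProfile=<path>`.
    # Assumes each layer entry has "name" and "averageMs" fields.
    with open(path) as f:
        entries = json.load(f)
    return {e["name"]: e["averageMs"] for e in entries if "name" in e}

dense = load_profile("profile_dense.json")    # placeholder paths
sparse = load_profile("profile_sparse.json")

for name in sorted(set(dense) & set(sparse)):
    d, s = dense[name], sparse[name]
    speedup = d / s if s > 0 else float("inf")
    print(f"{speedup:5.2f}x  {d:8.4f} ms -> {s:8.4f} ms  {name}")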
Environment
TensorRT Version: 8.6.1
GPU Type: A40
Nvidia Driver Version: 535.183.06
CUDA Version: 12.2
Operating System + Version: Ubuntu 22.04 LTS
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.1.0a0+32f93b1
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/pytorch:23.10-py3