Description
My deep learning model is converted from a PyTorch model, pruned with NVIDIA's ASP (Automatic SParsity), and saved as an FP16 ONNX model. This model is then converted with trtexec using the sparsity option "force" and built with FP16 precision. However, when benchmarking on an A40 GPU, no latency/throughput improvement is observed. I thought this could be due to a small batch size, but the same holds for every batch size between B1 and B32 (above B32 I get an out-of-memory error). More detailed info is given below.
How can I still get sparse-matrix acceleration? Officially it offers up to a 2x increase in compute throughput, but my understanding is that for GEMMs and deep learning models the real-world speedup is closer to 1.3-1.5x. Am I doing something wrong in my steps to generate a sparse TRT engine, should I change the dimensions in my model, or is this a bug in TensorRT?
I have pruned a model with a YOLOv8 backbone and produced four variants: ONNX FP16 non-sparse, ONNX FP16 sparse, TensorRT engine FP16 non-sparse, and TensorRT engine FP16 sparse. I checked with ASP.is_sparsity_enabled() that sparsity is enabled for the model, and when counting the weights, 50% are zeros. The model is then saved as an ONNX model with:
torch.onnx.export(model, torch.randn((batch_size, 3, 1920, 1080), device="cuda").half(), onnx_model_path, opset_version=17, do_constant_folding=True)
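Beyond counting zeros, I also sanity-check that each Conv weight actually follows the 2:4 pattern, which is what the sparse Tensor Cores need. This is only a minimal sketch in plain PyTorch; the follows_2to4 helper is my own illustration (not part of ASP), and it assumes the 2:4 grouping runs along the input-channel axis:

import torch
import torch.nn as nn

def follows_2to4(weight: torch.Tensor) -> bool:
    # Conv2d weights have shape (out_ch, in_ch, kH, kW); the 2:4 pattern is
    # checked along the input-channel axis for every (out_ch, kH, kW) position.
    in_ch = weight.shape[1]
    if in_ch % 4 != 0:
        return False  # input channels not divisible by 4 -> not eligible
    w = weight.detach().permute(0, 2, 3, 1).reshape(-1, in_ch)
    groups = w.reshape(w.shape[0], -1, 4)
    nonzeros_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzeros_per_group <= 2).all())

# `model` is the pruned YOLOv8 backbone from above
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        print(name, "2:4 OK" if follows_2to4(module.weight) else "NOT 2:4")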
This ONNX model is then converted to a TensorRT engine with trtexec as follows:
trtexec --onnx=onnx_model_path --saveEngine=export_engine_path --fp16 --explicitBatch --profilingVerbosity=detailed --sparsity=force
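For reference, I believe this is roughly the equivalent build through the TensorRT Python API (a sketch only; paths are placeholders). Note that the SPARSE_WEIGHTS builder flag corresponds to --sparsity=enable; as far as I understand, the "force" behaviour (trtexec rewriting the weights into a 2:4 pattern before building) has no single builder-flag equivalent:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_sparse_fp16.onnx", "rb") as f:    # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)              # FP16 precision
config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)    # allow sparse tactics

engine_bytes = builder.build_serialized_network(network, config)
with open("model_sparse_fp16.engine", "wb") as f:  # placeholder path
    f.write(engine_bytes)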
This is the result for batch size 1. trtexec reports that most of the layers are eligible for sparse math:
[08/06/2024-17:28:35] [I] [TRT] (Sparsity) Layers eligible for sparse math: /0/conv/Conv, /1/conv/Conv + PWN(PWN(/1/act/Sigmoid), /1/act/Mul), /2/cv1/conv/Conv + PWN(PWN(/2/cv1/act/Sigmoid), /2/cv1/act/Mul), /2/m.0/cv1/conv/Conv + PWN(PWN(/2/m.0/cv1/act/Sigmoid), /2/m.0/cv1/act/Mul), /2/m.1/cv1/conv/Conv + PWN(PWN(/2/m.1/cv1/act/Sigmoid), /2/m.1/cv1/act/Mul), /2/cv2/conv/Conv + PWN(PWN(/2/cv2/act/Sigmoid), /2/cv2/act/Mul), /3/conv/Conv + PWN(PWN(/3/act/Sigmoid), /3/act/Mul), /4/cv1/conv/Conv + PWN(PWN(/4/cv1/act/Sigmoid), /4/cv1/act/Mul), /4/m.0/cv1/conv/Conv + PWN(PWN(/4/m.0/cv1/act/Sigmoid), /4/m.0/cv1/act/Mul), /4/m.1/cv1/conv/Conv + PWN(PWN(/4/m.1/cv1/act/Sigmoid), /4/m.1/cv1/act/Mul), /4/m.2/cv1/conv/Conv + PWN(PWN(/4/m.2/cv1/act/Sigmoid), /4/m.2/cv1/act/Mul), /4/m.3/cv1/conv/Conv + PWN(PWN(/4/m.3/cv1/act/Sigmoid), /4/m.3/cv1/act/Mul), /4/cv2/conv/Conv + PWN(PWN(/4/cv2/act/Sigmoid), /4/cv2/act/Mul), /5/conv/Conv + PWN(PWN(/5/act/Sigmoid), /5/act/Mul), /6/cv1/conv/Conv + PWN(PWN(/6/cv1/act/Sigmoid), /6/cv1/act/Mul), /6/m.0/cv1/conv/Conv + PWN(PWN(/6/m.0/cv1/act/Sigmoid), /6/m.0/cv1/act/Mul), /6/m.1/cv1/conv/Conv + PWN(PWN(/6/m.1/cv1/act/Sigmoid), /6/m.1/cv1/act/Mul), /6/m.2/cv1/conv/Conv + PWN(PWN(/6/m.2/cv1/act/Sigmoid), /6/m.2/cv1/act/Mul), /6/m.3/cv1/conv/Conv + PWN(PWN(/6/m.3/cv1/act/Sigmoid), /6/m.3/cv1/act/Mul), /6/cv2/conv/Conv + PWN(PWN(/6/cv2/act/Sigmoid), /6/cv2/act/Mul), /7/conv/Conv + PWN(PWN(/7/act/Sigmoid), /7/act/Mul), /8/cv1/conv/Conv + PWN(PWN(/8/cv1/act/Sigmoid), /8/cv1/act/Mul), /8/m.0/cv1/conv/Conv + PWN(PWN(/8/m.0/cv1/act/Sigmoid), /8/m.0/cv1/act/Mul), /8/m.1/cv1/conv/Conv + PWN(PWN(/8/m.1/cv1/act/Sigmoid), /8/m.1/cv1/act/Mul), /8/cv2/conv/Conv + PWN(PWN(/8/cv2/act/Sigmoid), /8/cv2/act/Mul), /9/cv1/conv/Conv + PWN(PWN(/9/cv1/act/Sigmoid), /9/cv1/act/Mul), /9/cv2/conv/Conv + PWN(PWN(/9/cv2/act/Sigmoid), /9/cv2/act/Mul)
And the following layers are picked for sparse implementation:
[08/06/2024-16:58:36] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: /5/conv/Conv + PWN(PWN(/5/act/Sigmoid), /5/act/Mul), /6/m.0/cv1/conv/Conv + PWN(PWN(/6/m.0/cv1/act/Sigmoid), /6/m.0/cv1/act/Mul), /6/m.1/cv1/conv/Conv + PWN(PWN(/6/m.1/cv1/act/Sigmoid), /6/m.1/cv1/act/Mul), /6/m.2/cv1/conv/Conv + PWN(PWN(/6/m.2/cv1/act/Sigmoid), /6/m.2/cv1/act/Mul), /6/m.3/cv1/conv/Conv + PWN(PWN(/6/m.3/cv1/act/Sigmoid), /6/m.3/cv1/act/Mul), /7/conv/Conv + PWN(PWN(/7/act/Sigmoid), /7/act/Mul), /8/cv2/conv/Conv + PWN(PWN(/8/cv2/act/Sigmoid), /8/cv2/act/Mul), /9/cv2/conv/Conv + PWN(PWN(/9/cv2/act/Sigmoid), /9/cv2/act/Mul)
For B32 the same layers are eligible for sparse math, but fewer are selected:
[08/06/2024-17:28:35] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers: /7/conv/Conv + PWN(PWN(/7/act/Sigmoid), /7/act/Mul), /8/m.0/cv1/conv/Conv + PWN(PWN(/8/m.0/cv1/act/Sigmoid), /8/m.0/cv1/act/Mul), /8/m.1/cv1/conv/Conv + PWN(PWN(/8/m.1/cv1/act/Sigmoid), /8/m.1/cv1/act/Mul)
Analyzing the engine with TREX, I can confirm from the chosen tactics that these layers are indeed sparse when compared to the other layers.
For all batch sizes between B1 and B32 no speedup is observed, even though multiple layers are selected for sparse implementation, which is strange! What is the reason for this? Furthermore, why are more layers selected at the lower batch size?
With TREX I also explored and compared the layers 1:1, non-sparse vs. sparse, but there is only a marginal speedup between the non-sparse and sparse layers, as shown below.
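To double-check the TREX comparison, I also diff the per-layer timings that trtexec can dump with --exportProfile for the non-sparse and sparse engines. A minimal sketch is below; it assumes the layer entries in that JSON carry "name" and "averageMs" fields, and the file names are placeholders:

import json

def load_profile(path):
    # Load a per-layer profile dumped by `trtexec --exportProfile=<path>`.
    # Assumes each layer entry has "name" and "averageMs" fields.
    with open(path) as f:
        entries = json.load(f)
    return {e["name"]: e["averageMs"] for e in entries if "name" in e}

dense = load_profile("profile_dense.json")    # placeholder paths
sparse = load_profile("profile_sparse.json")

for name in sorted(set(dense) & set(sparse)):
    d, s = dense[name], sparse[name]
    speedup = d / s if s > 0 else float("inf")
    print(f"{speedup:5.2f}x  {d:8.4f} ms -> {s:8.4f} ms  {name}")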
Environment
TensorRT Version: 8.6.1
GPU Type: A40
Nvidia Driver Version: 535.183.06
CUDA Version: 12.2
Operating System + Version: Ubuntu 22.04 LTS
Python Version (if applicable): 3.10.12
PyTorch Version (if applicable): 2.1.0a0+32f93b1
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/pytorch:23.10-py3