Sparsity does not provide any speedup for TensorRT on DLA

slimwangyue · January 9, 2024, 3:17pm

My question is: am I doing the sparsity correctly and if not, how to get the claimed speedup from adding structured sparsity?

Here is what I did to prune and quantize the model.

First I applied sparsity as follows

from apex.contrib.sparsity import ASP
model_sparse.model.cuda()
optimizer_sparse = torch.optim.AdamW(model_sparse.parameters(), lr=learning_rate, weight_decay=0.05)
ASP.prune_trained_model(model_sparse, optimizer_sparse)
trainer.fit(model=model_sparse, train_dataloaders=train_loader)
torch.save({"state_dict": model_sparse.state_dict()}, "/home/orin-1/yue/TLR/models/model_sparse.ckpt")

then I reload the pruned model from the checkpoint and apply the quantization and then export the onnx as follows (model definition and parameters loading codes are omitted)

def prune_trained_model_custom(model, optimizer, compute_sparse_masks=True):
    asp = ASP()
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[quant_nn.QuantLinear, quant_nn.QuantConv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()

then I run the provided script to remove qdq and save the calib cache and export the quantized trt engine as follows, I also tried on GPU only without using DLA, still no speedup.

/usr/src/tensorrt/bin/trtexec --onnx='qat_sparse_864_gpu_noqdq.onnx' --saveEngine=qat_sparse_864_gpu_noqdq.trt' --int8 --fp16 --calib='qat_sparse_864_gpu_precision_config_calib.cache' --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0

However this sparse model does not give any speedup because none of the layers are eligible for sparse math from the following log. But I am sure that the structured sparsity meets the requirements which is two elements are exactly 0s out of four elements across the input channel dimension, you can also observe this in the onnx model.

[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS

I have attached my onnx model and trt engine for you to reproduce. Thanks!
Desktop.zip (15.0 MB)

AastaLLL · January 10, 2024, 2:18am

Hi,

Based on the 12.7. Sparsity on DLA section in the TensorRT document:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dla-sparsity

Only available for INT8 convolution for formats other than NHWC.
Channel size must be larger than 64.

Your model’s input is 1x3x768x864 so it’s NCHW rather than NHWC.
Thanks.

slimwangyue · January 10, 2024, 1:42pm

It says other than NHWC, but my model is NCHW, so I think it should be compatible. Could you help double-check? Thanks.

AastaLLL · January 11, 2024, 5:39am

Hi,

Sorry for the missing.

We need to try it internally for further comment.
Will update more info later.

Thanks.

AastaLLL · January 16, 2024, 7:00am

Hi,

It requires a 2:4 sparsity pattern.
You can find more details in the below link:

Thanks.

slimwangyue · January 16, 2024, 1:35pm

Thanks for checking this. The document you shared is also the one I followed to prune my model and I checked the parameters already satisfying the 2:4 sparsity pattern. Could you double check the onnx model? Thanks!

AastaLLL · January 22, 2024, 3:04am

Hi,

Here is some update from our internal team.

When compiling DLA loadable, layer placement info can be found if setting “DETAILED”
to profilingVerbosity mode:

The DETAILED mode is not exposed through TRT in JetPack 6.0 DP yet.
This feature will be enabled in our future release.

About why you don’t see a speedup with sparsity.
Our internal will discuss this further. Will let you know if we get any updates.

Thanks.

Topic		Replies	Views
TensorRT model inference fully on DLA is slow due to abnormally slow cudaEventSynchronize time Jetson AGX Orin tensorrt , cuda , dla	10	1634	January 17, 2024
Structured sparsity not working with explicit quantization TensorRT tensorrt	5	995	March 31, 2022
Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT Technical Blog	13	2857	June 2, 2023
Sparsity in INT8: Training Workflow and Best Practices for NVIDIA TensorRT Acceleration Technical Blog	0	390	May 26, 2023
Problem with structured sparsity and explicit quantization (PTQ) on Tiny-Yolov7 TensorRT	5	778	May 26, 2023
Stuctured sparsity 2:4 does not improve inference performance on Jetson Orin TensorRT tensorrt	6	911	October 17, 2023
2:4 sparsity doesnot improve inference performance on RTX 3090 TensorRT tensorrt	14	3392	September 9, 2022
Tensorrt performance General Topics and Other SDKs tensorrt	0	491	March 30, 2022
Enabling sparsity to model between other devices using tensorrt TensorRT tensorrt , ai-training	1	822	September 7, 2023
Same resnext101 model size for dense and sparse Jetson Nano jetson-inference	7	766	January 4, 2024

Sparsity does not provide any speedup for TensorRT on DLA

Related topics