Sparsity does not provide any speedup for TensorRT on DLA

My question is: am I applying sparsity correctly, and if not, how can I get the claimed speedup from adding structured sparsity?

Here is what I did to prune and quantize the model.

  1. First, I applied sparsity as follows:

import torch
from apex.contrib.sparsity import ASP

model_sparse.model.cuda()
optimizer_sparse = torch.optim.AdamW(model_sparse.parameters(), lr=learning_rate, weight_decay=0.05)

# Compute 2:4 masks from the trained weights and hook the optimizer so the
# masked weights stay zero during fine-tuning.
ASP.prune_trained_model(model_sparse, optimizer_sparse)

# Fine-tune with the masks applied, then save the sparse checkpoint.
trainer.fit(model=model_sparse, train_dataloaders=train_loader)
torch.save({"state_dict": model_sparse.state_dict()}, "/home/orin-1/yue/TLR/models/model_sparse.ckpt")
  2. Then I reloaded the pruned model from the checkpoint, applied quantization, and exported the ONNX as follows (model definition and parameter-loading code omitted):
def prune_trained_model_custom(model, optimizer, compute_sparse_masks=True):
    asp = ASP()
    # Target the quantized modules and use the 2:4 ("m4n2") 1D mask calculator.
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2,
                               whitelist=[quant_nn.QuantLinear, quant_nn.QuantConv2d],
                               allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()
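
For completeness, the omitted invocation and export step might look roughly like this (a sketch: model_qat, optimizer_qat, and the output file name are hypothetical, the input shape follows the 1x3x768x864 mentioned below, and the fake-quant flag is the usual pytorch-quantization way to get Q/DQ nodes into the ONNX):

import torch
from pytorch_quantization import nn as quant_nn

# model_qat is the reloaded, quantized model (definition omitted as above).
model_qat.cuda()
optimizer_qat = torch.optim.AdamW(model_qat.parameters(), lr=learning_rate, weight_decay=0.05)

# Recompute the 2:4 masks on the quantized modules.
prune_trained_model_custom(model_qat, optimizer_qat, compute_sparse_masks=True)

# Export with Q/DQ nodes so TensorRT can consume the QAT model.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 768, 864).cuda()
torch.onnx.export(model_qat, dummy, "qat_sparse_864_gpu.onnx", opset_version=13)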
  3. Then I ran the provided script to remove the Q/DQ nodes and save the calibration cache, and built the quantized TensorRT engine as follows. I also tried GPU-only without DLA, and there was still no speedup.
/usr/src/tensorrt/bin/trtexec --onnx='qat_sparse_864_gpu_noqdq.onnx' --saveEngine='qat_sparse_864_gpu_noqdq.trt' --int8 --fp16 --calib='qat_sparse_864_gpu_precision_config_calib.cache' --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0

However, this sparse model gives no speedup: as the log below shows, none of the layers are eligible for sparse math. I am confident the weights meet the structured-sparsity requirement, i.e. at least two of every four elements along the input-channel dimension are exactly zero; you can also observe this in the ONNX model.

[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
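
For reference, the kind of check I mean for the 2:4 pattern on the exported ONNX weights might look like this (a minimal sketch; it views each conv weight as groups of four consecutive input-channel values and counts zeros per group):

import onnx
from onnx import numpy_helper

model = onnx.load("qat_sparse_864_gpu_noqdq.onnx")

for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.ndim != 4 or w.shape[1] % 4 != 0:  # conv weights are K x C x R x S
        continue
    # Move C last, then view the tensor as groups of 4 consecutive C values.
    groups = w.transpose(0, 2, 3, 1).reshape(-1, 4)
    zeros_per_group = (groups == 0).sum(axis=1)
    ok = bool((zeros_per_group >= 2).all())
    print(f"{init.name}: shape={tuple(w.shape)}, 2:4 satisfied={ok}")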

I have attached my ONNX model and TRT engine so you can reproduce the issue. Thanks!
Desktop.zip (15.0 MB)

Hi,

Based on the 12.7. Sparsity on DLA section of the TensorRT Developer Guide:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#dla-sparsity

  • Only available for INT8 convolution for formats other than NHWC.
  • Channel size must be larger than 64.

Your model’s input is 1x3x768x864, so its format is NCHW rather than NHWC.
Thanks.

The document says “other than NHWC”, and my model is NCHW, so I think it should be compatible. Could you help double-check? Thanks.
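
For what it’s worth, the channel-size constraint (channel size larger than 64) can be checked on the same ONNX file with a variation of the earlier sketch:

import onnx
from onnx import numpy_helper

model = onnx.load("qat_sparse_864_gpu_noqdq.onnx")

# Report each conv weight's input-channel count; DLA sparsity additionally
# requires the channel size to be larger than 64.
for init in model.graph.initializer:
    w = numpy_helper.to_array(init)
    if w.ndim == 4:  # K x C x R x S
        print(f"{init.name}: C={w.shape[1]}, larger_than_64={w.shape[1] > 64}")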

Hi,

Sorry for missing that.

We need to try it internally before we can comment further.
We will update with more info later.

Thanks.


Hi,

It requires a 2:4 sparsity pattern: in every group of four consecutive weights along the pruned dimension, at least two must be exactly zero.
You can find more details in the link below:

Thanks.

Thanks for checking this. The document you shared is the same one I followed to prune my model, and I have verified that the parameters already satisfy the 2:4 sparsity pattern. Could you double-check the ONNX model? Thanks!

Hi,

Here is an update from our internal team.

When compiling a DLA loadable, layer placement info can be found by setting the profilingVerbosity mode to “DETAILED”:

The DETAILED mode is not exposed through TensorRT in JetPack 6.0 DP yet.
This feature will be enabled in a future release.

As for why you don’t see a speedup with sparsity: our internal team will discuss this further, and we will let you know if we get any updates.

Thanks.