My question is: am I applying sparsity correctly, and if not, how can I get the claimed speedup from structured sparsity?

Here is what I did to prune and quantize the model.

- First, I applied sparsity as follows:

```
import torch
from apex.contrib.sparsity import ASP

model_sparse.model.cuda()
optimizer_sparse = torch.optim.AdamW(model_sparse.parameters(), lr=learning_rate, weight_decay=0.05)
# Compute the 2:4 masks and patch the optimizer so pruned weights stay at zero
ASP.prune_trained_model(model_sparse, optimizer_sparse)
# Fine-tune with the masks applied, then save the sparse weights
trainer.fit(model=model_sparse, train_dataloaders=train_loader)
torch.save({"state_dict": model_sparse.state_dict()}, "/home/orin-1/yue/TLR/models/model_sparse.ckpt")
```

- Then I reloaded the pruned model from the checkpoint, applied quantization, and exported the ONNX model as follows (the model definition and parameter-loading code are omitted):

```
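from apex.contrib.sparsity import ASP
from pytorch_quantization import nn as quant_nn

def prune_trained_model_custom(model, optimizer, compute_sparse_masks=True):
    asp = ASP()
    # Restrict pruning to the quantized layer types, using the 2:4 (m4n2_1d) mask pattern
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[quant_nn.QuantLinear, quant_nn.QuantConv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()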
```

- Then I ran the provided script to remove the Q/DQ nodes and save the calibration cache, and built the quantized TensorRT engine as follows. I also tried running on the GPU only, without DLA, and still saw no speedup.

```
/usr/src/tensorrt/bin/trtexec --onnx='qat_sparse_864_gpu_noqdq.onnx' --saveEngine='qat_sparse_864_gpu_noqdq.trt' --int8 --fp16 --calib='qat_sparse_864_gpu_precision_config_calib.cache' --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0
```

However, the sparse model gives no speedup: according to the following log, none of the layers are eligible for sparse math. I am sure the weights meet the structured-sparsity requirement, i.e. two out of every four consecutive elements along the input-channel dimension are exactly zero; you can also see this in the ONNX model.

```
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
```
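For anyone who wants to double-check the 2:4 claim independently of TensorRT, here is a minimal sketch of the check in plain Python. The helper names `is_2to4_sparse` and `count_violations` are hypothetical (they are not part of Apex or TensorRT), and the sketch assumes each row has already been laid out so that it runs along the input-channel dimension.

```python
def is_2to4_sparse(row, group=4, min_zeros=2):
    """Return True if every `group` consecutive values hold at least `min_zeros` exact zeros."""
    if len(row) % group != 0:
        return False  # the row length must be a multiple of the group size
    for start in range(0, len(row), group):
        zeros = sum(1 for v in row[start:start + group] if v == 0)
        if zeros < min_zeros:
            return False
    return True


def count_violations(rows):
    """Count the rows that break the 2:4 pattern."""
    return sum(1 for row in rows if not is_2to4_sparse(row))


# Toy weights (made-up values): each row runs along the input-channel dimension.
dense_row = [0.5, -0.1, 0.3, 0.2, 0.0, 0.0, 0.7, 0.1]   # first group of 4 has no zeros
sparse_row = [0.5, 0.0, 0.0, 0.2, 0.0, 0.0, 0.7, 0.1]   # two zeros in every group of 4
print(count_violations([dense_row, sparse_row]))
```

To apply this to a real model, the weight initializers read from the ONNX file would first need to be reshaped so that the input-channel axis is the innermost one; how to do that depends on the layer type and layout, so it is left out here.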

I have attached my ONNX model and TensorRT engine so you can reproduce the issue. Thanks!

Desktop.zip (15.0 MB)