2:4 sparsity does not improve inference performance on RTX 3090

Description

Hi guys,

I am trying to use the new sparsity feature in TensorRT 8.0, which is supported on Ampere GPUs. I use the benchmark tool trtexec to measure inference performance (throughput, latency). trtexec provides three options for sparsity (disable/enable/force), where the force option prunes the weights to the 2:4 compressed format and uses Sparse Tensor Cores to accelerate the sparse MMA operations. However, in my experiments, the performance is similar when I set --sparsity=disable and --sparsity=force.
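(For context: 2:4 means that in every contiguous group of four weights along a row, at most two are nonzero. A quick illustrative check of that pattern, my own sketch rather than TensorRT code:)

  import torch

  def is_2to4_sparse(w: torch.Tensor) -> bool:
      # True if every contiguous group of 4 values has at most 2 nonzeros
      # (the 2:4 structured-sparsity pattern). Assumes the number of
      # elements is a multiple of 4.
      groups = w.reshape(-1, 4)
      return bool(((groups != 0).sum(dim=-1) <= 2).all())

  w = torch.tensor([[0.0, 1.5, 0.0, -2.0, 0.3, 0.0, 0.0, 0.7]])
  print(is_2to4_sparse(w))  # True: each group of 4 has exactly 2 nonzeros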

What is the reason for it on RTX 3090?

Thanks!

Environment

TensorRT Version: 8.0 EA
GPU Type: RTX 3090
Nvidia Driver Version: 460
CUDA Version: 11.3
CUDNN Version: 8.0.4
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

structured sparsity

Steps To Reproduce

trtexec --onnx=ResNet50.onnx --sparsity=force --fp16
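For the dense baseline, I run the same command with sparsity disabled:

trtexec --onnx=ResNet50.onnx --sparsity=disable --fp16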


Hi,
Can you try running your model with the trtexec command, and share the "--verbose" log in case the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can also check the list of supported operators; in case any operator is not supported, you will need to create a custom plugin for that operation.

Also, we request you to share your model and script, if not shared already, so that we can help you better.

Thanks!

Thanks very much for your response! Currently I am using the ResNet50 model for inference. Here are the logs produced by the trtexec command.
set sparsity=disable: link
set sparsity=force: link

We can see that the TensorRT engine recognizes and picks the sparse weights successfully (lines 95-96 of the log) when I use the force option, but the performance does not improve. Could you give me any advice?

Here are detailed logs generated by trtexec --verbose:
resnet50-sparsity-force.log (2.1 MB)
resnet50-sparsity-disable.log (2.0 MB)

Hi @shuo-ouyang,

We reviewed the logs, and it does seem that the sparse kernels have been used. Could you please try the following with trtexec and let us know, for better assistance.

  1. Add --useCudaGraph to see if using CUDA graph helps at all (probably not)
  2. Add --dumpProfile --separateProfileRun --verbose and share the logs. This will give us a per-layer performance breakdown. (An example command combining these flags follows below.)
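For example, combining both suggestions with the command from the original post:

trtexec --onnx=ResNet50.onnx --sparsity=force --fp16 --useCudaGraph --dumpProfile --separateProfileRun --verbose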

Thank you.

Thanks for your reply. I have tried your advice and the results are shown below:

  1. Adding the --useCudaGraph option does not help performance.

  2. Here are the logs dumped with --dumpProfile --separateProfileRun --verbose:
    resnet50-disable-profile.log (2.0 MB)
    resnet50-force-profile.log (2.1 MB)

Looking forward to hearing from you.

Hi @shuo-ouyang,

  • Increasing the BS will probably increase the sparse vs. dense performance gain. It looks like BS=1 here. With a small BS, there is not enough computation to keep the entire GPU busy; that is why the sparse kernel is almost identical to the dense kernel in terms of runtime.
  • It seems that there are BatchNorms after the residual connections. One suggestion is to move each BatchNorm to before the residual connection so that it can be fused into the Convolution layer that precedes it (see the folding sketch after this list).
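To illustrate why this fusion is free at inference time, here is a minimal PyTorch sketch (an illustration of the general technique, not TensorRT internals) of folding an inference-mode BatchNorm into the Conv2d that precedes it:

  import torch
  import torch.nn as nn

  def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
      # Inference-time folding:  w' = w * g / sqrt(var + eps)
      #                          b' = (b - mean) * g / sqrt(var + eps) + beta
      fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                        conv.stride, conv.padding, conv.dilation,
                        conv.groups, bias=True)
      scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
      fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
      bias = conv.bias.data if conv.bias is not None else torch.zeros_like(scale)
      fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
      return fused

  # Sanity check: the fused conv matches conv -> bn in eval mode.
  conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
  bn(conv(torch.randn(4, 8, 32, 32)))   # one forward pass populates running stats
  bn.eval()
  x = torch.randn(1, 8, 32, 32)
  with torch.no_grad():
      assert torch.allclose(fold_bn_into_conv(conv, bn)(x), bn(conv(x)), atol=1e-4)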

Thank you.

Hi @spolisetty, thanks very much for your response. Unfortunately, neither increasing the BS nor the OP fusion brings a performance improvement in my experiments.

@shuo-ouyang,

Could you please confirm what BS you are using (after the increase)?

Thank you.

@spolisetty Sorry for the confusing comment before. I have tested various BS values ranging from 1 to 4096 (increasing by a factor of 4 each time) via the command trtexec --onnx=resnet50.onnx --useCudaGraph --fp16 --explicitBatch --batch=BS --sparsity=disable/enable/force. The following table shows the average throughput/latency for each configuration; throughput is in qps and latency in ms.

BS                                   1           4           16          64          256         1024        4096
sparsity=disable                     2077/0.527  2084/0.524  2081/0.527  2087/0.526  2071/0.531  2081/0.527  2081/0.529
sparsity=force                       2105/0.523  2107/0.520  2112/0.520  2080/0.529  2084/0.525  2100/0.522  2123/0.518
2:4 pruned model, sparsity=enable    2108/0.522  2096/0.523  2091/0.525  2086/0.527  2097/0.525  2083/0.526  2099/0.525

Note that we use the original (unpruned) model when we set sparsity=disable or sparsity=force. When we set sparsity=enable, we use a model pruned to the 2:4 format with NVIDIA Apex ASP.
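For reference, the pruning step looks roughly like this (a sketch assuming a trained torchvision ResNet50; Apex must be built with its contrib extensions):

  import torch
  import torchvision
  from apex.contrib.sparsity import ASP

  model = torchvision.models.resnet50(pretrained=True).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

  # Rewrite eligible weights to the 2:4 structured-sparse pattern in place;
  # a short fine-tuning run afterwards recovers the lost accuracy.
  ASP.prune_trained_model(model, optimizer)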

Thanks for your help and looking forward to hearing from you.

Hi @shuo-ouyang,

The --batch flag does not work in this case, since ONNX models use explicit batch only, so the runs above were effectively all at BS=1. We need to export the ONNX file with a dynamic batch dimension and then run trtexec with the --shapes=<input_tensor_name>:<input_shape> flag.
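A minimal sketch of such an export, assuming a PyTorch ResNet50 (the tensor name input and the file name are illustrative):

  import torch
  import torchvision

  model = torchvision.models.resnet50(pretrained=True).eval()
  dummy = torch.randn(1, 3, 224, 224)

  # Mark dim 0 of input/output as dynamic so trtexec can set it via --shapes.
  torch.onnx.export(model, dummy, "resnet50_dynamic.onnx",
                    input_names=["input"], output_names=["output"],
                    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
                    opset_version=13)

followed by, e.g. for BS=64:

trtexec --onnx=resnet50_dynamic.onnx --fp16 --sparsity=force --shapes=input:64x3x224x224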

Thank you.


@spolisetty Thanks for your valuable comments; with this change we get a ~10% performance improvement (in both throughput and latency) on the ResNet50 model. We will test more models in the future.

@shuo-ouyang,
Thank you for the confirmation.

Hi @shuo-ouyang, I am facing a similar performance issue with InceptionV3 on an A6000 GPU. May I know what worked for you?

Sorry, I didn’t test InceptionV3. Based on my experience, we need to enable int8 if we want to benefit from structured sparsity.
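For reference, a command along those lines (illustrative; without a calibration cache, trtexec uses dummy int8 scales, which is fine for performance measurement but not for accuracy):

trtexec --onnx=resnet50_dynamic.onnx --int8 --sparsity=force --shapes=input:64x3x224x224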