I am trying to use the new sparsity feature in TensorRT 8.0, which is supported on Ampere GPUs. I use the benchmark tool trtexec to measure inference performance (throughput, latency). trtexec provides three options for sparsity (disable/enable/force), where the force option means pruning the weights to the 2:4 compressed format and using Sparse Tensor Cores to accelerate the sparse MMA operations. However, in my experiments, the performance is similar whether I set --sparsity=disable or --sparsity=force.
What is the reason for it on RTX 3090?
Thanks!
Environment
TensorRT Version: 8.0 EA
GPU Type: RTX 3090
Nvidia Driver Version: 460
CUDA Version: 11.3
CUDNN Version: 8.0.4
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
You can refer to the link below for the full list of supported operators; if an operator is not supported, you need to create a custom plugin for that operation.
Also, please share your model and script (if not shared already) so that we can help you better.
Thanks very much for your response! I am currently using the ResNet50 model for inference. Here are the logs produced by the trtexec command.
set sparsity=disable: link
set sparsity=force: link
We can see that the TensorRT engine recognizes and picks the sparse weights correctly (lines 95-96) when I use the force option, but the performance does not improve. Could you give me any advice?
We reviewed the logs, and it does seem that the sparse kernels have been used. Could you please try the following with trtexec and let us know, so we can assist you better.
Add --useCudaGraph to see if using CUDA graph helps at all (probably not)
Add --dumpProfile --separateProfileRun --verbose and share the logs. This will give us per-layer performance breakdown.
Increasing the BS will probably increase the sparse vs. dense performance gain. It looks like BS=1 here. With a small BS, there is not enough computation to keep the entire GPU busy, which is why the sparse kernels are almost identical to the dense kernels in terms of runtime.
It seems that there are BatchNorms after the residual connections. One suggestion is to move the BatchNorms to before the residual connections so that they can be fused into the Convolution layers that precede them (see the sketch below).
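For illustration, here is a minimal PyTorch sketch of the suggested placement (the module and layer names are illustrative, not taken from your model): each BatchNorm sits directly after its Convolution and before the residual add, so TensorRT can fold the Conv + BN pair into one kernel.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block: each BatchNorm directly follows its
    Convolution and sits *before* the residual add, so Conv+BN can be fused."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)  # placed before the add, not after it
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # BN applied before the residual sum
        return self.relu(out + x)            # residual add happens after BN
```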
Hi @spolisetty, thanks very much for your response. Unfortunately, neither increasing the BS nor the op fusion yields a performance improvement in my experiments.
@spolisetty Sorry for giving you a confusing comment before. I have tested batch sizes ranging from 1 to 4096 (increasing by 4x each time) via the command trtexec --onnx=resnet50.onnx --useCudaGraph --fp16 --explicitBatch --batch=BS --sparsity=disable/enable/force. The following table shows the average throughput/latency for each configuration. The units of throughput and latency are qps and ms, respectively.
| BS | 1 | 4 | 16 | 64 | 256 | 1024 | 4096 |
|---|---|---|---|---|---|---|---|
| sparsity=disable | 2077/0.527 | 2084/0.524 | 2081/0.527 | 2087/0.526 | 2071/0.531 | 2081/0.527 | 2081/0.529 |
| sparsity=force | 2105/0.523 | 2107/0.520 | 2112/0.520 | 2080/0.529 | 2084/0.525 | 2100/0.522 | 2123/0.518 |
| 2:4 pruned model, sparsity=enable | 2108/0.522 | 2096/0.523 | 2091/0.525 | 2086/0.527 | 2097/0.525 | 2083/0.526 | 2099/0.525 |
Note that we use the original (unpruned) model when we set sparsity=disable and sparsity=force. When we set sparsity=enable, we use the model pruned to the 2:4 format following NVIDIA Apex ASP.
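For reference, this is roughly how we prune the model to 2:4 with Apex ASP (a minimal sketch assuming the apex.contrib.sparsity ASP API; the fine-tuning loop is omitted):

```python
import torch
import torchvision
from apex.contrib.sparsity import ASP

# Load the trained dense model (pretrained ResNet50 as an example).
model = torchvision.models.resnet50(pretrained=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Attach 2:4 sparsity masks to the eligible weights; after this call the
# model should be fine-tuned for a few epochs to recover accuracy.
ASP.prune_trained_model(model, optimizer)

# ... fine-tune here ...

# Export the pruned model to ONNX so trtexec can pick up the sparse weights
# with --sparsity=enable.
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "resnet50_pruned.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=13)
```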
Thanks for your help and looking forward to hearing from you.
The --batch flag does not work in this case since ONNX supports only explicit batch. You need to export the ONNX file with a dynamic batch dimension and then run trtexec with the --shapes=<input_tensor_name>:<input_shape> flag.
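For example, a minimal PyTorch export sketch with a dynamic batch dimension might look like this (the tensor names are illustrative; adjust them to match your model):

```python
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)

# Mark dimension 0 of the input/output as dynamic so trtexec can set the
# actual batch size at run time via --shapes.
torch.onnx.export(model, dummy, "resnet50_dynamic.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
                  opset_version=13)

# Then, for BS=64 for example:
#   trtexec --onnx=resnet50_dynamic.onnx --fp16 --sparsity=force \
#           --shapes=input:64x3x224x224
```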
@spolisetty Thanks for your valuable comments; we now get a ~10% improvement in performance (both throughput and latency) on the ResNet50 model. We will test more models in the future.