I am trying to use TensorRT to execute a BERT-large model with 2:4 structured sparsity. However, I cannot get TensorRT to pick a sparse implementation for any of the layers. Could someone look into this issue?
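For context, 2:4 structured sparsity means that in every contiguous group of four weights, at most two are nonzero. Below is a small NumPy sketch (the helper names are my own, not from any NVIDIA tool) that prunes a weight matrix to this pattern and checks it, which can be useful for confirming your exported weights actually satisfy the constraint:

```python
import numpy as np

def prune_2_to_4(w):
    """Zero out the two smallest-magnitude weights in each group of 4 along the last axis."""
    w = w.copy()
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

def is_2_to_4_sparse(w):
    """True if every group of 4 consecutive weights has at most 2 nonzeros."""
    groups = w.reshape(-1, 4)
    return bool(np.all((groups != 0).sum(axis=1) <= 2))

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
pruned = prune_2_to_4(w)
print(is_2_to_4_sparse(pruned))  # True
```

In practice the pruning itself is usually done with NVIDIA's ASP (Automatic SParsity) tooling in PyTorch before export, but a check like this is handy for verifying the ONNX weights.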
Environment
TensorRT Version: 8.2.0.6
GPU Type: A100
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.2.4
Operating System + Version: Ubuntu 18.04.6
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): Not applicable
PyTorch Version (if applicable): 1.9
Baremetal or Container (if container which image + tag): tensorrt-ubuntu18.04-cuda11.4
We recommend trying the latest TensorRT version, 8.4 GA. If you still face the issue, please try the following with trtexec and share the logs so we can assist further.
Add --useCudaGraph to see if using CUDA graph helps at all (probably not)
Add --dumpProfile --separateProfileRun --verbose and share the logs. This will give us a per-layer performance breakdown.
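Combining the flags above, a trtexec invocation might look like the following (the ONNX file name is a placeholder for your exported BERT-large model, and --sparsity=force and --fp16 are added here on the assumption you want to force sparse kernels in FP16):

```shell
# Hypothetical model path; replace with your actual BERT-large ONNX export.
trtexec --onnx=bert_large.onnx \
        --fp16 \
        --sparsity=force \
        --useCudaGraph \
        --dumpProfile --separateProfileRun \
        --verbose 2>&1 | tee trtexec_sparsity.log
```

The tee'd log will contain the per-layer timings and the verbose builder output showing which tactics were selected.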
Hi, I am not the original poster but am also facing this issue. When I try to enable sparsity in BERT large with --sparsity=force, I do not see any performance benefit.
My logs are similar to the original poster's. It appears that the layers are all fused together, and the fused kernels are no longer compatible with the sparsity feature. Are there any suggestions for solving this issue? Thank you.
Sorry for the delayed response.
Currently, TensorRT doesn't support sparsity for Transformer models when the ONNX path is used.
The only way to use BERT with sparsity is to use the demo BERT sample in the TensorRT OSS repository.
Thank you. Is the sparsity feature only supported for Megatron? When I run the demo scripts to build and benchmark an engine with the --sparse flag, I see a speedup for Megatron-large but not for the original BERT-large.