Structured sparsity not working with BERT large


I am trying to use TensorRT to run a BERT large model with 2:4 structured sparsity. However, I cannot get TensorRT to pick a sparse implementation for any of the layers. Could someone look into this issue?


TensorRT Version:
GPU Type: A100
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.2.4
Operating System + Version: Ubuntu 18.04.6
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): Not applicable
PyTorch Version (if applicable): 1.9
Baremetal or Container (if container which image + tag): tensorrt-ubuntu18.04-cuda11.4

Steps To Reproduce

/workspace/TensorRT/build/out/trtexec --onnx=sparse-bert-large-uncased-squad_opset11.onnx --saveEngine=bs_256_sparse-bert-large-uncased-squad_opset11.trt --duration=10 --workspace=10000 --fp16 --sparsity=enable --optShapes=input_mask:256x128,segment_ids:256x128,input_ids:256x128 --verbose

Relevant Files

sparse-bert-large-uncased-squad_opset11.log (1.4 MB)


We recommend trying the latest TensorRT version, 8.4 GA. If you still face the issue, please try the following with trtexec and share the logs so we can assist further.

  1. Add --useCudaGraph to see if using CUDA graph helps at all (probably not)
  2. Add --dumpProfile --separateProfileRun --verbose and share the logs. This will give us per-layer performance breakdown.

Thank you.


I ran inference with the latest TensorRT version, 8.4, and I am still facing the issue. Please find the logs below.

/workspace/TensorRT/build/out/trtexec --onnx=sparse-bert-large-uncased-squad_opset11.onnx --saveEngine=bs_256_sparse-bert-large-uncased-squad_opset11.trt --duration=10 --workspace=10000 --fp16 --useCudaGraph --dumpProfile --separateProfileRun --sparsity=enable --optShapes=input_mask:256x128,segment_ids:256x128,input_ids:256x128 --verbose

Relevant Files:
sparse-bert-large-uncased-squad_opset11.log (1.4 MB)


We went through the logs, and it looks like the layers are not using sparsity. Could you please share the ONNX model with us for better debugging?
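For anyone who wants to do the same check on their own --verbose log, here is a minimal Python sketch that pulls out the lines mentioning sparsity. It assumes the relevant messages contain the word "sparsity", which may not hold for every TensorRT version. The function name `sparsity_lines` is hypothetical, and the sample text is a placeholder, not real trtexec output.

```python
def sparsity_lines(log_text, keyword="sparsity"):
    """Return the stripped log lines that mention `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [
        line.strip()
        for line in log_text.splitlines()
        if needle in line.lower()
    ]


# Placeholder text standing in for a log file; NOT real trtexec output.
sample_log = """\
placeholder line about engine building
placeholder line that mentions Sparsity
placeholder line about timing
another placeholder mentioning sparsity again
"""

matches = sparsity_lines(sample_log)
for line in matches:
    print(line)
```

With a real log, read the file first, for example `sparsity_lines(open("sparse-bert-large-uncased-squad_opset11.log").read())`. If nothing sparsity-related shows up for the layers you expect, that would be consistent with what is described above.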

Thank you.

@spolisetty, thanks for checking the logs.

I generated the ONNX file using the steps below.

  1. Downloaded the pretrained checkpoint from the link below:
    BERT PyTorch checkpoint (Large, QA, SQuAD1.1, AMP) | NVIDIA NGC
  2. Pruned the checkpoint with the ASP library (ASP.prune_trained_model(model, optimizer))
  3. Converted the model to ONNX.

I hope this helps.
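As a sanity check on the pruning step, here is a minimal pure-Python sketch of what a 2:4 pattern means: in every group of four consecutive weights, at most two are non-zero. It is not ASP's actual implementation. The helper names `prune_2_4` and `is_2_4_sparse` are hypothetical, the magnitude-based choice of which two values to keep is an assumption, and the groups are taken along each row, which is assumed here to be the reduction dimension.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2_4_sparse(matrix):
    """True if every group of four values in each row has at most two non-zeros."""
    for row in matrix:
        if len(row) % 4 != 0:
            return False
        for start in range(0, len(row), 4):
            nonzeros = sum(1 for v in row[start:start + 4] if v != 0)
            if nonzeros > 2:
                return False
    return True


# Toy weights for illustration only; not taken from the BERT checkpoint.
dense = [[0.9, -0.1, 0.3, 0.05, 1.2, -1.5, 0.2, 0.7]]
sparse = [prune_2_4(row) for row in dense]

print(is_2_4_sparse(dense))
print(is_2_4_sparse(sparse))
```

Running the same kind of check on the weights stored in the exported ONNX file would show whether the 2:4 pattern survived the conversion, which is worth ruling out before looking at TensorRT itself.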

Sorry, could you please share the ONNX model with us, here or via DM?
It would help us look into this issue quickly.

Thank you.

Hi, I am not the original poster but am also facing this issue. When I try to enable sparsity in BERT large with --sparsity=force, I do not see any performance benefit.

My logs are similar to the original poster’s. It appears that all the layers are fused together and are no longer compatible with the sparsity feature. Are there any suggestions to solve this issue? Thank you.

output.log (1.4 MB)
onnx_graph.txt (109.3 KB)


Could you please share with us the ONNX model for better debugging?

Thank you.

Here is my ONNX file for BERT large.

I created it using HuggingFace scripts here.


Sorry for the delayed response.
Currently, TensorRT doesn’t support sparsity for transformers when the ONNX path is used.
The only way to use BERT with sparsity is to use the BERT demo in the TensorRT OSS repository.

Thank you.

Thank you. Is the sparsity feature only supported for Megatron? When I run the demo scripts to build and benchmark an engine with the --sparse flag, I notice a speedup in Megatron-large but not in the original BERT-large.

Yes. Currently, Megatron only.