Structured sparsity not working with BERT large

Description

I am trying to use TensorRT to run a BERT large model with 2:4 structured sparsity. However, I cannot get TensorRT to pick a sparse implementation for any of the layers. Could someone look into this issue?

Environment

TensorRT Version: 8.2.0.6
GPU Type: A100
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.2.4
Operating System + Version: Ubuntu 18.04.6
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): Not applicable
PyTorch Version (if applicable): 1.9
Baremetal or Container (if container which image + tag): tensorrt-ubuntu18.04-cuda11.4

Steps To Reproduce

/workspace/TensorRT/build/out/trtexec --onnx=sparse-bert-large-uncased-squad_opset11.onnx --saveEngine=bs_256_sparse-bert-large-uncased-squad_opset11.trt --duration=10 --workspace=10000 --fp16 --sparsity=enable --optShapes=input_mask:256x128,segment_ids:256x128,input_ids:256x128 --verbose

Relevant Files

sparse-bert-large-uncased-squad_opset11.log (1.4 MB)

Hi,

We recommend trying the latest TensorRT version, 8.4 GA. If you still face the issue, please try the following with trtexec and share the logs so we can assist you better.

  1. Add --useCudaGraph to see if using CUDA graph helps at all (probably not)
  2. Add --dumpProfile --separateProfileRun --verbose and share the logs. This will give us per-layer performance breakdown.

Thank you.

Hi,

I ran the inference with the latest TensorRT version, 8.4, and I am still facing the issue. Please find the logs below.

Command:
/workspace/TensorRT/build/out/trtexec --onnx=sparse-bert-large-uncased-squad_opset11.onnx --saveEngine=bs_256_sparse-bert-large-uncased-squad_opset11.trt --duration=10 --workspace=10000 --fp16 --useCudaGraph --dumpProfile --separateProfileRun --sparsity=enable --optShapes=input_mask:256x128,segment_ids:256x128,input_ids:256x128 --verbose

Relevant Files:
sparse-bert-large-uncased-squad_opset11.log (1.4 MB)

Hi,

We went through the logs, and it looks like the layers are not using sparsity. Could you please share the ONNX model with us for better debugging?
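
A quick way to check this on your side is to filter the verbose log for lines that mention sparsity. The snippet below is only a minimal sketch: the keyword "sparsity" is an assumption about what the verbose log contains, `find_lines` is a hypothetical helper, and the sample lines are placeholders, not actual trtexec output.

```python
def find_lines(log_lines, keyword="sparsity"):
    """Return the lines that contain `keyword`, ignoring case.

    `keyword` is an assumed search term; adjust it to whatever
    your verbose log actually prints.
    """
    needle = keyword.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]


# Placeholder lines for illustration only (not real trtexec output).
sample = [
    "placeholder line about building the engine\n",
    "placeholder line mentioning Sparsity for layer X\n",
    "placeholder line about timing\n",
]

hits = find_lines(sample)
for line in hits:
    print(line)

# With a real log, pass the open file instead of the sample list:
# with open("sparse-bert-large-uncased-squad_opset11.log") as f:
#     hits = find_lines(f)
```

If no lines match, or the matching lines list no layers, that is consistent with no sparse implementation being picked.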

Thank you.

@spolisetty, thanks for checking the logs.

I generated the ONNX files using the steps below.

  1. Downloaded the pretrained checkpoint from the link below:
    BERT PyTorch checkpoint (Large, QA, SQuAD1.1, AMP) | NVIDIA NGC
  2. Pruned the checkpoint with the ASP library (ASP.prune_trained_model(model, optimizer)).
  3. Converted the model to ONNX.
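
Before sharing the model, it may be worth confirming that the exported weights really follow the 2:4 pattern, that is, at most 2 nonzero values in every group of 4 consecutive values. The sketch below is a minimal pure-Python check on a weight matrix given as nested lists. The helper names `is_2_4_sparse` and `transpose` and the toy matrices are hypothetical, and which axis the groups should run along in the exported ONNX initializers is an assumption to verify for your model (an exported weight may be stored transposed).

```python
def is_2_4_sparse(rows, group=4, max_nonzero=2):
    """Return True if every group of `group` consecutive values in each
    row has at most `max_nonzero` nonzero entries."""
    for row in rows:
        # Rows whose length is not a multiple of the group size cannot
        # be split into complete groups, so treat them as not sparse.
        if len(row) % group != 0:
            return False
        for start in range(0, len(row), group):
            nonzero = sum(1 for v in row[start:start + group] if v != 0)
            if nonzero > max_nonzero:
                return False
    return True


def transpose(rows):
    """Swap rows and columns, for weights that are stored transposed."""
    return [list(col) for col in zip(*rows)]


# Toy matrices for illustration only (not taken from the BERT model).
pruned = [
    [0.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.7, 0.0],
    [0.0, 0.0, 2.0, 0.1, 1.1, 0.0, 0.0, 0.0],
]
dense = [[0.5, 0.4, 0.3, 0.0]]

print(is_2_4_sparse(pruned))
print(is_2_4_sparse(dense))
```

If a weight fails the check as stored, trying `is_2_4_sparse(transpose(weight))` shows whether the pattern is present along the other axis.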

I hope this helps.

Sorry, could you please share the ONNX model with us, either here or via DM?
It would help us look into this issue quickly.

Thank you.