How to accelerate ALBERT with TensorRT?

Description

How can I accelerate ALBERT using TensorRT?

Environment

TensorRT Version: 8.0.3
GPU Type: T4
Nvidia Driver Version: 465.19.01
CUDA Version: 11.3
CUDNN Version: 7
Operating System + Version: CentOS 8.2
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

Are you using TensorFlow ALBERT?

Hi @mdztravelling ,

It so happens that I am currently working on a small Python library that helps convert transformers models to TensorRT and/or ONNX Runtime, and prepares Triton server templates (if you are not interested in Triton, just copy the TensorRT engine).

It's still a work in progress, the README is not yet finished and an OSS licence has to be added, but you can find it there:

For Albert, I just checked, it works out of the box:

convert_model -m albert-base-v2 --batch 16 16 16 --sequence-length 128 128 128 --backend tensorrt onnx pytorch

It should display something like this:

Inference done on NVIDIA GeForce RTX 3090
[TensorRT (FP16)] mean=1.44ms, sd=0.08ms, min=1.40ms, max=2.39ms, median=1.42ms, 95p=1.57ms, 99p=1.84ms
[ONNX Runtime (vanilla)] mean=3.20ms, sd=0.19ms, min=3.11ms, max=4.42ms, median=3.15ms, 95p=3.56ms, 99p=4.22ms
[ONNX Runtime (optimized)] mean=1.72ms, sd=0.12ms, min=1.67ms, max=3.02ms, median=1.69ms, 95p=1.87ms, 99p=2.26ms
[Pytorch (FP32)] mean=9.30ms, sd=0.32ms, min=8.88ms, max=12.35ms, median=9.26ms, 95p=9.75ms, 99p=10.24ms
[Pytorch (FP16)] mean=10.70ms, sd=0.54ms, min=10.19ms, max=18.51ms, median=10.61ms, 95p=11.39ms, 99p=12.98ms

Of course, it also works if you provide a local path instead of an HF hub path; you just need to put the tokenizer next to the model. And if you want to do the conversion yourself, the source code is based on the TensorRT Python API (if you work in C++ it's almost the same); a rough sketch of that approach follows below.

It works best with TensorRT 8.2 (preview).
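To illustrate the TensorRT Python API approach mentioned above, here is a minimal sketch of building an engine from an ONNX export. The file names, input names, and shape profile are assumptions for illustration, not something taken from the library:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

# Explicit-batch network + ONNX parser
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:          # path is an assumption
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.max_workspace_size = 10 * 1024 ** 3   # ~10 GB, matching the trtexec example below
config.set_flag(trt.BuilderFlag.FP16)

# Dynamic shape profile (min / opt / max) for each input
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 128), opt=(16, 128), max=(16, 128))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)

The FP16 flag and the (1, 128) / (16, 128) profile mirror the batch and sequence-length settings used in the convert_model command above.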

Thank you very much for sharing this repo. @pommedeterresautee
I only need the TensorRT engine, so I used tf2onnx to convert my TF model to ONNX and then converted it to a TRT engine with TensorRT 8.2, but it failed because the OneHot plugin could not be found.
Have you ever encountered this error, and how can it be solved?

[11/22/2021-16:48:21] [TRT] [I] No importer registered for op: OneHot. Attempting to import as plugin.
[11/22/2021-16:48:21] [TRT] [I] Searching for plugin: OneHot, plugin_version: 1, plugin_namespace:
[11/22/2021-16:48:21] [TRT] [E] 3: getPluginCreator could not find plugin: OneHot version: 1
[11/22/2021-16:48:21] [TRT] [E] 4: [network.cpp::validate::2410] Error Code 4: Internal Error (Network must have at least one output)
[11/22/2021-16:48:21] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
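For reference, the tf2onnx export step mentioned above typically looks like this (the SavedModel directory, output path, and opset are assumptions):

python -m tf2onnx.convert --saved-model ./albert_saved_model --output model.onnx --opset 13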

Nope, I never hit any plugin issue.
Have you tried the trtexec command line (without FP16 or any other options)?

The command line looks like this:

/usr/src/tensorrt/bin/trtexec --onnx="model.onnx" --shapes=input_ids:1x128,attention_mask:1x128 --workspace=10000

The command above assumes an ONNX model exported without token type ids; it requires a GPU with around 10 GB of memory.

I ran the trtexec command, hit the same error, and the log is as follows.

[11/22/2021-17:15:07] [I] [TRT] Searching for plugin: OneHot, plugin_version: 1, plugin_namespace:
[11/22/2021-17:15:07] [E] [TRT] 3: getPluginCreator could not find plugin: OneHot version: 1
[11/22/2021-17:15:07] [E] [TRT] ModelImporter.cpp:720: While parsing node number 19 [OneHot -> "bert/embeddings/one_hot:0"]:
[11/22/2021-17:15:07] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[11/22/2021-17:15:07] [E] [TRT] ModelImporter.cpp:722: input: "bert/embeddings/Reshape_2:0"
input: "const_fold_opt__288"
input: "const_fold_opt__287"
output: "bert/embeddings/one_hot:0"
name: "bert/embeddings/one_hot"
op_type: "OneHot"
attribute {
  name: "axis"
  i: -1
  type: INT
}

[11/22/2021-17:15:07] [E] [TRT] ModelImporter.cpp:723: --- End node ---
[11/22/2021-17:15:07] [E] [TRT] ModelImporter.cpp:726: ERROR: builtin_op_importers.cpp:4643 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[11/22/2021-17:15:07] [E] Failed to parse onnx file
[11/22/2021-17:15:07] [I] Finish parsing network model
[11/22/2021-17:15:07] [E] Parsing model failed
[11/22/2021-17:15:07] [E] Failed to create engine from model.
[11/22/2021-17:15:07] [E] Engine set up failed

Looking at the forum, it seems you either need to implement a custom op yourself or, probably easier, split the model into several parts (I would advise using Netron to find where to split) and execute only part of the model in TensorRT (it's easy to plug the whole thing back together on Triton). See: Converting onnx to trt: [8] No importer registered for op: OneHot - #3 by francesco.ciannella. To help with splitting the model, I would recommend the polygraphy tool; a quick way to locate the problematic nodes is shown below.
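To find the OneHot nodes (and therefore where to split), here is a minimal sketch that inspects the ONNX graph directly; the file name model.onnx is an assumption:

import onnx

# Load the exported graph and list every OneHot node with its inputs/outputs,
# which is what you need to know before cutting the graph with polygraphy or Netron.
model = onnx.load("model.onnx")
for node in model.graph.node:
    if node.op_type == "OneHot":
        print(node.name, "inputs:", list(node.input), "outputs:", list(node.output))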

Probably even easier (I don't know if it's possible in your case): you may want to use ALBERT v2 from the HF hub, which seems to work out of the box (without requiring any custom plugin).

This repo (GitHub - mdztravelling/albert_trt_plugin: albert tensorrt plugin) provides an ALBERT plugin for TensorRT 8.0.3.
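If you go the custom-plugin route, the plugin library has to be loaded before the ONNX model is parsed so that getPluginCreator can find OneHot. A minimal sketch with the Python API; the library name libalbert_plugin.so is an assumption:

import ctypes
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

# Loading the shared library registers the custom plugin creators it contains.
ctypes.CDLL("libalbert_plugin.so")
# Also register TensorRT's built-in plugins.
trt.init_libnvinfer_plugins(TRT_LOGGER, "")
# ... then create the builder / network / OnnxParser as usual; the parser should now
# resolve the OneHot node against the registered plugin creator.

With trtexec, the equivalent is passing the same shared library via the --plugins flag.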