Hi @mdztravelling ,
As it happens, I am currently working on a small Python library that helps convert transformers models to TensorRT and/or ONNX Runtime, and prepares Triton server templates (if you are not interested in Triton, just copy the TensorRT engine).
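To give a rough idea of what the ONNX side involves, here is a sketch of the standard torch.onnx.export flow (not the library's actual code; the file name model.onnx and the opset are just placeholders I picked for the example):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "albert-base-v2"  # any hub name or local path works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# A dummy input is only used for tracing the graph.
encoded = tokenizer("dummy input for tracing", return_tensors="pt")

# Export with dynamic batch / sequence axes so the graph accepts several shapes.
torch.onnx.export(
    model,
    args=(encoded["input_ids"], encoded["attention_mask"], encoded["token_type_ids"]),
    f="model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
        "output": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
    do_constant_folding=True,
)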
It’s a work in progress, the README is not finished yet, and an OSS licence still has to be added, but you can find it there:
For Albert, I just checked and it works out of the box:
convert_model -m albert-base-v2 --batch 16 16 16 --sequence-length 128 128 128 --backend tensorrt onnx pytorch
It should display something like this:
Inference done on NVIDIA GeForce RTX 3090
[TensorRT (FP16)] mean=1.44ms, sd=0.08ms, min=1.40ms, max=2.39ms, median=1.42ms, 95p=1.57ms, 99p=1.84ms
[ONNX Runtime (vanilla)] mean=3.20ms, sd=0.19ms, min=3.11ms, max=4.42ms, median=3.15ms, 95p=3.56ms, 99p=4.22ms
[ONNX Runtime (optimized)] mean=1.72ms, sd=0.12ms, min=1.67ms, max=3.02ms, median=1.69ms, 95p=1.87ms, 99p=2.26ms
[Pytorch (FP32)] mean=9.30ms, sd=0.32ms, min=8.88ms, max=12.35ms, median=9.26ms, 95p=9.75ms, 99p=10.24ms
[Pytorch (FP16)] mean=10.70ms, sd=0.54ms, min=10.19ms, max=18.51ms, median=10.61ms, 95p=11.39ms, 99p=12.98ms
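If you go the ONNX Runtime route, running the exported model takes only a few lines. This is just a minimal sketch (not the library's API), it assumes a model.onnx file like the one from the export example above, and it does not by itself reproduce the "optimized" numbers, which come from extra graph optimizations:

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
# Falls back to CPU if the GPU provider (onnxruntime-gpu build) is not available.
session = ort.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

encoded = tokenizer("some text to encode", return_tensors="np")
outputs = session.run(
    output_names=None,  # None means: return every output declared in the graph
    input_feed={k: v for k, v in encoded.items()},
)
print(outputs[0].shape)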
Of course, it also works if you provide a local path instead of a HF hub path; you just need to put the tokenizer files alongside the model. And if you want to do the work yourself, the source code is based on the TensorRT Python API (see the sketch at the end of this post); if you work in C++, the API is almost the same.
It works best with TensorRT 8.2 (preview).
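For the "do it yourself" route, here is roughly what building the engine looks like with the TensorRT Python API (a simplified sketch of the usual ONNX parser flow, not the exact code from the library; the input names and the batch 16 / sequence 128 shapes are assumptions matching the export and command above):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, as in the FP16 numbers above
config.max_workspace_size = 1 << 30   # fine on 8.2, deprecated in later releases

# One optimization profile covering the shapes used in the benchmark (batch 16, seq 128).
profile = builder.create_optimization_profile()
for name in ["input_ids", "attention_mask", "token_type_ids"]:
    profile.set_shape(name, min=(16, 128), opt=(16, 128), max=(16, 128))
config.add_optimization_profile(profile)

# build_serialized_network returns the serialized plan, ready to write to disk.
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)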