Accelerate inference of BART with TensorRT

I rewrote the HuggingFace T5 demo (https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace) for BART, and I run inference with the 'generate' method, using top_p sampling and num_return_sequences=10.
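For reference, the generation call looks roughly like this. This is a minimal sketch; the checkpoint name, input text, and parameter values here are placeholders rather than my exact setup:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval().cuda()

inputs = tokenizer("some input text", return_tensors="pt").to("cuda")

# top_p sampling; each call runs the decoder with an effective
# batch size of num_return_sequences.
with torch.no_grad():
    outputs = model.generate(
        inputs["input_ids"],
        do_sample=True,
        top_p=0.9,
        num_return_sequences=10,
        max_length=64,
    )
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```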


As shown in the screenshot, when num_return_sequences=1 the TensorRT engine is about 2x faster than the PyTorch model. But as I increase num_return_sequences, the TRT engine gradually becomes slower than PyTorch, and at num_return_sequences=10 the TRT engine is clearly slower than the PyTorch model. Does anyone know why this happens and how to resolve it? Note that the batch_size of the TRT BART decoder is set equal to num_return_sequences when exporting the decoder to TensorRT.
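For context, the decoder export looks roughly like the sketch below. This is a simplified version rather than the demo's exact exporter; the wrapper class, dummy shapes, and opset are placeholders, and past key values / final_logits_bias are omitted for brevity:

```python
import torch
from transformers import BartForConditionalGeneration


class DecoderWrapper(torch.nn.Module):
    """Wraps BART's decoder + lm_head so they export as a single ONNX graph."""

    def __init__(self, model):
        super().__init__()
        self.decoder = model.model.decoder
        self.lm_head = model.lm_head

    def forward(self, input_ids, encoder_hidden_states):
        hidden = self.decoder(
            input_ids=input_ids,
            encoder_hidden_states=encoder_hidden_states,
        )[0]
        return self.lm_head(hidden)


model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").eval()
wrapper = DecoderWrapper(model)

# Dummy inputs with batch = num_return_sequences (10); d_model = 768 for bart-base.
dummy_input_ids = torch.ones(10, 1, dtype=torch.long)
dummy_encoder_states = torch.randn(10, 64, 768)

torch.onnx.export(
    wrapper,
    (dummy_input_ids, dummy_encoder_states),
    "bart_decoder.onnx",
    input_names=["input_ids", "encoder_hidden_states"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "tgt_len"},
        "encoder_hidden_states": {0: "batch", 1: "src_len"},
        "logits": {0: "batch", 1: "tgt_len"},
    },
    opset_version=13,
)
```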

The main problem is that when batch_size is small (like 1, 2, or 3), TensorRT is faster than PyTorch, but when batch_size is large (like 10), TensorRT is slower than PyTorch.
Does anyone know how batch_size affects TensorRT performance? Thanks!
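For reference, the decoder engine is built roughly as in the sketch below (assumed input names and shapes, TensorRT 8.2 Python API). One thing I am not sure about is the optimization profile: since kernels are tuned for the "opt" shape, whether opt is set to batch 1 or batch 10 might matter here.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("bart_decoder.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse bart_decoder.onnx")

config = builder.create_builder_config()
config.max_workspace_size = 2 << 30  # 2 GiB
config.set_flag(trt.BuilderFlag.FP16)

# min / opt / max shapes: kernels are tuned for the "opt" shape, so this
# profile is optimized for batch = num_return_sequences = 10.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (10, 1), (10, 64))
profile.set_shape("encoder_hidden_states", (1, 1, 768), (10, 64, 768), (10, 128, 768))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
with open("bart_decoder.engine", "wb") as f:
    f.write(engine.serialize())
```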

TensorRT Version: 8.2.2
GPU: T4

Hi,

Could you please share a minimal repro of the issue with us so we can try it on our end for better debugging?

Thank you.

My TensorRT BART issue with the test scripts is at Issue #1756 · NVIDIA/TensorRT · GitHub.
The test scripts are also available at https://pan.baidu.com/s/1C_N17uLBbLhJRTCqc3GnRA (password: v27t).
Looking forward to your reply.
Thank you very much!

Hi,

That link doesn't seem to be working; it redirects to Issues · NVIDIA/TensorRT · GitHub. If possible, could you please share the scripts via Google Drive or another working URL?

Thank you.