I rewrote the HuggingFace T5 demo for BART (https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace), and I run inference through the `generate` method with top-p sampling and `num_return_sequences=10`.
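For reference, here is a minimal sketch of the generation call I'm using; the checkpoint name, input text, and `max_length` below are placeholders, not my exact values:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder checkpoint; my actual model is a fine-tuned BART.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base").cuda()

inputs = tokenizer("some input text", return_tensors="pt").to("cuda")
outputs = model.generate(
    inputs["input_ids"],
    do_sample=True,           # enable sampling
    top_p=0.9,                # nucleus (top-p) sampling
    num_return_sequences=10,  # this is the parameter I scale from 1 to 10
    max_length=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```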
As the screenshot shows, with `num_return_sequences=1` the TensorRT engine is about 2x faster than the PyTorch model. But as I increase `num_return_sequences`, the TRT engine gradually loses that advantage, and at `num_return_sequences=10` it is clearly slower than PyTorch. Does anyone know why this happens and how to resolve it? Note that `generate` expands the decoder input by `num_return_sequences`, so the batch size of the TRT BART decoder equals `num_return_sequences` when the decoder is exported to TensorRT.
The main problem: when the batch size is small (e.g., 1, 2, 3), TensorRT is faster than PyTorch, but when the batch size is large (e.g., 10), TensorRT is slower than PyTorch.
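In case the optimization profile range is relevant, here is roughly how I build the decoder engine. This is a simplified sketch: the ONNX path, input name, and shape ranges are illustrative placeholders, not my exact network definition:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("bart_decoder.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
# The batch dim has to cover num_return_sequences (up to 10 here);
# shapes are (batch, sequence_length) and are only examples.
profile.set_shape("input_ids", min=(1, 1), opt=(10, 1), max=(10, 64))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
```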
Does anyone know how batch size affects TensorRT performance? Thanks!
TensorRT Version: 8.2.2
GPU: T4