Inference time is not improving with increasing batch size

Description

I am using the Hugging Face bert-large-cased model and converted it to ONNX format using the transformers[onnx] library.
When I convert the ONNX model to a TensorRT engine, I don't see any improvement in latency as the batch size increases. Can you please help with this?
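For reference, the export step was roughly the following (a minimal sketch of the transformers[onnx] CLI; the exact output directory is an assumption based on the paths used in the trtexec commands below):

python -m transformers.onnx --model=bert-large-cased /git/notebooks/onnx/

This writes model.onnx into the given directory, and that is the file passed to trtexec.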

command:-
/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base.trt --shapes=input_ids:1x512,attention_mask:1x512,token_type_ids:1x512 --workspace=4096

output:-
[05/26/2022-16:43:12] [I] === Performance summary ===
[05/26/2022-16:43:12] [I] Throughput: 31.895 qps
[05/26/2022-16:43:12] [I] Latency: min = 31.0613 ms, max = 31.6614 ms, mean = 31.327 ms, median = 31.2918 ms, percentile(99%) = 31.6614 ms

command:-
/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base.trt --shapes=input_ids:8x512,attention_mask:8x512,token_type_ids:8x512 --workspace=4096

output:-
[05/26/2022-16:48:24] [I] === Performance summary ===
[05/26/2022-16:48:24] [I] Throughput: 4.42512 qps
[05/26/2022-16:48:24] [I] Latency: min = 224.912 ms, max = 226.356 ms, mean = 225.977 ms, median = 226.124 ms, percentile(99%) = 226.356 ms

command:-
/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base.trt --shapes=input_ids:32x512,attention_mask:32x512,token_type_ids:32x512 --workspace=4096

output:-
[05/26/2022-16:53:20] [I] === Performance summary ===
[05/26/2022-16:53:20] [I] Throughput: 1.13289 qps
[05/26/2022-16:53:20] [I] Latency: min = 879.309 ms, max = 884.625 ms, mean = 882.779 ms, median = 882.981 ms, percentile(99%) = 884.625 ms

Environment

TensorRT Version: 8.2.4.2
GPU Type: V100-SXM2
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2.2
CUDNN Version: 8.2.1.32
Operating System + Version: ubuntu-20.04.1
Python Version (if applicable): 3.7
TensorFlow Version (if applicable): 2.7
PyTorch Version (if applicable): n/a
Baremetal or Container (if container which image + tag): container

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

We request that you share the model, script, profiler, and performance output (if not already shared) so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider only the latency and throughput of the network inference itself, excluding data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
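If you want trtexec to average over a longer measurement window, you can also extend the warm-up time and measurement duration. For example (the flag values below are placeholders, not recommendations):

/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --shapes=input_ids:8x512,attention_mask:8x512,token_type_ids:8x512 --workspace=4096 --warmUp=500 --duration=20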

Thanks!

Hi,

We recommend that you try the latest TensorRT version, 8.4 EA. If you still face this issue, please share the ONNX model with us so that we can try it on our end for better debugging.

Thank you.

Hi, I tried with 8.4 EA and there is no change in the results. Can you please suggest where to upload the ONNX file? Can I attach it to this response?

command:-
/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base.trt --shapes=input_ids:1x512,attention_mask:1x512,token_type_ids:1x512 --workspace=4096

output:-
[05/27/2022-12:50:36] [I] === Performance summary ===
[05/27/2022-12:50:36] [I] Throughput: 32.5049 qps
[05/27/2022-12:50:36] [I] Latency: min = 30.4623 ms, max = 30.9917 ms, mean = 30.7379 ms, median = 30.7373 ms, percentile(99%) = 30.9917 ms
[05/27/2022-12:50:36] [I] End-to-End Host Latency: min = 30.4785 ms, max = 31.0029 ms, mean = 30.7483 ms, median = 30.7476 ms, percentile(99%) = 31.0029 ms

command:-
/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base.trt --shapes=input_ids:8x512,attention_mask:8x512,token_type_ids:8x512 --workspace=4096

output:-
[05/27/2022-12:52:36] [I] === Performance summary ===
[05/27/2022-12:52:36] [I] Throughput: 4.56083 qps
[05/27/2022-12:52:36] [I] Latency: min = 218.283 ms, max = 220.419 ms, mean = 219.258 ms, median = 219.195 ms, percentile(99%) = 220.419 ms
[05/27/2022-12:52:36] [I] End-to-End Host Latency: min = 218.291 ms, max = 220.43 ms, mean = 219.27 ms, median = 219.206 ms, percentile(99%) = 220.43 ms

command:-
/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base.trt --shapes=input_ids:32x512,attention_mask:32x512,token_type_ids:32x512 --workspace=4096

output:-
[05/27/2022-12:56:47] [I] === Performance summary ===
[05/27/2022-12:56:47] [I] Throughput: 1.13098 qps
[05/27/2022-12:56:47] [I] Latency: min = 881.25 ms, max = 887.634 ms, mean = 884.279 ms, median = 884.527 ms, percentile(99%) = 887.634 ms
[05/27/2022-12:56:47] [I] End-to-End Host Latency: min = 881.266 ms, max = 887.651 ms, mean = 884.295 ms, median = 884.543 ms, percentile(99%) = 887.651 ms

You can share a Google Drive link with us or upload the file directly in the message.
If the model is confidential, you can DM it to us.

Thank you.

Hi, since the file for bert-large-cased is large, I shared the file for bert-base-cased instead.
The same problem is seen for both models.
I emailed you the Google Drive link to the ONNX file.

Thanks

Hi,

Sorry, we missed conveying the following.
When we checked the logs, we found there is already a throughput improvement from batch_size=1 to batch_size=8. Multiplying the reported qps by the batch size gives the effective throughput in sequences per second:

  • batch_size=1: 32.5049
  • batch_size=8: 4.56083 * 8 = 36.48664
  • batch_size=32: 1.13098 * 32 = 36.19136

We observed similar results on our end as well:

  • batch_size=1: 100.496
  • batch_size=8: 14.9188 * 8 = 119.3504
  • batch_size=32: 3.74531 * 32 = 119.84992

The throughput saturates around batch_size=8 because BERT is a large model that already saturates the GPU's compute resources even at a low batch size, especially with FP32 on a V100. If you use FP16, the throughput saturation point will be higher.
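For reference, FP16 can be enabled in trtexec by adding the --fp16 flag to the same build command used above (the engine file name here is just an example; shapes shown for batch_size=8):

/usr/src/tensorrt/bin/trtexec --onnx=/git/notebooks/onnx/model.onnx --saveEngine=bert_base_fp16.trt --fp16 --shapes=input_ids:8x512,attention_mask:8x512,token_type_ids:8x512 --workspace=4096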

Thank you.

Thanks for your support.
I can see an improvement in latency with increasing batch size for FP16.
