Hi,

I am new to TensorRT and am working on optimizing our detection model with it. However, when I set max_batch_size > 1, inference time increases proportionally with the batch size. This happens both when running `onnx_to_tensorrt.py` (Python 2) and when running trtexec. I first noticed the problem with my own detection model, then switched to the sample model under the TensorRT-5.0.2.6 release (samples/python/yolov3_onnx), and it shows the same behavior (logs below).
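For anyone trying to reproduce this, the benchmark runs were along these lines (the flags shown are a plausible invocation for the TensorRT 5.x trtexec, written from memory rather than copied from my shell history; the ONNX file name is illustrative):

```shell
# Plausible trtexec invocations for TensorRT 5.x; "Average over 10 runs"
# in the logs below matches the default --avgRuns=10.
./trtexec --onnx=yolov3.onnx --fp16 --batch=1 > yolov3cuda0bat1fp16.log
./trtexec --onnx=yolov3.onnx --fp16 --batch=4 > yolov3cuda0bat4fp16.log
./trtexec --onnx=yolov3.onnx --batch=1 > yolov3cuda0bat1fp32.log
./trtexec --onnx=yolov3.onnx --batch=4 > yolov3cuda0bat4fp32.log
```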

Basically:

batch size = 1, FP16: inference time is about 10 ms
batch size = 4, FP16: inference time is about 39 ms
batch size = 1, FP32: inference time is about 34.4 ms
batch size = 4, FP32: inference time is about 129 ms
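To put a number on "proportionally": dividing the batch-4 averages by 4 gives almost the same per-image latency as batch 1, so batching buys essentially nothing here. A quick check using the averaged timings above:

```python
# Averaged trtexec timings from the logs below (ms per batch)
fp16 = {1: 10.34, 4: 39.10}
fp32 = {1: 34.37, 4: 129.2}

for name, times in (("fp16", fp16), ("fp32", fp32)):
    per_image_b1 = times[1] / 1      # batch 1: one image per run
    per_image_b4 = times[4] / 4      # batch 4: four images per run
    speedup = per_image_b1 / per_image_b4
    print(f"{name}: {per_image_b1:.2f} ms/image at batch 1 vs "
          f"{per_image_b4:.2f} ms/image at batch 4 "
          f"(per-image speedup only {speedup:.2f}x)")
```

In both precisions the per-image speedup from batching is well under 10%, i.e. the runtime scales almost linearly with batch size.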

The environment: Ubuntu 16.04, Driver Version 418.39, CUDA Version 10.1, running on Tesla T4s as shown:

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
| N/A   68C    P0    71W /  70W |   2881MiB / 15079MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |     10MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

nvcc --version reports: Cuda compilation tools, release 7.5, V7.5.17

TensorRT package: TensorRT-5.0.2.6.Ubuntu-16.04.4.x86_64-gnu.cuda-10.0.cudnn7.3.tar.gz

Can somebody explain why inference time grows linearly with batch size, with no per-image speedup from batching?

Thanks.

xxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ ll *log

-rw-rw-r-- 1 aifi aifi 1646 Mar 27 20:35 yolov3cuda0bat1fp16.log

-rw-rw-r-- 1 aifi aifi 1640 Mar 27 20:27 yolov3cuda0bat1fp32.log

-rw-rw-r-- 1 aifi aifi 1645 Mar 27 20:30 yolov3cuda0bat4fp16.log

-rw-rw-r-- 1 aifi aifi 1639 Mar 27 20:27 yolov3cuda0bat4fp32.log

xxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat1fp16.log

Average over 10 runs is 17.3652 ms (host walltime is 17.4882 ms, 99% percentile time is 18.1637).

Average over 10 runs is 13.7891 ms (host walltime is 13.8339 ms, 99% percentile time is 14.0738).

Average over 10 runs is 10.3425 ms (host walltime is 10.3858 ms, 99% percentile time is 10.3828).

Average over 10 runs is 10.3546 ms (host walltime is 10.397 ms, 99% percentile time is 10.496).

Average over 10 runs is 10.3128 ms (host walltime is 10.3576 ms, 99% percentile time is 10.3372).

Average over 10 runs is 10.3787 ms (host walltime is 10.4212 ms, 99% percentile time is 10.4978).

Average over 10 runs is 10.337 ms (host walltime is 10.3795 ms, 99% percentile time is 10.3498).

Average over 10 runs is 10.3523 ms (host walltime is 10.3958 ms, 99% percentile time is 10.4998).

Average over 10 runs is 10.3498 ms (host walltime is 10.3923 ms, 99% percentile time is 10.3673).

Average over 10 runs is 10.3173 ms (host walltime is 10.3602 ms, 99% percentile time is 10.4877).

xxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat4fp16.log

Average over 10 runs is 45.7616 ms (host walltime is 45.8084 ms, 99% percentile time is 60.2297).

Average over 10 runs is 38.8411 ms (host walltime is 38.8866 ms, 99% percentile time is 39.2171).

Average over 10 runs is 39.1089 ms (host walltime is 39.1546 ms, 99% percentile time is 39.8079).

Average over 10 runs is 39.1751 ms (host walltime is 39.2205 ms, 99% percentile time is 39.6014).

Average over 10 runs is 39.1758 ms (host walltime is 39.221 ms, 99% percentile time is 39.9894).

Average over 10 runs is 39.0968 ms (host walltime is 39.1409 ms, 99% percentile time is 39.594).

Average over 10 runs is 38.8867 ms (host walltime is 38.9312 ms, 99% percentile time is 39.467).

Average over 10 runs is 39.0935 ms (host walltime is 39.1381 ms, 99% percentile time is 39.9408).

Average over 10 runs is 39.272 ms (host walltime is 39.3178 ms, 99% percentile time is 40.1793).

Average over 10 runs is 39.0277 ms (host walltime is 39.0727 ms, 99% percentile time is 39.4325).

xxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat1fp32.log

Average over 10 runs is 44.0139 ms (host walltime is 44.0628 ms, 99% percentile time is 67.2938).

Average over 10 runs is 34.338 ms (host walltime is 34.3821 ms, 99% percentile time is 35.2482).

Average over 10 runs is 34.3672 ms (host walltime is 34.4122 ms, 99% percentile time is 35.2721).

Average over 10 runs is 34.5807 ms (host walltime is 34.6255 ms, 99% percentile time is 35.3508).

Average over 10 runs is 34.2389 ms (host walltime is 34.286 ms, 99% percentile time is 34.777).

Average over 10 runs is 34.3533 ms (host walltime is 34.3982 ms, 99% percentile time is 34.7647).

Average over 10 runs is 34.4273 ms (host walltime is 34.4717 ms, 99% percentile time is 35.0437).

Average over 10 runs is 34.348 ms (host walltime is 34.3925 ms, 99% percentile time is 35.1601).

Average over 10 runs is 34.4369 ms (host walltime is 34.4814 ms, 99% percentile time is 35.0536).

Average over 10 runs is 34.3577 ms (host walltime is 34.4018 ms, 99% percentile time is 34.7955).

xxxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat4fp32.log

Average over 10 runs is 137.07 ms (host walltime is 137.14 ms, 99% percentile time is 215.524).

Average over 10 runs is 128.768 ms (host walltime is 128.812 ms, 99% percentile time is 129.53).

Average over 10 runs is 128.943 ms (host walltime is 128.987 ms, 99% percentile time is 129.602).

Average over 10 runs is 129.165 ms (host walltime is 129.208 ms, 99% percentile time is 129.451).

Average over 10 runs is 129.211 ms (host walltime is 129.255 ms, 99% percentile time is 129.747).

Average over 10 runs is 129.358 ms (host walltime is 129.4 ms, 99% percentile time is 130.052).

Average over 10 runs is 129.455 ms (host walltime is 129.498 ms, 99% percentile time is 130.278).

Average over 10 runs is 129.661 ms (host walltime is 129.704 ms, 99% percentile time is 130.023).

Average over 10 runs is 129.802 ms (host walltime is 129.848 ms, 99% percentile time is 130.595).

Average over 10 runs is 129.841 ms (host walltime is 129.887 ms, 99% percentile time is 130.664).