Hi,
I am new here and am working on optimizing our detection model with TensorRT. However, when I set max_batch_size > 1, inference time increases proportionally with the batch size. This happens both when converting with the python2 onnx_to_tensorrt.py approach and when running trtexec directly. I first noticed the problem with my own detection model, then switched to the sample model under the TensorRT 5.0.2.6 release at samples/python/yolov3_onnx, and it shows the same behavior (logs below).
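For reference, this is roughly how I time the engine with trtexec (a sketch of my invocation; the model filename here is a placeholder for my local path):

```shell
# Build and time the sample YOLOv3 ONNX model at two batch sizes.
# Only --batch changes between the two runs.
trtexec --onnx=yolov3.onnx --batch=1 --fp16 --avgRuns=10
trtexec --onnx=yolov3.onnx --batch=4 --fp16 --avgRuns=10
```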
Basically:
batch size 1, fp16: inference time is about 10 ms
batch size 4, fp16: about 39 ms
batch size 1, fp32: about 34.4 ms
batch size 4, fp32: about 129 ms
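To put the same numbers in per-image terms (just arithmetic on the averages above, to show there is essentially no batching benefit):

```python
# Per-image latency derived from the averaged trtexec times quoted above (ms).
times_ms = {
    ("fp16", 1): 10.34,
    ("fp16", 4): 39.10,
    ("fp32", 1): 34.35,
    ("fp32", 4): 129.0,
}

for (precision, batch), total in sorted(times_ms.items()):
    per_image = total / batch
    print(f"{precision} batch={batch}: {total:6.2f} ms total, {per_image:6.2f} ms/image")
```

So per-image latency is almost flat (about 10.3 ms vs 9.8 ms in fp16), i.e. batch 4 costs nearly 4x the time of batch 1.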
The environment I am using: Ubuntu 16.04,
Driver Version: 418.39, CUDA Version: 10.1
on a Tesla T4, as shown:
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 68C P0 71W / 70W | 2881MiB / 15079MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:5E:00.0 Off | 0 |
| N/A 30C P8 9W / 70W | 10MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Cuda compilation tools, release 7.5, V7.5.17
TensorRT-5.0.2.6.Ubuntu-16.04.4.x86_64-gnu.cuda-10.0.cudnn7.3.tar.gz
Can somebody explain why batching gives no speedup here?
Thanks.
xxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ ll *log
-rw-rw-r-- 1 aifi aifi 1646 Mar 27 20:35 yolov3cuda0bat1fp16.log
-rw-rw-r-- 1 aifi aifi 1640 Mar 27 20:27 yolov3cuda0bat1fp32.log
-rw-rw-r-- 1 aifi aifi 1645 Mar 27 20:30 yolov3cuda0bat4fp16.log
-rw-rw-r-- 1 aifi aifi 1639 Mar 27 20:27 yolov3cuda0bat4fp32.log
xxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat1fp16.log
Average over 10 runs is 17.3652 ms (host walltime is 17.4882 ms, 99% percentile time is 18.1637).
Average over 10 runs is 13.7891 ms (host walltime is 13.8339 ms, 99% percentile time is 14.0738).
Average over 10 runs is 10.3425 ms (host walltime is 10.3858 ms, 99% percentile time is 10.3828).
Average over 10 runs is 10.3546 ms (host walltime is 10.397 ms, 99% percentile time is 10.496).
Average over 10 runs is 10.3128 ms (host walltime is 10.3576 ms, 99% percentile time is 10.3372).
Average over 10 runs is 10.3787 ms (host walltime is 10.4212 ms, 99% percentile time is 10.4978).
Average over 10 runs is 10.337 ms (host walltime is 10.3795 ms, 99% percentile time is 10.3498).
Average over 10 runs is 10.3523 ms (host walltime is 10.3958 ms, 99% percentile time is 10.4998).
Average over 10 runs is 10.3498 ms (host walltime is 10.3923 ms, 99% percentile time is 10.3673).
Average over 10 runs is 10.3173 ms (host walltime is 10.3602 ms, 99% percentile time is 10.4877).
xxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat4fp16.log
Average over 10 runs is 45.7616 ms (host walltime is 45.8084 ms, 99% percentile time is 60.2297).
Average over 10 runs is 38.8411 ms (host walltime is 38.8866 ms, 99% percentile time is 39.2171).
Average over 10 runs is 39.1089 ms (host walltime is 39.1546 ms, 99% percentile time is 39.8079).
Average over 10 runs is 39.1751 ms (host walltime is 39.2205 ms, 99% percentile time is 39.6014).
Average over 10 runs is 39.1758 ms (host walltime is 39.221 ms, 99% percentile time is 39.9894).
Average over 10 runs is 39.0968 ms (host walltime is 39.1409 ms, 99% percentile time is 39.594).
Average over 10 runs is 38.8867 ms (host walltime is 38.9312 ms, 99% percentile time is 39.467).
Average over 10 runs is 39.0935 ms (host walltime is 39.1381 ms, 99% percentile time is 39.9408).
Average over 10 runs is 39.272 ms (host walltime is 39.3178 ms, 99% percentile time is 40.1793).
Average over 10 runs is 39.0277 ms (host walltime is 39.0727 ms, 99% percentile time is 39.4325).
xxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat1fp32.log
Average over 10 runs is 44.0139 ms (host walltime is 44.0628 ms, 99% percentile time is 67.2938).
Average over 10 runs is 34.338 ms (host walltime is 34.3821 ms, 99% percentile time is 35.2482).
Average over 10 runs is 34.3672 ms (host walltime is 34.4122 ms, 99% percentile time is 35.2721).
Average over 10 runs is 34.5807 ms (host walltime is 34.6255 ms, 99% percentile time is 35.3508).
Average over 10 runs is 34.2389 ms (host walltime is 34.286 ms, 99% percentile time is 34.777).
Average over 10 runs is 34.3533 ms (host walltime is 34.3982 ms, 99% percentile time is 34.7647).
Average over 10 runs is 34.4273 ms (host walltime is 34.4717 ms, 99% percentile time is 35.0437).
Average over 10 runs is 34.348 ms (host walltime is 34.3925 ms, 99% percentile time is 35.1601).
Average over 10 runs is 34.4369 ms (host walltime is 34.4814 ms, 99% percentile time is 35.0536).
Average over 10 runs is 34.3577 ms (host walltime is 34.4018 ms, 99% percentile time is 34.7955).
xxxxxx:~/wendy-tensorrt/TensorRT-5.0.2.6.1/samples/python/yolov3_onnx$ tail *bat4fp32.log
Average over 10 runs is 137.07 ms (host walltime is 137.14 ms, 99% percentile time is 215.524).
Average over 10 runs is 128.768 ms (host walltime is 128.812 ms, 99% percentile time is 129.53).
Average over 10 runs is 128.943 ms (host walltime is 128.987 ms, 99% percentile time is 129.602).
Average over 10 runs is 129.165 ms (host walltime is 129.208 ms, 99% percentile time is 129.451).
Average over 10 runs is 129.211 ms (host walltime is 129.255 ms, 99% percentile time is 129.747).
Average over 10 runs is 129.358 ms (host walltime is 129.4 ms, 99% percentile time is 130.052).
Average over 10 runs is 129.455 ms (host walltime is 129.498 ms, 99% percentile time is 130.278).
Average over 10 runs is 129.661 ms (host walltime is 129.704 ms, 99% percentile time is 130.023).
Average over 10 runs is 129.802 ms (host walltime is 129.848 ms, 99% percentile time is 130.595).
Average over 10 runs is 129.841 ms (host walltime is 129.887 ms, 99% percentile time is 130.664).