Performance drop for large batches and float16


So I’ve been testing TensorRT recently, and batch size 32 seems to be the sweet spot, yielding the best throughput. For example, using the Caffe pretrained AlexNet I get about 2500 FPS at batch 32 (averaged over 100 iterations) but only about 1600 FPS at batch 64. Increasing the max workspace size from 1 GB to 5 or 10 GB changes nothing.
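For reference, here is a minimal sketch of how throughput numbers like the ones above could be measured. This is only an illustration, not the poster’s actual script (which is not included in the thread); `infer_fn` is a hypothetical stand-in for the real TensorRT execution call.

```python
import time

def measure_fps(infer_fn, batch_size, iterations=100):
    """Run `iterations` inference calls and return throughput in
    images per second (FPS = batch_size * iterations / elapsed).

    `infer_fn` is a placeholder for the real inference call; it
    receives the batch size on each invocation."""
    start = time.perf_counter()
    for _ in range(iterations):
        infer_fn(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed

# Example with a dummy inference function: 2500 FPS at batch 32
# over 100 iterations corresponds to roughly 1.28 s of wall time,
# since 32 * 100 / 2500 = 1.28.
fps = measure_fps(lambda b: None, batch_size=32, iterations=100)
print(f"{fps:.0f} FPS")
```

Note that by this definition a higher batch size should normally raise FPS (more images per kernel launch), which is why the drop from 2500 to 1600 FPS between batch 32 and batch 64 is surprising.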

I’m using Ubuntu 16.04 with TensorRT 5, running on a single 1080 Ti. Can you help me understand this weird phenomenon? I’m happy to share the exact scripts I’m using, though perhaps via private message, so if you’re interested please let me know. I can also share the exact specification of my machine.

Thanks in advance.

P.S. I also tried running FP16 inference, and it is faster only at batch 1; for every other batch size I see a performance drop as well, which seems odd. What do you think?

P.S.2. If you need any logs/scripts/hardware info from me, please tell me exactly what you need so we can expedite this process.


Can you provide a package of the code/model/scripts you are using for converting the pretrained model, performing inference on it, and recording the FPS? You can share it via a private message if you’d like.

Please also provide the exact TensorRT/CUDA/CUDNN versions.

NVIDIA Enterprise Support

I sent you a private message.