Inference on large batch size

I recently tested TensorRT inference speed on a 1080 Ti, and the results (in terms of FPS) were unsettling. With small batch sizes (up to 32, say) the results were great: the speed-up over TensorFlow or Caffe was clear. Sadly, the results for large batches such as 256, 512 and 1024 were terrible. Not only did other frameworks beat TensorRT, but on several models inference at batch size 256 was even slower than at batch size 128. For the record, I ran inference for about 50 iterations, excluded the first few because they are slower, and took the mean of the remaining 40 or so. I also checked memory usage during inference and everything seems fine: there’s still plenty of memory left after I load the data set (I don’t load the whole of ImageNet, just roughly 1000 images). I have no idea why I get such poor results. Could you help me explain this? Is TensorRT perhaps not optimized for large batches?
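The measurement procedure described above (about 50 iterations, the first few discarded as warm-up, mean of the rest) can be sketched roughly as follows. This is a minimal sketch, not the script actually used in the post: `benchmark`, `infer`, `warmup` and `timer` are hypothetical names, and `infer` is assumed to block until the GPU has finished its work, since timing an asynchronous call would only measure launch overhead.

```python
import time
from statistics import mean


def benchmark(infer, batch_size, iterations=50, warmup=10, timer=time.perf_counter):
    """Time `infer` and return (mean latency in seconds, images per second).

    `infer` is a hypothetical zero-argument callable that runs one batch and
    returns only after the results are ready (i.e. it synchronizes with the GPU).
    """
    if warmup >= iterations:
        raise ValueError("warmup must be smaller than iterations")

    latencies = []
    for i in range(iterations):
        start = timer()
        infer()
        elapsed = timer() - start
        # The first `warmup` runs are discarded: they include one-off costs
        # such as lazy initialization and are slower than steady state.
        if i >= warmup:
            latencies.append(elapsed)

    mean_latency = mean(latencies)
    return mean_latency, batch_size / mean_latency


if __name__ == "__main__":
    # Stand-in workload for illustration only; replace with a real inference call.
    latency, fps = benchmark(lambda: time.sleep(0.001), batch_size=256)
    print("mean latency: {:.4f} s, throughput: {:.1f} images/s".format(latency, fps))
```

Reporting throughput as batch size divided by mean latency makes runs at different batch sizes directly comparable.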

I have the same problem. I also increased max_workspace_size when building the engine, but got the same result.

Hello,

Can you quantify the TRT performance compared to other frameworks?

Can you provide details on the platforms you are using?

Linux distro and version
GPU type
NVIDIA driver version
CUDA version
cuDNN version
Python version [if using Python]
TensorFlow version
TensorRT version
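Part of the list above can be collected with a short script. The following is only a sketch under a few assumptions: it needs Python 3.8 or later (for `importlib.metadata`), it assumes the frameworks were installed as Python packages named `tensorflow` and `tensorrt`, and it relies on `nvidia-smi` being on the PATH. The CUDA and cuDNN versions are not covered and still have to be looked up separately. The function names are made up for illustration.

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a Python package, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def gpu_info():
    """Ask nvidia-smi for the GPU name and driver version, if it is available."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return "nvidia-smi failed: {}".format(exc)
    return result.stdout.strip() or result.stderr.strip() or "no output"


def collect_environment():
    """Gather the details that can be read without framework-specific calls."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "gpu / driver": gpu_info(),
        "tensorflow": package_version("tensorflow"),
        "tensorrt": package_version("tensorrt"),
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print("{}: {}".format(key, value))
```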

Ubuntu 16.04
P40/P100/V100 - all three show similar behavior
CUDA 9.0 with cuDNN 7.1.4 (from Docker Hub)
Python 3.5
TRT 4.0.1.6
Don’t recall the exact TF version, but it’s certainly not an outdated one
Using Caffe pretrained models

P40, AlexNet, FP32 inference
https://imgur.com/krtOwD1

Same config, larger batches
https://imgur.com/a/Jp0crcP

Solving this is not crucial for me, as I don’t need large-batch inference for anything specific; it’s just something that caught my attention.
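For anyone who wants to check their own numbers for the same pattern (throughput dropping when the batch size goes up), a small helper along these lines may be useful. It is a sketch only: `throughput_regressions` is a made-up name, and the input is assumed to be a mapping from batch size to measured mean latency in seconds per batch.

```python
def throughput_regressions(latency_by_batch):
    """Find adjacent batch sizes where images/second went down.

    `latency_by_batch` maps batch size -> mean latency in seconds per batch.
    Returns a list of (smaller_batch, larger_batch, fps_smaller, fps_larger).
    """
    # Convert each latency to throughput, ordered by batch size.
    fps = [(batch, batch / latency) for batch, latency in sorted(latency_by_batch.items())]

    drops = []
    for (b0, f0), (b1, f1) in zip(fps, fps[1:]):
        # A larger batch should not normally process fewer images per second.
        if f1 < f0:
            drops.append((b0, b1, f0, f1))
    return drops
```

An empty list means throughput never decreased between consecutive batch sizes; any entry points at a pair worth profiling more closely.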

Hello,

This is very interesting. We’d expect TRT to converge to cuDNN-like perf for large batches. Can you please share:

  • the profile from TensorFlow vs. TRT
  • a link to the Caffe pretrained model. There is not just one AlexNet.

Not sure what you mean by profile. As for AlexNet, I’m pretty sure it was https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet but I don’t have access to the full configuration with all the details, as I was just tinkering with TensorRT some time ago.

Also, do you have access to any benchmarks of TensorRT on different GPUs and with different network topologies? I’m not convinced my results are correct, though I don’t know why, and it would be nice to compare them with something directly from NVIDIA.