Poor performance: TF-TRT versus TensorRT C++ API

Hi, I have trained my own model (using transfer learning with ResNet-50 as the base model) and run TensorRT inference with both TF-TRT and the TensorRT C++ API. I was expecting higher performance from the TRT C++ API implementation compared to TF-TRT, but I got the opposite result. Could you please assist? See the chart below:
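For context, the TF-TRT graph in the first environment was generated with the TF 1.x contrib API. A minimal conversion sketch is below; the frozen-graph path and output node name are hypothetical placeholders, not the exact ones from my script:

```python
# Minimal TF-TRT conversion sketch for TensorFlow 1.x (contrib API).
# "resnet50_frozen.pb" and "logits" are hypothetical placeholders.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt  # TF-TRT integration in TF 1.x

with tf.gfile.GFile("resnet50_frozen.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["logits"],                 # hypothetical output node name
    max_batch_size=32,                  # largest batch size benchmarked
    max_workspace_size_bytes=1 << 30,   # 1 GB workspace for TRT tactics
    precision_mode="FP16")              # T4 tensor cores favor FP16
```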

GPU model: Tesla T4

TF-TRT5 Environment: Ubuntu 16.04.5 LTS | NVIDIA driver 410.72 | TensorRT 5.0 | TensorFlow 1.10 | CUDA 10.0 | ImageNet | script inference.py | docker image nvcr.io/nvidia/tensorflow:18.10-py3

Native TRT5 Environment: Ubuntu 16.04.5 LTS | NVIDIA driver 410.79 | TensorRT 5.0.2 | CUDA 10.0 | script trtexec.cpp | docker image nvcr.io/nvidia/tensorrt:18.11-py3
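Since both stacks report throughput in images/sec, the measurement loop matters as much as the engine itself. A minimal sketch of the kind of timing loop involved is below; the tensor names, input shape, and iteration counts are hypothetical and this is not the actual inference.py from the container:

```python
# Hypothetical throughput-measurement sketch; not the actual inference.py.
import time
import numpy as np
import tensorflow as tf

def measure_throughput(graph_def, input_name, output_name,
                       batch_size, iterations=100, warmup=10):
    """Feed dummy batches through the graph and return images/sec."""
    images = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)
    with tf.Graph().as_default():
        tf.import_graph_def(graph_def, name="")
        with tf.Session() as sess:
            out = sess.graph.get_tensor_by_name(output_name + ":0")
            feed = {input_name + ":0": images}
            for _ in range(warmup):       # exclude engine build/warm-up
                sess.run(out, feed_dict=feed)
            start = time.time()
            for _ in range(iterations):
                sess.run(out, feed_dict=feed)
            elapsed = time.time() - start
    return batch_size * iterations / elapsed
```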

Looking at your graph, it seems you're only seeing an issue once the batch size hits 32. Is that right? If you can get me a small repro, I can look into it further. PM me directly if you don't want to post code or data to the public forum.

Hello, the issue affects the entire range of batch sizes with the native TensorRT C++ API. I ran the same tests with the pre-trained ResNet-50 model as a benchmark, and that is the throughput we expect for our custom model as well; please see the enclosed chart. I will send a PM with the repro.
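For reference, a sweep over the batch sizes in the chart, using the hypothetical measure_throughput helper sketched in my first post, would look like this:

```python
# Hypothetical batch-size sweep using the sketch above.
for bs in [1, 2, 4, 8, 16, 32]:
    ips = measure_throughput(trt_graph, "input", "logits", batch_size=bs)
    print("batch %2d: %7.1f images/sec" % (bs, ips))
```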

Hi Kevin, I have sent you a PM with the repro instructions, thanks in advance for your support!