TensorRT inference results are very strange


I am using TF-TRT to accelerate inference for my Inception-ResNet-v2 (.pb) model. When I time the session.run() call with the same test image, the results are as follows:

TF model: 24ms;
TF-TRT + FP32: 13ms;
TF-TRT + FP16: 7ms;

This seems normal. But when I wrap the model in a Flask HTTP service and use the wrk tool (https://github.com/wg/wrk) to test the server QPS, the session run times are as follows:

TF model: 28ms;
TF-TRT + FP32: 28ms;

I have checked my code carefully, so the question is: why?

The environment is as follows:

docker image from NGC: nvidia-tensorflow-19.06-py3
hardware: NVIDIA T4

I am looking forward to your reply, thank you!


Have you recorded these session run times over many runs and taken the average for a fair comparison? When you involve a web service, there will likely be increased latency from the overhead of the extra web communication, HTTP requests, etc.
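To make the averaging suggestion above concrete, here is a minimal timing-harness sketch. It discards warm-up iterations (the first few session.run() calls pay one-time CUDA/TensorRT engine initialization costs) and averages the rest. The `run_once` callable is a placeholder for the real inference call; the dummy workload in the usage line is an assumption for illustration only.

```python
import time

def benchmark(run_once, warmup=20, iters=200):
    """Average the latency of `run_once` over many iterations.

    `run_once` stands in for the real inference call, e.g. a closure
    around sess.run(output, feed_dict={...}) -- a placeholder here.
    """
    for _ in range(warmup):   # discard warm-up: engine build, caches, etc.
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0  # mean latency in milliseconds

# Usage with a dummy 1 ms workload standing in for session.run():
mean_ms = benchmark(lambda: time.sleep(0.001))
```

Comparing a warmed-up average like this against the per-request times reported by the HTTP benchmark isolates how much of the gap comes from the service layer rather than the model.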

NVIDIA Enterprise Support

Firstly, thanks for your kind reply. I have two questions:

(1) When I use one thread with one HTTP connection, with the following parameters,

wrk -t1 -c1 -d30s xxx

the session run time is stable, mostly around 28ms, so the time can serve as an average indicator. How, then, do you explain the result above?

(2) Besides, when I run with 16 threads and 16 HTTP connections in total, with the following command,

wrk -t16 -c16 -d30s xxx

the session run time is not stable; it varies between 23ms and 79ms. In my opinion, the session run time should be stable even if I send many concurrent HTTP connections. Awaiting your reply!
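One plausible explanation (my illustration, not something confirmed in this thread) is queueing: if the GPU, or a single TensorFlow session guarded by a lock in the Flask app, serves one request at a time, then 16 concurrent clients queue behind each other, and the measured per-request time spreads out even though each inference itself takes a constant amount. The sketch below simulates this with a stand-in 5 ms "inference" protected by a lock; the numbers are assumptions, not measurements from the reported setup.

```python
import threading
import time

INFER_MS = 5                  # stand-in per-request inference time
gpu_lock = threading.Lock()   # models a session serving one request at a time
lat_lock = threading.Lock()
latencies = []

def client(requests=5):
    """One wrk-style client issuing requests back to back."""
    for _ in range(requests):
        t0 = time.perf_counter()
        with gpu_lock:                     # concurrent requests queue here
            time.sleep(INFER_MS / 1000.0)  # the "session.run" itself
        ms = (time.perf_counter() - t0) * 1000.0
        with lat_lock:
            latencies.append(ms)

# 16 concurrent clients, mirroring wrk -t16 -c16
threads = [threading.Thread(target=client) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"min {min(latencies):.1f} ms, max {max(latencies):.1f} ms")
```

Running this shows the minimum latency close to the standalone inference time and the maximum several times larger, purely from waiting in the queue, which is the same shape as the 23ms-to-79ms spread described above.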

Hi yulifu_123,

Sorry, I can't really help with the details of the wrk tool and HTTP connections. The developers of that tool would be better able than I am to explain why your latencies vary. You can post an issue on their GitHub page to get further assistance: https://github.com/wg/wrk/issues

NVIDIA Enterprise Support