I am using TF-TRT to accelerate inference of my Inception-ResNet-v2 (.pb) model. When I call session.run() with the same test image, the results are as follows.
The results look normal, yes. But when I wrap the model in a Flask HTTP service and use the wrk tool (GitHub - wg/wrk: Modern HTTP benchmarking tool) to test the server QPS, the session.run() times are as follows:
TF model: 28 ms;
TF-TRT + FP32: 28 ms;
I have checked my code carefully, so my question is: why is there no speedup?
The environment is as follows:
docker image from NGC: nvidia-tensorflow-19.06-py3
hardware: NVIDIA T4
I look forward to your reply. Thank you!
Have you recorded these session run times over many runs and taken the average for a fair comparison? When you involve a web service, there will likely be increased latency from the overhead of the extra web communication, HTTP requests, etc.
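To make the comparison fair, it helps to warm up first (the first few session.run() calls include graph/engine initialization) and then report statistics over many iterations. A minimal sketch of such a timing loop; `fn` is a placeholder for your session.run() call, and the stand-in workload at the bottom is just for illustration:

```python
import time

def benchmark(fn, warmup=10, iters=100):
    # Warm-up runs: not timed, absorbs one-time initialization cost.
    for _ in range(warmup):
        fn()
    # Timed runs: collect per-call latency in milliseconds.
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return {
        "mean_ms": sum(times) / len(times),
        "p50_ms": times[len(times) // 2],
        "p99_ms": times[int(len(times) * 0.99)],
    }

# Stand-in workload instead of sess.run(output, feed_dict=...):
stats = benchmark(lambda: sum(i * i for i in range(10000)))
print(stats)
```

Comparing p50 and p99 (rather than a single measurement) also makes it easier to see whether the service overhead, not the model, dominates.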
First, thanks for your kind reply. I have two questions:
(1) When I use one thread with one HTTP connection, with the following parameters:
wrk -t1 -c1 -d30s xxx
then the session run time is stable, mostly around 28 ms, so the average is a fair indicator. How, then, do you explain the result above?
(2) Besides, when I run with 16 threads and 16 HTTP connections in total (one per thread), as in the following command:
wrk -t16 -c16 -d30s xxx
the session run time is not stable; it varies between 23 ms and 79 ms. In my opinion, the session run time should be stable even if I send many concurrent HTTP connections. Awaiting your reply!
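One likely contributor to the variance (an assumption, not a diagnosis of your exact setup): if all requests share a single GPU or a single TensorFlow session, concurrent requests are serialized, so each request's measured latency is its own compute time plus the time it spends queued behind the others. A toy pure-Python sketch, with a Lock standing in for the shared GPU, shows the spread:

```python
import threading
import time

gpu_lock = threading.Lock()   # stands in for the single shared GPU/session
COMPUTE_S = 0.005             # pretend each inference takes 5 ms

def handle_request(latencies, i):
    start = time.perf_counter()
    with gpu_lock:             # concurrent requests queue up here
        time.sleep(COMPUTE_S)  # stand-in for session.run()
    latencies[i] = (time.perf_counter() - start) * 1000.0

latencies = [0.0] * 16
threads = [threading.Thread(target=handle_request, args=(latencies, i))
           for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The first request waits ~1x the compute time, the last ~16x,
# so latency varies widely even though compute per request is constant.
print(f"min {min(latencies):.1f} ms, max {max(latencies):.1f} ms")
```

With 16 concurrent connections, the fastest request finishes in roughly one compute interval and the slowest in roughly sixteen, which mirrors the kind of 23-79 ms spread you observed.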
Sorry, I can't really help with the details of the wrk tool and HTTP connections. The developers of that tool would be able to explain why your latencies vary better than I could. You can open an issue on their GitHub page for further assistance: Issues · wg/wrk · GitHub