I use ResNet-50 to run inference on images with a 1080 Ti, and I can't see any speedup when I switch to the optimized graph. The throughput of the original graph is 140.69 images per second, but after optimization it is only 122.67 images per second, so the optimized graph is actually slower.
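For reference, this is roughly how I measure throughput (a minimal sketch; the frozen-graph path, tensor names, and batch size below are placeholders, not my exact values):

import time
import numpy as np
import tensorflow as tf

BATCH_SIZE = 8                                        # placeholder
NUM_BATCHES = 200                                     # placeholder
INPUT_NAME = 'input:0'                                # placeholder tensor names
OUTPUT_NAME = 'resnet_v1_50/predictions/Reshape_1:0'  # placeholder

# Load the frozen graph (original or TF-TRT optimized).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_resnet50.pb', 'rb') as f:  # placeholder path
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
inp = graph.get_tensor_by_name(INPUT_NAME)
out = graph.get_tensor_by_name(OUTPUT_NAME)

batch = np.random.rand(BATCH_SIZE, 224, 224, 3).astype(np.float32)

with tf.Session(graph=graph) as sess:
    # Warm up so engine build / first-run overhead is not timed.
    for _ in range(10):
        sess.run(out, feed_dict={inp: batch})
    start = time.time()
    for _ in range(NUM_BATCHES):
        sess.run(out, feed_dict={inp: batch})
    elapsed = time.time() - start
    print('%.2f images per second' % (BATCH_SIZE * NUM_BATCHES / elapsed))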
I referred to the code here:
https://devtalk.nvidia.com/default/topic/1043578/tensorrt/dont-see-any-speedups-using-tensorrt/post/5294291/#5294291
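For context, the conversion itself roughly follows that post, something like this (a sketch, assuming the TF 1.x tensorflow.contrib.tensorrt API; the path, output name, and conversion parameters are placeholders, not necessarily what I used):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the frozen ResNet-50 graph (placeholder path).
frozen_graph = tf.GraphDef()
with tf.gfile.GFile('frozen_resnet50.pb', 'rb') as f:
    frozen_graph.ParseFromString(f.read())

# Convert with TF-TRT.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['resnet_v1_50/predictions/Reshape_1'],  # placeholder output node
    max_batch_size=8,
    max_workspace_size_bytes=1 << 30,
    precision_mode='FP16',
    minimum_segment_size=3)

Then I count the TensorRT engine nodes with: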
trt_engine_ops = len([1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp'])
This reports 101 TRTEngineOp nodes in the converted graph.
My questions are:
- What could be causing the optimized graph to be slower than the original?
- I was thinking I should use a profiling tool such as nvprof to find out the true cause. Could you give me some guidelines on measuring events like cache misses or other relevant metrics? A sketch of what I was planning to run is below.
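This is roughly what I had in mind (a sketch: infer_resnet50.py is a placeholder name for my inference script, and the metric names are my guesses for a Pascal card, to be checked against nvprof --query-metrics):

# Per-kernel timeline, to compare the kernels launched by the two graphs.
nvprof --print-gpu-trace python infer_resnet50.py

# Cache hit rates, DRAM throughput, and occupancy for each kernel.
nvprof --metrics global_hit_rate,l2_tex_read_hit_rate,dram_read_throughput,achieved_occupancy python infer_resnet50.py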