Understanding Profiling gpu-trace output of Inference - TFTRT

Hi, I am using the TFTRT/example image classification application from Git. I am running inference for ResNet50 on ImageNet with 2 iterations, 1 warm-up, and batch size 4. I am trying to profile with nvprof and load the timeline trace data into a visualizer to understand the GPU execution and when inference happens.
The output shows several kernel calls (7~8 sec) at the beginning (what exactly is happening here?), and inference seems to happen at the very end, since I see 2 instances of similar kernel calls at the end of the run. Can someone elaborate on what is happening during the first 7~8 sec?


For TensorRT, each layer may launch one or more kernels to perform its operations. The exact kernels launched depend on the optimized network and the hardware present. Depending on the choices of the builder, there may be many additional operations that reorder data interspersed with layer computations. Some reformat operations may be implemented as device-to-device memory copies, others as custom kernels.

So using nvprof to decode the kernel names back to layers in the original network can be complicated. When interpreting results from the profiler, it is recommended to start with the IProfiler interface to get per-layer timing information before using nvprof to get per-kernel timing information.
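As a rough illustration of the IProfiler idea, here is a minimal Python sketch that accumulates per-layer times. It assumes a standalone TensorRT execution context, where the profiler is attached with `context.profiler = ...`. The class name `LayerTimeProfiler` and its `summary` helper are made up for this example, and the fallback to `object` is only there so the sketch can be tried without TensorRT installed. With TF-TRT the execution context is created inside the TensorFlow op, so this may not be directly reachable from the example scripts.

```python
from collections import defaultdict

try:
    import tensorrt as trt
    _ProfilerBase = trt.IProfiler
except ImportError:
    # Assumption for this sketch: fall back to a plain class so the
    # aggregation logic can be exercised without TensorRT installed.
    _ProfilerBase = object


class LayerTimeProfiler(_ProfilerBase):
    """Hypothetical helper: sums the time TensorRT reports for each layer."""

    def __init__(self):
        super().__init__()
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    def report_layer_time(self, layer_name, ms):
        # TensorRT calls this once per layer for every profiled execution.
        self.total_ms[layer_name] += ms
        self.calls[layer_name] += 1

    def summary(self):
        # Rows of (layer name, total ms, mean ms per call), slowest first.
        rows = [
            (name, total, total / self.calls[name])
            for name, total in self.total_ms.items()
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)


# With a real engine (not shown here) the profiler would be attached as:
#   context.profiler = LayerTimeProfiler()
```

After a few inference runs, `summary()` gives a per-layer view to compare against the per-kernel view from nvprof.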

One way to limit the scope of nvprof is to split the work into phases:
First phase: Structure the application to build and then serialize the engines.
Second phase: Load the serialized engines and run inference.
Third phase: Run nvprof on the second phase only.
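The phases above could be laid out roughly like the following Python skeleton. This is only a sketch of the structure: `build_engine_bytes` and `run_inference` are hypothetical placeholders for the real build and execution code, and the script name, `--engine` path, and argument names are assumptions for illustration.

```python
import argparse
import time
from pathlib import Path


def build_engine_bytes():
    # Hypothetical placeholder for the slow build/optimization step.
    return b"serialized-engine-placeholder"


def run_inference(engine_bytes, iterations):
    # Hypothetical placeholder: real code would deserialize the engine
    # and execute it once per iteration.
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        # ... execute the engine here ...
        timings.append(time.perf_counter() - start)
    return timings


def build_phase(engine_path):
    # First phase: build once and serialize to disk.
    Path(engine_path).write_bytes(build_engine_bytes())


def infer_phase(engine_path, iterations):
    # Second phase: load the serialized engine and run inference only.
    engine_bytes = Path(engine_path).read_bytes()
    return run_inference(engine_bytes, iterations)


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("phase", choices=["build", "infer"])
    parser.add_argument("--engine", default="model.engine")
    parser.add_argument("--iterations", type=int, default=2)
    args = parser.parse_args(argv)
    if args.phase == "build":
        build_phase(args.engine)
    else:
        timings = infer_phase(args.engine, args.iterations)
        print(f"ran {len(timings)} iterations")
```

With a `main()` call added at the bottom of the script, the build phase would be run once without the profiler, and only the `infer` invocation would be wrapped in nvprof, so the trace contains the inference kernels and not the build-time ones.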

Hi NVES, thanks for the reply. So basically, at a high level, is my assumption that inference is happening at the end wrong?
I have seen IProfiler mentioned in the nvprof documentation, but I cannot work out, nor find documentation on, how to add IProfiler to the TFTRT example code, which is basically TF code. Can you point out how to implement IProfiler, and where in the TFTRT code it should go?

Thanks in advance.

I think for TFTRT, you can use TensorBoard or the TensorFlow timeline?