Inference on an ONNX NN model: per-layer times differ wildly with and without the NVIDIA profiler

I am running inference on a NN model from the ONNX Model Zoo. When I profile it with onnxruntime on just one image, two convolution layers report extremely high throughput.

When I run under the NVIDIA profiler instead, the durations are much closer to the expected values; that is, these two layers behave as expected and take more runtime to do their operations:

Conv Layer 486 kernel time = 1751530 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 35.58 GFLOPS
Conv Layer 490 kernel time = 1741836 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 35.77 GFLOPS

Now, if I do not run the NVIDIA profiler, these two layers show excessive throughput, and at runtime they are much faster than layers with smaller input_dims, output_dims and kernel_filter:

Conv Layer 486 kernel time = 158 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 406 TFLOPS
Conv Layer 490 kernel time = 110 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 583 TFLOPS

  • What is happening inside the GPU when I run with the NVIDIA profiler?
  • What is happening on the GPU without any profiling tool, such that the bigger layers execute faster than the smaller ones?
  • Why am I getting TFLOPS figures far above my theoretical peak of 7.046 TFLOPS (single precision)?
  • What is the right way to estimate the FLOPs? Is there a special factor I should divide my estimate by?
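For reference, this is how the per-layer FLOPs in the numbers above can be estimated — a minimal sketch using the standard dense-convolution count (2 FLOPs per multiply-accumulate, bias and padding overhead ignored), with the shapes taken from Conv layers 486/490:

```python
def conv2d_flops(c_in, c_out, kh, kw, h_out, w_out, batch=1):
    """FLOPs for a dense 2D convolution: one multiply + one add (2 FLOPs)
    per kernel element, per input channel, per output element."""
    return 2 * batch * c_out * c_in * kh * kw * h_out * w_out

# Conv 486/490: input 1x256x264x200, kernel 256x256x3x3, output 1x256x264x200
flops = conv2d_flops(c_in=256, c_out=256, kh=3, kw=3, h_out=264, w_out=200)
print(flops)  # 62285414400, i.e. ~62.3 GFLOP per forward pass

# Throughput implied by the measured kernel times:
print(flops / 1_751_530e-6 / 1e9)   # ≈ 35.6 GFLOPS (under the NVIDIA profiler)
print(flops / 158e-6 / 1e12)        # ≈ 394 TFLOPS (without the profiler)
```

This reproduces the ~35.6 GFLOPS figure under the profiler, and confirms that the no-profiler timings imply hundreds of TFLOPS — far beyond the hardware peak — so the question is whether the formula is wrong or the 158 µs measurement is.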

CUDA Version: 12
Driver Version: 525.125.06
OS: Linux
GPU: GTX 1070