I am running inference on a neural-network model from the ONNX Model Zoo, and while profiling with onnxruntime on just one image, I am getting extremely high throughput for two convolution layers.
When I run under the NVIDIA profiler, the durations are much closer to the expected values, i.e. these two layers behave as expected: they need more runtime to do the operations.
Conv Layer 486 kernel time = 1751530 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 35.58 GFLOPS
Conv Layer 490 kernel time = 1741836 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 35.77 GFLOPS
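For reference, this is roughly how the NVIDIA-profiled numbers above can be collected (a sketch only: I am assuming Nsight Systems as the "NVIDIA profiler", and `infer.py` is a placeholder for the actual inference script, not a name from my setup):

```shell
# Profile the whole inference run with Nsight Systems; --stats=true prints
# per-kernel summary tables (including CUDA kernel durations) after the run.
nsys profile --stats=true -o onnx_report python infer.py

# Older command-line profiler, which should still work on a Pascal-class
# GPU such as the GTX 1070:
nvprof python infer.py
```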
Now, if I do not run the NVIDIA profiler, these two layers show excessive throughput and are much faster at runtime than layers with smaller input_dims, output_dims, and kernel_filters:
Conv Layer 486 kernel time = 158 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 406 TFLOPS
Conv Layer 490 kernel time = 110 microseconds, input_dim = 1x256x264x200, kernel_filter = 256x256x3x3, output_dim = 1x256x264x200, Throughput = 583 TFLOPS
- What is happening inside the GPU when I use the NVIDIA profiler?
- What is happening on the GPU without any profiling tool, such that bigger layers execute faster than smaller ones?
- Why am I getting excessive TFLOPS when my theoretical single-precision peak is 7.046 TFLOPS?
- What is the right way to estimate the FLOPs? Is there any special factor I should divide my estimate by?
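This is the estimate I am using for the Conv-layer FLOPs above (a sketch, assuming the common convention of counting one multiply-accumulate as 2 FLOPs; with the profiled kernel time it reproduces the ~35.6 GFLOPS figure for layer 486):

```python
# FLOP estimate for Conv Layer 486:
# input 1x256x264x200, kernel_filter 256x256x3x3, output 1x256x264x200.
c_out, c_in, k_h, k_w = 256, 256, 3, 3   # kernel_filter dimensions
h_out, w_out = 264, 200                  # output spatial dims (batch = 1)

# One MAC per (output channel, input channel, kernel tap, output pixel).
macs = c_out * c_in * k_h * k_w * h_out * w_out
flops = 2 * macs                         # multiply + add per MAC

kernel_time_s = 1751530e-6               # profiled kernel time (1751530 us)
gflops = flops / kernel_time_s / 1e9

print(f"FLOPs = {flops}")                # ~62.3 GFLOP for this layer
print(f"Throughput = {gflops:.2f} GFLOPS")
```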
CUDA Version: 12
Driver Version: 525.125.06
OS: Linux
GPU: GTX 1070