I am working on a benchmarking tool that simulates PyTorch convolutional layers. PyTorch uses cuDNN for convolution operations, and so does the benchmarking tool.
As far as I can tell from cuDNN API logs and Nsight Systems traces, PyTorch and the benchmarking tool make the same cuDNN function calls.
However, the PyTorch convolutions are faster: as I increase the number of iterations, the execution time of the PyTorch CUDA kernels decreases significantly. By contrast, increasing the number of iterations in the benchmarking tool does not significantly change its kernels' execution time.
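For concreteness, here is a minimal sketch of how I drive the PyTorch side (the layer configuration, shapes, and iteration count are illustrative, not my exact setup); the per-kernel execution times come from the profiler's GPU trace, not from wall-clock timing:

```python
import torch

# Illustrative layer and input shapes -- not the exact configuration I benchmark.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

iters = 1000  # illustrative; I vary this and compare per-kernel times
with torch.no_grad():
    for _ in range(iters):
        y = conv(x)
torch.cuda.synchronize()  # make sure all kernels have finished
```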
I suspected the PyTorch kernels might be making effective use of the GPU cache, so I changed the benchmarking tool to reuse the same data across consecutive iterations. However, this had little effect on the execution time of the benchmark's CUDA kernels: increasing the number of iterations still did not change it significantly.
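In PyTorch terms, the change to the benchmarking tool amounts to the following (a sketch for clarity; the actual tool calls the cuDNN API directly):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
iters = 1000  # illustrative

# Before: a fresh input tensor (new memory) on every iteration.
for _ in range(iters):
    x = torch.randn(32, 64, 56, 56, device="cuda")
    y = conv(x)

# After: one preallocated input reused across all iterations.
x = torch.randn(32, 64, 56, 56, device="cuda")
for _ in range(iters):
    y = conv(x)
torch.cuda.synchronize()
```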
I would appreciate any pointers on what factors could make the PyTorch CUDA kernels run faster as the number of iterations increases.
I am using the nvprof tool to measure CUDA kernel execution times. nvprof can also report many other kernel performance metrics, but which of them should I look at?
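As a cross-check on the nvprof numbers, per-iteration GPU time could also be measured from within PyTorch using CUDA events (a sketch; this measures whole-iteration GPU time rather than the per-kernel times nvprof reports):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    for i in range(1000):  # illustrative iteration count
        start.record()
        y = conv(x)
        end.record()
        end.synchronize()  # wait for this iteration's kernels to finish
        print(i, start.elapsed_time(end))  # elapsed GPU time in milliseconds
```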