Factors affecting CUDA kernel performance

I am working on a benchmarking tool that simulates PyTorch convolutional layers. PyTorch uses cuDNN for convolution operations, and so does the benchmarking tool.

As far as I can tell from cuDNN API logs and Nsight Systems traces, there is no difference in the cuDNN function calls between PyTorch and the benchmarking tool.
However, PyTorch convolutions are faster: as I increase the number of iterations, the execution time of the PyTorch CUDA kernels decreases significantly. Increasing the number of iterations in the benchmark, on the other hand, does not significantly change the kernels' execution time.

I thought this could be the result of the PyTorch kernels making effective use of the GPU cache, so I changed the benchmarking tool to use the same data in consecutive iterations. However, this had little effect: the execution time of the benchmark CUDA kernels still did not change significantly as the number of iterations increased.
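
For illustration, here is a rough PyTorch analogue of that change (my actual tool calls cuDNN directly, so the layer shape and iteration count here are arbitrary):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()

# The same input tensor is fed to every iteration, so consecutive kernel
# launches see identical data (previously a fresh tensor was generated
# for each iteration).
x = torch.randn(32, 64, 56, 56, device="cuda")

for _ in range(100):
    y = conv(x)
torch.cuda.synchronize()
```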

I would appreciate any help in understanding which factors could make the PyTorch CUDA kernels run faster as the number of iterations increases.

I am using the nvprof tool to measure CUDA kernel execution times. It can also report many other kernel performance metrics, but which ones should I look at?

I profiled PyTorch and the benchmark tool, collecting all metrics for one CUDA kernel that shows a particularly large difference in execution time.

I profiled the two applications for three different numbers of iterations: 1, 10 and 100. In all three cases, the number of kernel invocations is the same for the two applications.

I collected all metrics (the --metrics all nvprof option) for this kernel. Almost all of them are about the same for the PyTorch application and the benchmark tool; only the dram_write_* metrics differ.

I attach plots showing these metrics for each application.
The PyTorch metrics are orange, the benchmark metrics are blue. The coloured areas show the minimum and maximum metric values, the lines show the average values reported by nvprof, and the x-axis is the number of iterations.

So the PyTorch kernel performs many more DRAM write transactions, and yet its execution time is shorter. What could this mean?

I don’t think PyTorch uses cuDNN but rather CUTLASS, which would explain this result.
This is a simpler example of how PyTorch should be constructed: GitHub - NVlabs/tiny-cuda-nn: Lightning fast C++/CUDA neural network framework

PyTorch most certainly uses cuDNN.


Hi @pyotr777

I am not fully aware of how PyTorch is constructed internally; I have only had a rough look at the source code a few days ago. However, I hope this helps. In general, for CUDA kernels, the first things we look at are the following:

  • Keeping memory reserved from the beginning until the end, i.e. no allocations should happen during the actual program execution (see the sketch after this list).
  • Keeping the buffers in cache or, at least, in a fast memory region. The data should be strided as little as possible to avoid cache thrashing in the SMs.
  • Making the memory accesses coalesced (not sure if this is still as important as it used to be, but I think so). That may not be your issue here, though.
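
As a rough PyTorch-level sketch of the first two points (the shapes are arbitrary; a hand-written CUDA/cuDNN benchmark would do the equivalent with preallocated device buffers):

```python
import torch

# Allocate the parameters and inputs once, before the measured loop:
# no new allocations should happen inside it.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

# Densely strided (contiguous) data is friendlier to the SM caches.
assert x.is_contiguous()

for _ in range(100):
    y = conv(x)  # output memory is reused via PyTorch's caching allocator after the first iteration
torch.cuda.synchronize()
```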

Trying to correlate the DRAM bandwidth with your plots:

  • Making your host memory pinned or properly mapped is always a good point.
  • Hiding communication with processing: performing memory transfers while computing on a different region of memory (see the sketch below).
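
A minimal PyTorch-level sketch of both points, with made-up shapes (the same idea applies to cudaMemcpyAsync plus streams in a cuDNN benchmark):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
copy_stream = torch.cuda.Stream()

# Pinned (page-locked) host memory enables truly asynchronous H2D copies.
host_batch = torch.randn(32, 64, 56, 56).pin_memory()
dev_batch = torch.empty_like(host_batch, device="cuda")

# Issue the copy on a side stream so it can overlap with compute on the default stream.
with torch.cuda.stream(copy_stream):
    dev_batch.copy_(host_batch, non_blocking=True)

# Compute on a different region of memory while the copy is in flight.
other = torch.randn(32, 64, 56, 56, device="cuda")
y = conv(other)

# Make the default stream wait for the copy before using the new batch.
torch.cuda.current_stream().wait_stream(copy_stream)
y2 = conv(dev_batch)
torch.cuda.synchronize()
```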

Hope this helps you understand a bit of what’s under the hood.

Regards,
Leon

Thank you @luis.leon for your valuable comment.

I have found that the main reason the PyTorch kernels are faster is the GPU “warmup” effect. PyTorch runs the whole CNN model training, so the GPU starts to run faster (by raising its clock rates, I believe). The benchmark, on the other hand, runs short tasks that time individual training operations, which have little effect on the GPU clocks.
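
To account for this in the benchmark, running a number of untimed warm-up iterations before measuring helps. Here is a minimal PyTorch sketch (the shapes and iteration counts are arbitrary):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

# Untimed warm-up iterations let the GPU clocks ramp up before measurement.
for _ in range(50):
    conv(x)
torch.cuda.synchronize()

# Time with CUDA events to measure GPU time rather than host time.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    conv(x)
end.record()
torch.cuda.synchronize()
print(f"avg per iteration: {start.elapsed_time(end) / 100:.3f} ms")
```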