Factors affecting CUDA kernel performance

I am working on a benchmarking tool that simulates PyTorch convolutional layers. PyTorch uses cuDNN for convolution operations, and so does the benchmarking tool.

As far as I can tell from cuDNN API logs and Nsight Systems traces, there is no difference in the cuDNN function calls between PyTorch and the benchmarking tool.
However, PyTorch convolutions are faster: as I increase the number of iterations, the execution time of the PyTorch CUDA kernels decreases significantly. Increasing the number of iterations in the benchmark, on the other hand, does not significantly change the kernels' execution time.

I thought this could be the result of the PyTorch kernels making effective use of the GPU cache, so I changed the benchmarking tool to use the same data in consecutive iterations. However, this had little effect: the execution time of the benchmark CUDA kernels still did not change significantly as the number of iterations increased.
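
For illustration, here is a rough PyTorch analogue of that change (my actual tool calls cuDNN directly, so the layer shape and iteration count here are arbitrary):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()

# The same input tensor is fed to every iteration, so consecutive kernel
# launches see identical data (previously a fresh tensor was generated
# for each iteration).
x = torch.randn(32, 64, 56, 56, device="cuda")

for _ in range(100):
    y = conv(x)
torch.cuda.synchronize()
```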

I would appreciate any help in understanding which factors could make the PyTorch CUDA kernels run faster as the number of iterations increases.

I am using the nvprof tool to measure CUDA kernel execution times. It can also report many other kernel performance metrics, but which ones should I look at?

I profiled PyTorch and the benchmark tool, collecting all metrics for one CUDA kernel that shows a particularly large difference in execution time.

I profiled the two applications for three different numbers of iterations: 1, 10 and 100. In all three cases, the number of kernel invocations is the same for the two applications.

I collected all metrics (the --metrics all nvprof option) for this kernel. Almost all of them are about the same for the PyTorch application and the benchmark tool; only the dram_write_* metrics differ.

I attach plots showing these metrics for each application.
The PyTorch metrics are orange, the benchmark metrics are blue. The coloured areas show the minimum and maximum metric values, the lines show the average values reported by nvprof, and the x-axis is the number of iterations.

So the PyTorch kernel performs many more DRAM write transactions, and yet its execution time is shorter. What could this mean?

I don’t think PyTorch uses cuDNN but rather CUTLASS, which would explain this result.
This is a simpler example of how PyTorch should be constructed: GitHub - NVlabs/tiny-cuda-nn: Lightning fast C++/CUDA neural network framework

PyTorch most certainly uses cuDNN.


Hi @pyotr777

I am not fully aware of how PyTorch is constructed internally; I have only had a rough look at the source code a few days ago. However, I hope this helps. In general, for CUDA kernels, the first things we look at are the following:

  • Keeping memory reserved from the beginning until the end, i.e. no allocations should happen during the actual program execution (see the sketch after this list).
  • Keeping the buffers in cache or, at least, in a fast memory region. The data should be strided as little as possible to avoid cache thrashing in the SMs.
  • Making the memory accesses coalesced (not sure if this is still as important as it used to be, but I think so). That may not be your issue here, though.
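
As a rough PyTorch-level sketch of the first two points (the shapes are arbitrary; a hand-written CUDA/cuDNN benchmark would do the equivalent with preallocated device buffers):

```python
import torch

# Allocate the parameters and inputs once, before the measured loop:
# no new allocations should happen inside it.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

# Densely strided (contiguous) data is friendlier to the SM caches.
assert x.is_contiguous()

for _ in range(100):
    y = conv(x)  # output memory is reused via PyTorch's caching allocator after the first iteration
torch.cuda.synchronize()
```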

Trying to correlate the DRAM bandwidth with your plots:

  • Making your host memory pinned or properly mapped is always a good point.
  • Hiding communication with processing: performing memory transfers while computing on a different region of memory (see the sketch below).
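
A minimal PyTorch-level sketch of both points, with made-up shapes (the same idea applies to cudaMemcpyAsync plus streams in a cuDNN benchmark):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
copy_stream = torch.cuda.Stream()

# Pinned (page-locked) host memory enables truly asynchronous H2D copies.
host_batch = torch.randn(32, 64, 56, 56).pin_memory()
dev_batch = torch.empty_like(host_batch, device="cuda")

# Issue the copy on a side stream so it can overlap with compute on the default stream.
with torch.cuda.stream(copy_stream):
    dev_batch.copy_(host_batch, non_blocking=True)

# Compute on a different region of memory while the copy is in flight.
other = torch.randn(32, 64, 56, 56, device="cuda")
y = conv(other)

# Make the default stream wait for the copy before using the new batch.
torch.cuda.current_stream().wait_stream(copy_stream)
y2 = conv(dev_batch)
torch.cuda.synchronize()
```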

Hope this helps you understand a bit of what’s under the hood.

Regards,
Leon

Thank you @luis.leon for your valuable comment.

I have found that the main reason the PyTorch kernels are faster is the GPU “warmup” effect. PyTorch runs the whole CNN model training, so the GPU starts to run faster (by raising its clock rates, I believe). The benchmark, on the other hand, runs short tasks that time individual training operations, which have little effect on the GPU clocks.
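
To account for this in the benchmark, running a number of untimed warm-up iterations before measuring helps. Here is a minimal PyTorch sketch (the shapes and iteration counts are arbitrary):

```python
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

# Untimed warm-up iterations let the GPU clocks ramp up before measurement.
for _ in range(50):
    conv(x)
torch.cuda.synchronize()

# Time with CUDA events to measure GPU time rather than host time.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    conv(x)
end.record()
torch.cuda.synchronize()
print(f"avg per iteration: {start.elapsed_time(end) / 100:.3f} ms")
```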