I have noticed a strange problem when rerunning my code on my RTX 3090. The last run was about half a year ago, and now the same CUDA code, launched with the same script, is on average about 1/3 slower than the results I obtained back then.
Interestingly, the datasets with the longest execution times are still roughly in line with the previous results. It is the datasets that previously ran very fast that have become slower.
Could this be caused by variation in the kernel launch overhead?
I am timing the kernels with CUDA events (cudaEventSynchronize() on the stop event).
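For context, a minimal sketch of the event-based timing I am describing (`myKernel`, its launch configuration, and the data size are placeholders, not my actual kernels):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record an event on each side of the launch, block on the stop event,
    // then read back the elapsed GPU time.
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Timing an empty kernel the same way would give a baseline for the pure launch overhead, if that turns out to be the question.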
For one kernel? That is a lot. If the difference only shows up for the small kernels and not for the larger ones, my guess would be power management: the GPU downclocks while idle, and a short kernel can finish before the clocks ramp back up, while a long-running kernel spends most of its time at full clocks.
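You can check this directly by reading the current clocks while the application is idle versus running, either with nvidia-smi or programmatically via NVML. A minimal NVML sketch (assuming the 3090 is device 0; link with -lnvidia-ml):

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
    // Query the current SM and memory clocks; if the card has dropped into an
    // idle power state, these will be far below the boost clocks.
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // assumption: the 3090 is device 0

    unsigned int smClock = 0, memClock = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClock);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClock);
    printf("current SM clock: %u MHz, memory clock: %u MHz\n", smClock, memClock);

    nvmlShutdown();
    return 0;
}
```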
Does the average improve if you run the same small kernel in a loop for a few seconds?
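Something like the sketch below would show it, reusing the same placeholder kernel as above: launch it repeatedly and compare the first timing against the steady-state one.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the same small kernel many times back to back. If the first
    // iterations are much slower than the later ones, the GPU was likely
    // sitting at idle clocks and ramped up once the loop kept it busy.
    const int iters = 10000;
    float first = 0.0f, last = 0.0f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (i == 0) first = ms;
        last = ms;  // keep the most recent (steady-state) timing
    }
    printf("first iteration: %.3f ms, last iteration: %.3f ms\n", first, last);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```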