What are possible reasons for increased kernel launch overhead for fast, small kernels?

I have noticed a strange problem: I ran the same code on my RTX 3090, about half a year after the previous run, and found that the performance results are on average about 1/3 slower than the results obtained half a year ago. I am running the same CUDA code with the same script.

Actually, for the datasets that lead to the longest execution times, the results are relatively stable compared with the previous ones. However, the datasets that previously ran very fast are now slower.

Could this be caused by variation in the kernel launch overhead?
The timer I am using for the kernels is based on CUDA events (cudaEventRecord()/cudaEventSynchronize()/cudaEventElapsedTime()).
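Roughly, the timing looks like the sketch below (the kernel and launch configuration are placeholders, not my actual code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the actual small kernel being timed.
__global__ void smallKernel() { }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    smallKernel<<<1, 32>>>();      // placeholder launch configuration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```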

What approximate running times do your small kernels have?
Are we talking about 12µs instead of 9µs or 400µs instead of 300µs?

Things that could have changed:

  • Cooling efficiency; remove any dust from the fans
  • Driver version
  • Operating system version / kernel version
  • Any other hardware upgrades? Especially ones that could change the available PCIe lanes
  • Any additional software running in the background and taking CPU (or GPU) resources?

I mean it is at the ms level.
My current driver version is 545.23.08.
I cannot remember the version number of the driver I was using half a year ago.

For one kernel? That is a lot. If there is a difference between small and larger kernels, my guess would be that it has to do with power modes and downclocking during idle.

Does it improve on average, if you run the same small kernel in a loop for a few seconds?
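For example, something along these lines (a minimal sketch; smallKernel stands in for your own kernel and launch configuration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the small kernel whose timing got slower.
__global__ void smallKernel() { }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the same kernel repeatedly. If the GPU was downclocked while idle,
    // the per-launch time should drop after the first iterations as the clocks ramp up.
    for (int i = 0; i < 10000; ++i) {
        cudaEventRecord(start);
        smallKernel<<<1, 32>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (i % 1000 == 0)
            printf("iteration %5d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```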

Yes, for one kernel.
How do I set the power modes and prevent downclocking during idle?
Could you please share some commands?
