I have noticed a strange problem when rerunning my code on my RTX 3090. The last run was about half a year ago, and now the same CUDA code, launched with the same script, is on average about 1/3 slower than the results I obtained back then.
Interestingly, the datasets with the longest execution times are still roughly in line with the previous results. It is the datasets that previously ran very fast that have become slower.
Could this be caused by variation in the kernel launch overhead?
I am timing the kernels with CUDA events (cudaEventSynchronize() on the stop event).
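For context, a minimal sketch of the event-based timing I am describing (`myKernel`, its launch configuration, and the data size are placeholders, not my actual kernels):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record an event on each side of the launch, block on the stop event,
    // then read back the elapsed GPU time.
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Timing an empty kernel the same way would give a baseline for the pure launch overhead, if that turns out to be the question.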
For one kernel? That is a lot. If the difference only shows up for the small kernels and not for the larger ones, my guess would be power management: the GPU downclocks while idle, and a short kernel can finish before the clocks ramp back up, while a long-running kernel spends most of its time at full clocks.
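You can check this directly by reading the current clocks while the application is idle versus running, either with nvidia-smi or programmatically via NVML. A minimal NVML sketch (assuming the 3090 is device 0; link with -lnvidia-ml):

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
    // Query the current SM and memory clocks; if the card has dropped into an
    // idle power state, these will be far below the boost clocks.
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // assumption: the 3090 is device 0

    unsigned int smClock = 0, memClock = 0;
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClock);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &memClock);
    printf("current SM clock: %u MHz, memory clock: %u MHz\n", smClock, memClock);

    nvmlShutdown();
    return 0;
}
```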
Does the average improve if you run the same small kernel in a loop for a few seconds?
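Something like the sketch below would show it, reusing the same placeholder kernel as above: launch it repeatedly and compare the first timing against the steady-state one.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Launch the same small kernel many times back to back. If the first
    // iterations are much slower than the later ones, the GPU was likely
    // sitting at idle clocks and ramped up once the loop kept it busy.
    const int iters = 10000;
    float first = 0.0f, last = 0.0f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (i == 0) first = ms;
        last = ms;  // keep the most recent (steady-state) timing
    }
    printf("first iteration: %.3f ms, last iteration: %.3f ms\n", first, last);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```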