I created a function in CUDA and put it inside a “for” loop to measure the actual time it takes. The first 5 iterations took only 30 ms each, while the iterations after that took more than 120 ms. I would like to know the reason for the difference.
I also wrote 4 functions in CUDA, each of which took no more than 20 ms when timed individually; the sum of the 4 individual times was only 35 ms, yet the actual total time for running the 4 functions was more than 130 ms. Could anyone help explain why?
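(For context, the timing was done with CUDA events and cudaEventElapsedTime around each call. The sketch below only shows the general structure of that kind of loop; the kernel, data size, and launch configuration are placeholders, not my actual code.)

```cpp
// Rough sketch of a per-iteration timing loop using CUDA events.
// "dummyKernel" and the sizes are placeholders for the real function under test.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 20; ++i) {
        cudaEventRecord(start);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // the function under test
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait until the kernel has actually finished
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iteration %d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```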
A few possibilities:

- errors in timing measurement
- GPU overheating, clocks slowing down significantly
- data-dependent code paths, resulting in extra work in later iterations
- other work being done on the GPU besides what is being measured
I imagine there are other possibilities as well. Guesswork like this is rarely on-target in my experience.
After making sure that there are no runtime errors being reported and no illegal activity (e.g., check with compute-sanitizer), a profiler (e.g., Nsight Systems) can usually help you understand the reasons for various performance observations.
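Regarding the first bullet, one thing worth doing before drawing conclusions is to make the measurement itself as robust as possible: do a few warm-up launches that are not timed, synchronize on the stop event before reading the elapsed time, and check for errors after each launch so that a failed kernel is not mistaken for a fast one. This is not your code, just a minimal sketch of that pattern with a placeholder kernel:

```cpp
// Minimal sketch of a more defensive event-timing pattern (placeholder kernel).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void workKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = sqrtf(data[idx]) + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launches: absorb one-time costs (context creation, lazy module
    // loading, clock ramp-up) so they do not land in the timed iterations.
    for (int i = 0; i < 5; ++i)
        workKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    for (int i = 0; i < 20; ++i) {
        cudaEventRecord(start);
        workKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);            // make sure the kernel really finished

        cudaError_t err = cudaGetLastError();  // catch launch/runtime errors
        if (err != cudaSuccess) {
            printf("iteration %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iteration %d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the per-iteration numbers still change drastically with a pattern like this, a timeline from Nsight Systems is usually the quickest way to see what the GPU is actually doing in the slow iterations.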
Thank you for your answer.
Functions like “cudaEventElapsedTime” were used for the timing measurements; I don’t know whether there is a better method.
Overheating could be the reason, but the room temperature was only 28 °C.
The input and output matrices for these functions were the same in every iteration, so the code paths were not data-dependent across iterations.
I hadn’t created any other threads or started any other calculations at the same time, so this was really weird to me.