I have a base kernel. When I wrap that base code in a loop inside the kernel, I get a different timing every time I run the kernel.
However, the timings are consistent from run to run when I use the base kernel without the loop inside. Is there a known issue with loops in CUDA? The number of iterations is constant.