I’m now developing a CUDA program that contains many loops. At first, I split the procedure into several small kernels and called them from the host in a loop. But today I tried moving those loops into the kernel, and I found it’s much faster than before. Is this result reliable? And why?
Anyway, although the speed increased, I find the code has become harder to understand.
It is possible that the kernel launches cost more time than the execution itself. Additionally, the L1 and L2 caches (on Fermi) need to be rebuilt for each kernel call.
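To illustrate the difference, here is a minimal sketch of the two patterns (kernel and function names are illustrative, not from the original post). Pattern A pays the launch latency once per iteration; Pattern B launches once and loops on the device. Note this transformation is only safe when the iterations don't require synchronization across thread blocks — a kernel launch boundary is also an implicit device-wide barrier:

```cuda
// Pattern A: one small kernel, launched from a host-side loop.
__global__ void step_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // one iteration's worth of work
}

// Pattern B: the loop moved inside the kernel, single launch.
__global__ void looped_kernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < iters; ++k)  // whole loop runs on the device
            data[i] += 1.0f;
}

void run(float *d_data, int n, int iters) {
    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Pattern A: pays launch overhead (and possibly cold caches) iters times.
    for (int k = 0; k < iters; ++k)
        step_kernel<<<grid, block>>>(d_data, n);

    // Pattern B: one launch overhead total.
    looped_kernel<<<grid, block>>>(d_data, n, iters);
}
```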
In my case, looping inside the kernel function resulted in a timeout error when N was too big. I moved the loop to the host function and no timeout occurred. I also got the same performance for small cases with the loop on the host.
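That timeout is typically the display watchdog killing a long-running kernel. A middle-ground sketch (names illustrative) is to keep the loop on the device but split the total iteration count into chunks launched from the host, so each individual kernel finishes well under the watchdog limit:

```cuda
__global__ void chunk_kernel(float *data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < iters; ++k)  // bounded amount of device-side work
            data[i] += 1.0f;
}

void run_chunked(float *d_data, int n, long total_iters, int chunk) {
    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (long done = 0; done < total_iters; done += chunk) {
        long left = total_iters - done;
        int iters = (int)(left < chunk ? left : chunk);
        chunk_kernel<<<grid, block>>>(d_data, n, iters);
        cudaDeviceSynchronize();  // each launch's runtime stays bounded
    }
}
```

Tuning the chunk size trades launch overhead against how close each kernel gets to the watchdog limit.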