Loops inside the kernel, or kernel calls looped from the host?

I’m developing a CUDA program that contains a lot of loops. At first I split the procedure into several separate kernels and called them from the host in a loop. Today I tried moving those loops into the kernel, and it is much faster than before. Is this result reliable? And why?

Anyway, although the speed increases, I find the code becomes hard to understand.

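[codebox]// Option 1: the loop runs inside the kernel
__global__ void device_func()
{
    for (int i = 0; i < N; ++i) {
        // do something
    }
}

// or

// Option 2: the loop runs on the host, one kernel launch per iteration
void host_func()
{
    for (int i = 0; i < N; ++i) {
        device_func<<<blocks, threads>>>();  // launch configuration not given in the original post
    }
}[/codebox]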

It is possible that launching the kernel takes more time than executing it…
Additionally, the L1 and L2 caches (on Fermi) need to be rebuilt for each kernel call.
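One way to check this on your own hardware is to time both variants with CUDA events. The following is only a minimal sketch: the kernels `step_once` and `step_many`, the helper `run_steps`, and the "add 1 per step" workload are placeholders invented for illustration, not the original poster's code. Note that the in-kernel version also keeps its value in a register between steps, so any difference you measure is not launch overhead alone.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for "do something": one step adds 1 to each element.
__global__ void step_once(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// The same steps, looped inside a single launch.
__global__ void step_many(float *data, int n, int steps)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];                        // read global memory once
        for (int i = 0; i < steps; ++i) v += 1.0f;
        data[idx] = v;                              // write back once
    }
}

// Runs `steps` steps on zeroed data, stores the elapsed time in *ms,
// and returns element 0 so the two variants can be compared.
float run_steps(bool loop_in_kernel, int n, int steps, float *ms)
{
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int threads = 256, blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    if (loop_in_kernel) {
        step_many<<<blocks, threads>>>(d, n, steps);   // one launch
    } else {
        for (int i = 0; i < steps; ++i)                // one launch per step
            step_once<<<blocks, threads>>>(d, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    float first = 0.0f;
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return first;
}

int main()
{
    float ms_host = 0.0f, ms_kernel = 0.0f;
    run_steps(false, 1 << 20, 1000, &ms_host);
    run_steps(true, 1 << 20, 1000, &ms_kernel);
    printf("loop on host:   %.3f ms\n", ms_host);
    printf("loop in kernel: %.3f ms\n", ms_kernel);
    return 0;
}
```

The smaller the amount of work per step, the larger the share of the total time that goes into the launches themselves.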

In my case, looping inside the kernel function resulted in a time-out error when N was too big. I moved the loop to the host function and no time-out occurred. For small cases I also got the same performance with the loop on the host.
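A possible middle ground, not something proposed in this thread, is to keep the loop in the kernel but cap the number of iterations per launch, so that each launch stays short while the number of launches stays small. The sketch below assumes the time-out applies per kernel launch; `run_chunk`, `run_chunked`, and the chunk size are hypothetical names and values chosen for illustration.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Runs `count` iterations of a placeholder workload in one launch.
__global__ void run_chunk(float *data, int n, int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = data[idx];
        for (int i = 0; i < count; ++i) v += 1.0f;  // placeholder for "do something"
        data[idx] = v;
    }
}

// Splits `total` iterations into launches of at most `chunk` iterations.
// Returns the number of launches issued.
int run_chunked(float *d, int n, int total, int chunk)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    int launches = 0;
    for (int done = 0; done < total; done += chunk) {
        int count = std::min(chunk, total - done);  // last chunk may be shorter
        run_chunk<<<blocks, threads>>>(d, n, count);
        ++launches;
    }
    cudaDeviceSynchronize();
    return launches;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    int launches = run_chunked(d, n, 10000, 1000);

    float first = 0.0f;
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("launches: %d, data[0] = %.0f\n", launches, first);
    cudaFree(d);
    return 0;
}
```

The chunk size would have to be tuned by hand so that a single launch stays well below the time limit on the target machine.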

Vanio, I guess your problem is related to the 5-second time-out that applies to kernels running on a device that is also used to render the display.
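Whether a given device is subject to such a run-time limit can be queried from the CUDA runtime. This is a small sketch using `cudaDeviceGetAttribute` with `cudaDevAttrKernelExecTimeout`; the helper name `kernel_timeout_enabled` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if kernels on device `dev` are subject to a run-time limit,
// 0 if they are not, and -1 if the query fails.
int kernel_timeout_enabled(int dev)
{
    int enabled = 0;
    if (cudaDeviceGetAttribute(&enabled, cudaDevAttrKernelExecTimeout, dev) != cudaSuccess)
        return -1;
    return enabled;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) count = 0;

    for (int dev = 0; dev < count; ++dev) {
        int r = kernel_timeout_enabled(dev);
        printf("device %d: run-time limit on kernels: %s\n",
               dev, r == 1 ? "yes" : (r == 0 ? "no" : "query failed"));
    }
    return 0;
}
```

If the attribute is reported as enabled, long-running kernels on that device risk being terminated, which would match the behaviour described above.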

Yes. I believe that is the reason :)
