Kernel execution in for loops

Sorry for another novice question, but I was wondering whether it is possible to execute a kernel in a for loop, N times:

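// kernel
__global__ void get_errorVector(float *d_Ev, float *d_Pl, float *d_f) {
    // Launched with a single block, so threadIdx.x is the element index.
    int i = threadIdx.x;

    d_Ev[i] = d_f[i] + d_Pl[i];

    // Block-wide barrier; not strictly needed here, since no thread reads
    // another thread's result afterwards.
    __syncthreads();
}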

Can this be repeated many times, like this:

for (int i = 0; i < N; i++)
    get_errorVector<<<1, 2*N+1>>>(d_Ev, d_Pl, d_f);

Thanks a lot for your help.

K

You can do that, but repeatedly calling the kernel function wastes a lot of time.
In my past experiments, putting the for loop inside the kernel was more efficient.
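For illustration, here is a minimal sketch of what moving the loop into the kernel could look like. The names `get_errorVector_looped` and `run_looped`, the extra `n` and `iterations` parameters, and the test data are all hypothetical and not from the original post; the loop body simply repeats the original addition, so a real algorithm would put its per-iteration work there.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of get_errorVector: the repeat count is passed in,
// so one launch replaces N launches from the host.
__global__ void get_errorVector_looped(float *d_Ev, const float *d_Pl,
                                       const float *d_f, int n, int iterations)
{
    int i = threadIdx.x;
    if (i >= n) return;                    // guard against extra threads
    for (int it = 0; it < iterations; ++it) {
        d_Ev[i] = d_f[i] + d_Pl[i];        // per-iteration work goes here
    }
}

// Host helper (hypothetical): fills f[k] = k and Pl[k] = 2k, runs the kernel
// once with N in-kernel iterations, and returns Ev.
std::vector<float> run_looped(int N)
{
    const int n = 2 * N + 1;
    // A single block is limited to 1024 threads on compute capability 2.0+
    // (512 on older parts such as compute capability 1.x).
    assert(n <= 1024);

    std::vector<float> h_f(n), h_Pl(n), h_Ev(n, 0.0f);
    for (int k = 0; k < n; ++k) { h_f[k] = float(k); h_Pl[k] = 2.0f * k; }

    const size_t bytes = n * sizeof(float);
    float *d_Ev = nullptr, *d_Pl = nullptr, *d_f = nullptr;
    cudaMalloc(&d_Ev, bytes);
    cudaMalloc(&d_Pl, bytes);
    cudaMalloc(&d_f, bytes);
    cudaMemcpy(d_f, h_f.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Pl, h_Pl.data(), bytes, cudaMemcpyHostToDevice);

    get_errorVector_looped<<<1, n>>>(d_Ev, d_Pl, d_f, n, N);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(h_Ev.data(), d_Ev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_Ev);
    cudaFree(d_Pl);
    cudaFree(d_f);
    return h_Ev;
}
```

This only works as written when each thread's iterations are independent of other threads, or when all cooperating threads sit in one block and use `__syncthreads()` between iterations.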

Thanks for your reply, Quoc. For some reason I get errors when executing the kernel in the for loop. I’ll try to see if the for loop can go in the kernel! It might not be possible with my algorithm.
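When a launch in a loop fails, checking the error state after each launch usually narrows things down. Below is a rough sketch; `noop_kernel` and `launch_checked` are made-up names for illustration. One possible cause in the posted code, which is only a guess since the actual error was not given, is `2*N+1` exceeding the per-block thread limit, which makes the launch itself fail.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical) standing in for the real one.
__global__ void noop_kernel() {}

// Launches one block with the given number of threads and reports any error.
cudaError_t launch_checked(int threads_per_block)
{
    noop_kernel<<<1, threads_per_block>>>();

    // Errors in the launch configuration show up here immediately.
    cudaError_t err = cudaGetLastError();

    // Errors raised while the kernel runs show up once it has finished.
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return err;
}
```

Calling `launch_checked(4096)` should report a configuration error, because no current device allows 4096 threads in one block, while `launch_checked(32)` should succeed.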

I just had to switch to putting a loop within the kernel; when the loop was on the CPU it was way too slow. They say that kernel invocation is not expensive, but on the GTX260M / Win7/64 it seems to be much more expensive than looping on the GPU when the kernel is launched in a tight loop. The problem I am now having is how to synchronize all the thread blocks within the kernel. There seem to be some academic papers on this, but my efforts have not succeeded.

Kernel invocation is not expensive, but don’t you think running a for loop many times is a time-consuming process? And can’t we use __syncthreads() to synchronise the thread blocks? What do you think about this? By the way, I am not a CUDA expert.

__syncthreads() is a block-wide sync, not kernel-wide.

Kernel invocation has an overhead of tens of microseconds, which is negligible unless each kernel call is very short. Pushing a for loop from the host onto the device can be beneficial for very short kernels, but for longer ones it doesn’t necessarily make any difference.

Shooting for a kernel duration of > 10 milliseconds makes the launch overhead negligible, and keeping it < 2 seconds ensures you don’t trip the watchdog timer.
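To see what the launch overhead is on a particular machine, one option is to time a batch of empty kernel launches with CUDA events. This is a sketch only: `empty_kernel` and `average_launch_ms` are hypothetical names, and back-to-back launches measure launch throughput, which can be lower than the latency of a single isolated launch. The figure will vary with GPU, driver, and operating system.

```cuda
#include <cuda_runtime.h>

// Does no work, so the measured time is almost entirely launch overhead.
__global__ void empty_kernel() {}

// Returns the average time per launch in milliseconds.
float average_launch_ms(int launches)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, excluded from the timing.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until all launches have run

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / launches;
}
```

Comparing the returned value against the run time of the real kernel shows whether the overhead matters: if the kernel runs for milliseconds, a per-launch cost in the microsecond range is lost in the noise.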

You mean I can synchronise the threads belonging to the same block, but it is not possible to sync threads that belong to different blocks or to different kernel invocations? OK! But then is there any way to sync threads kernel-wide?

That’s correct.

There’s no easy way to sync threads kernel-wide other than to wait until the kernel completes. There’s no elegant or fast method to do this from within the kernel. There are some hacks that one could try, but I don’t really know them.
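Waiting for the kernel to complete can be done simply by launching the next step as a separate kernel: launches issued to the same stream run in order, so the boundary between two launches acts as a barrier across all blocks. The sketch below assumes a toy computation that is not from this thread (each element reads a neighbour that may live in a different block), and the names `shift_step` and `run_steps` are hypothetical.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// One step: each element reads its right-hand neighbour (wrapping around),
// which may have been written by a different block in the previous step.
__global__ void shift_step(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i + 1) % n] + 1.0f;
}

// Starts from data[k] = k and applies `steps` steps, one launch per step.
std::vector<float> run_steps(int n, int steps)
{
    std::vector<float> h(n);
    for (int k = 0; k < n; ++k) h[k] = float(k);

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < steps; ++s) {
        // All blocks of the previous launch have finished before this one
        // starts, because both are in the default stream.
        shift_step<<<blocks, threads>>>(d_b, d_a, n);
        std::swap(d_a, d_b);           // this step's output feeds the next
    }

    // After the final swap, d_a points at the latest result.
    cudaMemcpy(h.data(), d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    return h;
}
```

The two buffers are swapped rather than updated in place so that no block overwrites a value another block still needs to read within the same step. The cost is one launch per step, which matters only when each step is very short.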