kernel execution in FOR loops

kartrenaka · December 28, 2009, 2:49am

Sorry for another novice question, but was wondering if it is possible to execute a kernel in a for loop, N times:

// kernel
__global__ void get_errorVector(float *d_Ev, float *d_Pl, float *d_f) {
    int i = threadIdx.x;

    d_Ev[i] = d_f[i] + d_Pl[i];

    __syncthreads();
}

can this be repeated many times like this:

for (int i = 0; i < N; i++)
    get_errorVector<<<1, 2*N+1>>>(d_Ev, d_Pl, d_f);

Thanks a lot for your help.

K

Quoc_Vinh · December 28, 2009, 9:17am

You can do that, but launching the kernel over and over wastes a lot of time.
In my past experiments, putting the for loop inside the kernel was more efficient.
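A minimal sketch of what moving the loop into the kernel could look like, assuming each thread only ever touches its own element so no barrier is needed between iterations. The name `get_errorVector_looped`, the `n` and `iterations` parameters, and the host helper `launch_looped` are hypothetical and not from the original post; the loop body is a stand-in for whatever per-iteration update the real algorithm performs.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of get_errorVector: the repeat count moves into the kernel.
__global__ void get_errorVector_looped(float *d_Ev, const float *d_Pl,
                                       const float *d_f, int n, int iterations)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // guard threads past the end of the arrays

    float ev = 0.0f;
    for (int it = 0; it < iterations; ++it) {
        // Each thread reads and writes only element i, so no barrier is needed here.
        ev = d_f[i] + d_Pl[i];
    }
    d_Ev[i] = ev;
}

// Hypothetical host helper: one launch replaces `iterations` separate launches.
cudaError_t launch_looped(float *d_Ev, const float *d_Pl, const float *d_f,
                          int n, int iterations)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    get_errorVector_looped<<<blocks, threads>>>(d_Ev, d_Pl, d_f, n, iterations);
    cudaError_t err = cudaGetLastError();  // reports an invalid launch configuration
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // waits for the kernel and reports runtime errors
}
```

This only works as a drop-in replacement when iterations are independent across threads. If one iteration needs results written by threads in other blocks, the loop cannot simply be moved inside a single kernel.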

kartrenaka · December 28, 2009, 2:11pm

Thanks for your reply Quoc. For some reason I get errors when executing the kernel in the for loop. I’ll try to see if the for loop can go in the kernel! It might not be possible with my algorithm.
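The thread never says what the errors were. One possible cause, offered only as a guess, is that `<<<1, 2*N+1>>>` asks for more threads in a single block than the hardware allows (512 on GPUs of that era, 1024 on later ones), which makes the launch fail. The sketch below shows a host-side loop that spreads the work over several blocks and checks for errors after every launch. The `n` parameter, the bounds check, and the helper `run_repeated` are assumptions added for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same computation as the original kernel, with a global index and a bounds check.
__global__ void get_errorVector(float *d_Ev, const float *d_Pl,
                                const float *d_f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_Ev[i] = d_f[i] + d_Pl[i];
}

// Hypothetical helper: launches the kernel `repeats` times and stops at the first error.
cudaError_t run_repeated(float *d_Ev, const float *d_Pl, const float *d_f,
                         int n, int repeats)
{
    const int threads = 256;                         // stays under the per-block limit
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    for (int r = 0; r < repeats; ++r) {
        get_errorVector<<<blocks, threads>>>(d_Ev, d_Pl, d_f, n);
        cudaError_t err = cudaGetLastError();        // catches invalid launch configurations
        if (err != cudaSuccess) {
            fprintf(stderr, "launch %d failed: %s\n", r, cudaGetErrorString(err));
            return err;
        }
    }
    return cudaDeviceSynchronize();  // surfaces errors raised while the kernels ran
}
```

For the original call, `n` would be `2*N+1` and `repeats` would be `N`.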

diffent · January 5, 2010, 6:22pm

I just had to switch to putting a loop within the kernel…when the loop was on the CPU it was way too slow. They say that kernel invocation is not expensive, but on the GTX260M / Win7/64 it seems to be much more expensive than running on the GPU when in a tight loop. The problem I am now having is how to synchronize all the thread blocks within the kernel…there seem to be some academic papers on this, but my efforts have not succeeded…

cudacuda2009 · January 8, 2010, 11:24am

Kernel invocation is not expensive, but don’t you think running a for loop many times is a time-consuming process? And can’t we use __syncthreads() to synchronise the thread blocks? What do you think about this? By the way, I am not a CUDA expert.

_Big_Mac · January 8, 2010, 2:29pm

__syncthreads() is a block-wide sync, not kernel-wide.

seibert · January 8, 2010, 5:15pm

Kernel invocation has an overhead of tens of microseconds, which is negligible unless each kernel call is very short. Pushing a for loop from the host onto the device can be beneficial for very short kernels, but for longer ones it doesn’t necessarily make any difference.

Shooting for a kernel duration of > 10 milliseconds makes the launch overhead negligible, and keeping it < 2 seconds ensures you don’t trip the watchdog timer.
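To see where a particular system falls, the per-launch cost can be estimated with CUDA events. The following is a rough sketch, not a rigorous benchmark: `empty_kernel` and `average_launch_ms` are hypothetical names, and because launches are asynchronous the figure reflects back-to-back launch throughput rather than the latency of a single isolated launch.

```cuda
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

// Returns the average milliseconds per launch of an empty kernel, or -1 on error.
float average_launch_ms(int launches)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    empty_kernel<<<1, 1>>>();   // warm-up launch, excluded from the timing
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait until every timed launch has finished

    float total_ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? total_ms / launches : -1.0f;
}
```

Comparing this number with the measured duration of the real kernel indicates whether the launch overhead is worth worrying about.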

cudacuda2009 · January 9, 2010, 6:59am

You mean I can synchronise the threads belonging to the same block, but it is not possible to sync threads that come from different invocations of a kernel? OK! But then is there any way to sync threads kernel-wide?

_Big_Mac · January 9, 2010, 2:34pm

That’s correct.

There’s no easy way to sync threads kernel-wide other than to wait until the kernel completes. There’s no elegant or fast method to do this from within the kernel. There are some hacks one could try, but I don’t really know them.
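Waiting for the kernel to complete can be sketched as follows: kernels launched into the same stream run in order, so the boundary between two launches acts as a barrier across all blocks. The kernel `smooth_step`, the helper `run_steps`, and the ping-pong buffers are hypothetical, chosen only because each step reads values written by other blocks in the previous step.

```cuda
#include <cuda_runtime.h>

// Hypothetical step: each output depends on neighbours that another block may have written.
__global__ void smooth_step(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = in[i > 0 ? i - 1 : i];      // clamp at the left edge
    float right = in[i < n - 1 ? i + 1 : i];  // clamp at the right edge
    out[i] = (left + in[i] + right) / 3.0f;
}

// Runs `steps` launches; the result ends up in d_a if steps is even, in d_b if odd.
cudaError_t run_steps(float *d_a, float *d_b, int n, int steps)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < steps; ++s) {
        // A launch does not start until the previous one in the same stream has
        // finished, so all blocks see the complete output of the previous step.
        smooth_step<<<blocks, threads>>>(d_a, d_b, n);
        float *tmp = d_a; d_a = d_b; d_b = tmp;  // swap input and output buffers
    }
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

No host-side synchronization is needed between the launches. CUDA releases much later than this thread added cooperative groups, which provide a grid-wide barrier inside a kernel on supported hardware; that is not shown here.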
