Conditional in a loop in the kernel: execution stops when condition is true

Can a conditional be embedded in a for loop in the kernel?

If so how is it done?

The "Control Flow Instructions" section of the CUDA Programming Guide Version 2.1 leads me to believe I can use "normal" C flow control in a kernel. But when I try to nest a conditional within a for loop, execution stops the first time the conditional is true in any thread.

In more detail:

I have an array of independent frames of data which share the same processing.

I process one frame in each thread using a for loop to cycle through all elements of the frame.

The current element processing depends on the previous value of the element.

Within this loop there is a conditional.

If the element exceeds a limit, it is reset back to a starting value.

Each thread (or frame) reaches the limit independently of the others.

The first frame to reach the limit causes execution to cease with the message:

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <>, line 476 : unspecified launch failure.

If I move the conditional to after the for loop, thus checking at the end of the frame processing rather than after each element's processing, it works. The values increase, each at its own rate. When one exceeds the limit it is reset.

Note: the kernel is launched multiple times. The data is in global memory and processing picks up where it left off. It takes multiple launches before any frame reaches the limit.

Thanks for any help you can offer.

[codebox]
const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

for ( int i = 0; i < N_FRAME; i++ ) {

    // *** Conditional In Loop ***
    if ( Data[ tid * N_DATA + ACCUMULATOR_OFFSET ] >= Data[ tid * N_DATA + LIMIT_OFFSET ] ) {
        Data[ tid * N_DATA + ACCUMULATOR_OFFSET ] -= 1024.00f;
    }
}

// *** Conditional After Loop ***
if ( Data[ tid * N_DATA + ACCUMULATOR_OFFSET ] >= Data[ tid * N_DATA + LIMIT_OFFSET ] ) {
    Data[ tid * N_DATA + ACCUMULATOR_OFFSET ] -= 1024.00f;
}
[/codebox]


An unspecified launch failure is an out-of-bounds global memory read/write 99% of the time.
I see no dependence on 'i' in the loop code, so I guess the problem is in the indexing macros. Check for that.

There is no problem whatsoever with flow control within a loop.

To which I'll add: run in emudebug mode with valgrind. Lots of others before me have suggested this, and I believe a custom suppression file is available somewhere on the forum. You might have to recompile valgrind to raise its thread limit.

I want to thank everyone who responded to my post. I have solved my memory issues.
I also found that I had another problem.

In case anyone is interested: on the first kernel launch the memory seemed to contain random data, even though I had initialized it, while subsequent launches worked. (I use the data stored from the previous launch in the next one.) I use the JACK Audio Connection Kit to patch audio streams to and from the sound card as well as between applications. My code registers a callback function with JACK, which calls it whenever the buffer needs refilling. CUDA saw this call as coming from a different thread, so it gave it a different context. I had to move all the initialization into the function JACK calls, using a flag to do the setup on the first call only.

Thanks again for pointing me to my memory problems.