Hi all,
I’m implementing a parallel version of a simple search algorithm and I need a way to stop the execution of the entire kernel if a thread finds a match.
I need to exit the kernel, but I don’t want to clean up resources, because I still need to check the boolean value for the match.
Consider that the host thread is blocked waiting for the completion of the kernel.

Best regards,

Synchronization or communication between blocks isn’t something that CUDA can do easily, but you might be able to achieve what you are looking for by using global memory atomic operations on a status variable.
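As a rough sketch of what that could look like (the names `d_found`, `search_kernel`, `data` and `target` are made up for illustration, not from your code):

```cuda
__device__ int d_found = 0;  // global status flag: 0 = keep searching

__global__ void search_kernel(const int *data, int n, int target)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
        return;

    // Bail out early if another thread already reported a match.
    // atomicAdd with 0 is just an atomic read of the flag; note that
    // other blocks may still run for a while before they see it.
    if (atomicAdd(&d_found, 0) != 0)
        return;

    if (data[idx] == target)
        atomicExch(&d_found, 1);  // publish the match to all blocks
}
```

Global-memory atomics on 32-bit integers are available from compute capability 1.1, so this should work even on older cards.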

If you’re just trying to end the search when you find a valid result, I’d just run the kernel over and over again from the host, each time checking whether a valid result was found. If a valid result was found, stop running the kernels; if it wasn’t, continue.
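A host-side sketch of that relaunch loop (all names here — `search_kernel`, `d_data`, `d_found`, `num_chunks` — are placeholders):

```cuda
// Relaunch the kernel over slices of the input until a match is found.
// d_found is assumed to be a cudaMalloc'd int initialized to 0.
int h_found = 0;
for (int chunk = 0; chunk < num_chunks && !h_found; ++chunk) {
    // Search one slice of the input per launch.
    search_kernel<<<blocks, threads>>>(d_data, chunk, d_found);
    // Only a 4-byte copy back per launch, so it stays cheap even on a
    // slow bus; the copy also acts as an implicit synchronization point.
    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
}
```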


This way I’d probably get too much overhead: move data from host to device, run the kernel, move the result from device to host, too many times…

I also have to minimize transfers because I run on an old PCI bus…


I don’t need atomic (consistent) access to memory. I need to stop the execution… am I interpreting it badly?

Right now I’m checking, for each thread at the beginning of the kernel body, whether the match was already found in past executions.

Here too I don’t need synchronization, because I don’t need any consistency.

Best regards


The main problem you have is that global memory writes are “fire and forget” and have a lot of latency. It could be a few thousand clock cycles before the signal from a running thread, indicating to the others that they should stop, actually gets written into global memory for others to see. In that time, other threads continue to run, and new blocks might also be scheduled.

If your algorithm requires an “accurate” termination, then the only way to achieve it is through a combination of block-level synchronization (which is easy) and atomic access to the global location holding the terminate flag.

If you don’t care about accurate termination, then your current method is about the only “clean” way to do it. A dirty way to do it would be to have the “winning” thread do something absolutely guaranteed to generate a kernel abort, like accessing a totally invalid memory address, which will make the entire kernel crash.
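For what it’s worth, the “dirty” abort might be sketched like this (all names are hypothetical, and this is deliberately unsupported behaviour: the host sees a launch error, and the context may be left in a bad state):

```cuda
// Hypothetical sketch of the deliberate-crash trick. The winning thread
// records its result, then performs an invalid write to kill the launch.
__global__ void search_and_abort(const int *data, int n, int target,
                                 int *result)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] == target) {
        result[0] = idx;            // record the match first
        __threadfence();            // try to push the write to global memory
        *((volatile int *)0) = 0;   // deliberate invalid write: aborts the kernel
    }
}
```

Even with the `__threadfence()`, there is no guarantee the result write lands before the abort, which is part of why this is “dirty”.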

Hi, I don’t really require accurate termination, I only need maximum performance… I think synchronized access with fast termination of the threads can be compared against unsynchronized access with delayed termination… I’ll do some tests.

Best regards


Hi, I’ve used atomicAdd and it’s better because I no longer have uncoalesced stores… the performance is the same.
Now I have to remove some uncoalesced loads…
Here is the pseudocode of the kernel

// index into the array of coords of the cluster of ESTs
int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (bs(d_el, d_cl[idx], 0, sizeE - 1))
    atomicAdd(test, 1);

d_el is the array in which I need to search for the value that each thread loads from d_cl[idx].
So I’m executing many binary searches in parallel.
I think the accesses to d_el are causing the uncoalesced loads that I see…
Maybe I can use shared memory?
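For reference, a device-side binary search along those lines might look like this (a sketch only; your actual `bs` signature may differ):

```cuda
// Iterative binary search over a sorted int array in global memory.
// Returns true if 'key' is present in d_el[lo..hi].
__device__ bool bs(const int *d_el, int key, int lo, int hi)
{
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int v = d_el[mid];   // data-dependent global load: inherently scattered
        if (v == key)
            return true;
        if (v < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}
```

Because each thread follows its own path through d_el, these loads can never be fully coalesced. If d_el is small enough (shared memory is 16 KB per block on compute capability 1.x), one option is to stage it, or at least the first few levels of the search tree, in shared memory; another is to bind d_el to a texture so the reads go through the texture cache.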

best regards

Sorry, I forgot to say that I’m on a card with compute capability 1.1 (9400GT); bs is of course a device function called from the kernel, and the array is an array of int…

best regards


Hi, it’s the first time I’ve posted on the NVIDIA forum, so this is probably the wrong place for this problem, but here it goes anyway.

I recently installed Windows 7 on my computer. I have a GeForce 9500 GT and have updated it to the latest driver for Windows 7 32-bit, but I keep getting random graphics crashes caused by a kernel-mode driver problem.

Please, can anyone help me?

Thanks, guys