How to implement semaphores in CUDA? I tried with atomicExch(), but got a run-time error


I tried to implement a semaphore in CUDA, in order to get a common global variable updated by all threads, with the following code:

unsigned int green=1;
while( green==1 ) {
	green = atomicExch(&SEMAPHORE, 1);
}	// Wait for green light...
MAE += (float)fabs(Error);
RMSE += (float)(Error*Error);
ENF ++;
atomicExch( &SEMAPHORE, 0 );	// Set semaphore back to GREEN

but I get a run-time error as soon as I run it with more than one thread.

In this code, the variable “green” is set to 0 only if SEMAPHORE was 0; all other threads, finding SEMAPHORE=1, keep polling.

Any ideas?

What is the right way to implement a semaphore (contended by all threads) in CUDA?

Thank you in advance,


Don’t do that. Really, don’t do that. It’s a terrible idea from a performance and “don’t deadlock the card” standpoint.

OK. I know, it’s a terrible idea for performance. But it should run, shouldn’t it?

Anyway, how else should I do it then?

I have to update a matrix of floats in parallel and in an irregular way (ray tracing).

How should I approach this update?
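One way to avoid the lock entirely is to let the hardware do each update atomically. Below is a minimal sketch, assuming a GPU that supports atomicAdd() on float (compute capability 2.0 or later) and managed memory. The kernel name accumulateErrors and the test data are made up for illustration; MAE, RMSE and ENF mirror the variables in the question. The same atomicAdd() call should also work on a single matrix element, e.g. &matrix[row*width + col].

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread adds its own contribution to three
// global accumulators with hardware atomics, so no lock is needed.
__global__ void accumulateErrors(const float* errors, int n,
                                 float* MAE, float* RMSE, unsigned int* ENF)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float e = errors[i];
    atomicAdd(MAE, fabsf(e));   // float atomicAdd: assumes compute capability 2.0+
    atomicAdd(RMSE, e * e);     // sum of squares; take the root on the host later
    atomicAdd(ENF, 1u);         // number of contributions
}

int main()
{
    const int n = 4096;
    float *errors, *MAE, *RMSE;
    unsigned int* ENF;

    // Managed memory keeps the sketch short (assumption: supported by the device).
    cudaMallocManaged(&errors, n * sizeof(float));
    cudaMallocManaged(&MAE, sizeof(float));
    cudaMallocManaged(&RMSE, sizeof(float));
    cudaMallocManaged(&ENF, sizeof(unsigned int));

    for (int i = 0; i < n; ++i) errors[i] = (i % 2 == 0) ? 0.5f : -0.5f;
    *MAE = 0.0f; *RMSE = 0.0f; *ENF = 0u;

    accumulateErrors<<<(n + 255) / 256, 256>>>(errors, n, MAE, RMSE, ENF);
    cudaDeviceSynchronize();

    printf("sum|e| = %f, sum e^2 = %f, count = %u\n", *MAE, *RMSE, *ENF);

    cudaFree(errors); cudaFree(MAE); cudaFree(RMSE); cudaFree(ENF);
    return 0;
}
```

All threads hit the same three addresses here, so this is the worst case for contention. If that turns out to be too slow, reducing within each block first and issuing one atomicAdd() per block is a common next step.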


I’d also be interested in how to properly implement a semaphore/mutex with atomic instructions in CUDA. I need to update a global data structure for something like 0.1% or less of the pixels in a megapixel frame, and I suspect atomics would be more efficient than collecting the results using a scan operation. How expensive are atomic operations?
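For what it’s worth, here is a sketch of the lock pattern that is usually suggested. The spin loop in the original question can hang because the threads of a warp execute together: the thread that wins the lock may end up waiting for its warp-mates to leave the loop, while they keep spinning on a lock that is never released. Putting the critical section and the release inside the same branch as the successful acquire is meant to avoid that. Treat it as an assumption rather than a guarantee: on GPUs older than Volta, whether it terminates can depend on how the compiler lays out the divergent branch. The names mutex and lockedUpdate are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int mutex = 0;   // 0 = free, 1 = held (hypothetical global lock)

// Each thread adds 1 to *sum and *count inside a critical section.
// The shared data is accessed through volatile pointers so the values
// are re-read from memory rather than kept in registers.
__global__ void lockedUpdate(volatile float* sum, volatile unsigned int* count)
{
    bool done = false;
    while (!done) {
        // Try to take the lock; only the thread that sees 0 gets in.
        if (atomicCAS(&mutex, 0, 1) == 0) {
            __threadfence();          // order the acquire before the reads below
            *sum   = *sum + 1.0f;     // critical section
            *count = *count + 1u;
            __threadfence();          // make the writes visible before releasing
            atomicExch(&mutex, 0);    // release inside the same branch
            done = true;
        }
    }
}

int main()
{
    float* sum;
    unsigned int* count;
    cudaMallocManaged(&sum, sizeof(float));
    cudaMallocManaged(&count, sizeof(unsigned int));
    *sum = 0.0f;
    *count = 0u;

    lockedUpdate<<<4, 64>>>(sum, count);   // 256 threads contend for one lock
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("sum = %f, count = %u\n", *sum, *count);
    cudaFree(sum);
    cudaFree(count);
    return 0;
}
```

Even when it works, every thread is serialized through one lock, so this is only worth considering when the critical section cannot be expressed with plain atomic operations.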


If no two threads write to the same memory location, I believe atomics are not that expensive. I have never really benchmarked it, though.

Scan is very fast btw. You probably have to try both implementations and benchmark it for your specific case.
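For comparison, a sketch of the scan-style alternative: collect the indices of the few pixels that need an update into a dense list, then process that list separately. This assumes the Thrust library that ships with the CUDA toolkit (it is not mentioned anywhere in this thread), and uses its copy_if() for the compaction step, which to my knowledge is scan-based internally. The predicate IsFlagged and the test data are hypothetical.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Hypothetical predicate: a pixel needs an update when its value is positive.
struct IsFlagged {
    __host__ __device__ bool operator()(float v) const { return v > 0.0f; }
};

int main()
{
    const int n = 1 << 20;                       // roughly a megapixel frame

    // Flag about 0.1% of the pixels on the host, then copy to the device.
    thrust::host_vector<float> h_pixels(n, 0.0f);
    for (int i = 0; i < n; i += 1000) h_pixels[i] = 1.0f;
    thrust::device_vector<float> pixels = h_pixels;

    // Output buffer sized for the worst case (every pixel flagged).
    thrust::device_vector<int> flagged(n);

    // Copy index i to the output whenever pixels[i] satisfies the predicate.
    thrust::counting_iterator<int> first(0);
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(first, first + n, pixels.begin(),
                        flagged.begin(), IsFlagged());

    int count = static_cast<int>(end - flagged.begin());
    printf("%d of %d pixels flagged\n", count, n);
    return 0;
}
```

Which of the two approaches is faster for a given workload still has to be measured.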

Someone benchmarked coalesced atomic operations at 1/4 the speed of normal coalesced writes, IIRC. I’m too lazy to run the benchmark myself right now. Then again, 98% of the time atomics are used in uncoalesced situations anyway, and who knows, the speed there may be little different.

Edit: I can confirm that atomics with few collisions are pretty efficient. I was just writing some code the other day for a “sort-of” histogram using atomics. With completely random data, the code ran in 0.5ms. With the data sorted so all ~20 items in each bin are in the same block, it ran in 4ms.

I seem to recall that atomics are always uncoalesced.