How to implement semaphores in CUDA? I tried with atomicExch(), but got a Run-time error

Fede1 · January 30, 2009, 2:55pm

Hi,

I tried to implement a semaphore in CUDA, in order to get a common global variable updated by all threads, with the following code:

unsigned int green = 1;
while (green == 1) {
	green = atomicExch(&SEMAPHORE, 1);
}	// Wait for green light...
MAE += (float)fabs(Error);
RMSE += (float)(Error*Error);
ENF++;
atomicExch(&SEMAPHORE, 0);	// Set semaphore back to GREEN

but I get a run-time error as soon as I run it with multiple threads.

In this code, the variable “green” is set to 0 only if SEMAPHORE was 0; all other threads, finding SEMAPHORE=1, keep polling.

Any ideas?

What is the right way to implement a semaphore (contended by all threads) in CUDA?

Thank you in advance,

Federico
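One plausible cause of the failure described above: on GPUs where the threads of a warp execute in lockstep, a spin loop contended by threads of the same warp can hang, because the thread that holds the lock may never be scheduled past the loop to release it. The following is a minimal sketch of a lock that sidesteps this by letting only one thread per block contend, after the block has reduced its values in shared memory. The names `locked_sum`, `run_locked_sum`, and `BLOCK` are hypothetical, the input is filled with 1.0f purely for illustration, and this is one possible arrangement rather than a definitive implementation.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // threads per block (assumed power of two for the reduction)

// Hypothetical kernel: adds sum(|err[i]|) into *sum under a global spin lock.
// Only thread 0 of each block ever touches the lock.
__global__ void locked_sum(const float *err, int n, int *lock, volatile float *sum)
{
    __shared__ float partial[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? fabsf(err[i]) : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: partial[0] ends up with the block total.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        while (atomicCAS(lock, 0, 1) != 0) { }  // acquire: 0 = free, 1 = held
        __threadfence();                        // see the previous holder's write
        *sum = *sum + partial[0];               // critical section
        __threadfence();                        // publish the write before releasing
        atomicExch(lock, 0);                    // release
    }
}

// Hypothetical host helper: runs the kernel on n values of 1.0f and returns the sum.
float run_locked_sum(int n)
{
    std::vector<float> h(n, 1.0f);
    float *d_err = nullptr, *d_sum = nullptr;
    int *d_lock = nullptr;
    cudaMalloc(&d_err, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMalloc(&d_lock, sizeof(int));
    cudaMemcpy(d_err, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(float));
    cudaMemset(d_lock, 0, sizeof(int));

    locked_sum<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_err, n, d_lock, d_sum);

    float result = 0.0f;
    cudaMemcpy(&result, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_err);
    cudaFree(d_sum);
    cudaFree(d_lock);
    return result;
}
```

Calling `run_locked_sum(4096)` launches 16 blocks, each of which takes the lock exactly once. The lock is contended per block instead of per thread, so the number of serialized updates drops by a factor of `BLOCK`.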

tmurray · January 30, 2009, 5:15pm

Don’t do that. Really, don’t do that. It’s a terrible idea from a performance and “don’t deadlock the card” standpoint.

Fede1 · January 30, 2009, 5:55pm

OK. I know, it’s a terrible idea for performance. But it should run, shouldn’t it?

Anyway, how else should I do it then?

I have to update a matrix of floats in parallel and in an irregular way (ray tracing).

How should I approach this update?

Federico
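For an irregular update like the one described above, one lock-free option is to let each thread add its contribution directly with `atomicAdd`. This is a minimal sketch, assuming hardware of compute capability 2.0 or higher (where `atomicAdd` on `float` in global memory is available), which may not match the hardware in this thread. The names `scatter_add` and `run_scatter_add`, and the round-robin choice of target cell, are hypothetical and stand in for whatever the ray tracer actually computes.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: contribution i is added to matrix[cell[i]].
// Several threads may target the same cell; atomicAdd keeps the sum correct.
__global__ void scatter_add(float *matrix, const int *cell, const float *value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&matrix[cell[i]], value[i]);
}

// Hypothetical host helper: n contributions of 1.0f spread round-robin over `cells` cells.
std::vector<float> run_scatter_add(int n, int cells)
{
    std::vector<int> h_cell(n);
    std::vector<float> h_value(n, 1.0f);
    for (int i = 0; i < n; ++i) h_cell[i] = i % cells;

    float *d_matrix = nullptr, *d_value = nullptr;
    int *d_cell = nullptr;
    cudaMalloc(&d_matrix, cells * sizeof(float));
    cudaMalloc(&d_value, n * sizeof(float));
    cudaMalloc(&d_cell, n * sizeof(int));
    cudaMemset(d_matrix, 0, cells * sizeof(float));
    cudaMemcpy(d_value, h_value.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cell, h_cell.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scatter_add<<<(n + 255) / 256, 256>>>(d_matrix, d_cell, d_value, n);

    std::vector<float> result(cells);
    cudaMemcpy(result.data(), d_matrix, cells * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_matrix);
    cudaFree(d_value);
    cudaFree(d_cell);
    return result;
}
```

With real data the order of the additions is not deterministic, so floating-point results can differ slightly from run to run.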

lars · January 31, 2009, 10:57pm

I’d also be interested in how to properly implement a semaphore/mutex with atomic instructions in CUDA. I need to update a global data structure for something like 0.1% or less of the pixels in a megapixel frame, and I suspect atomics would be more efficient than collecting the results using a scan operation. How expensive are atomic operations?

/Lars
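For a sparse update like the one described above, a mutex may not be needed at all: each qualifying thread can reserve a unique output slot with a single `atomicAdd` on a counter. This is a minimal sketch under assumed details. The names `collect_hits` and `run_collect_hits`, the threshold test, and the synthetic frame (every `stride`-th pixel set to 1.0f) are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical kernel: appends the index of every pixel above `threshold` to `out`.
// atomicAdd returns the old counter value, which serves as a unique slot.
__global__ void collect_hits(const float *pixel, int n, float threshold,
                             int *out, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && pixel[i] > threshold) {
        int slot = atomicAdd(count, 1);
        out[slot] = i;  // out is sized for n entries, so slot is always in range
    }
}

// Hypothetical host helper: marks every `stride`-th pixel and returns the
// collected indices, sorted because the append order is not deterministic.
std::vector<int> run_collect_hits(int n, int stride)
{
    std::vector<float> h_pixel(n, 0.0f);
    for (int i = 0; i < n; i += stride) h_pixel[i] = 1.0f;

    float *d_pixel = nullptr;
    int *d_out = nullptr, *d_count = nullptr;
    cudaMalloc(&d_pixel, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_pixel, h_pixel.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    collect_hits<<<(n + 255) / 256, 256>>>(d_pixel, n, 0.5f, d_out, d_count);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> hits(count);
    cudaMemcpy(hits.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    std::sort(hits.begin(), hits.end());
    cudaFree(d_pixel);
    cudaFree(d_out);
    cudaFree(d_count);
    return hits;
}
```

All hits contend on the same counter, so this fits best when hits are rare, as in the 0.1% case; whether it beats a scan-based compaction would still need to be measured.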

E.D_Riedijk · February 1, 2009, 2:27am

If no two threads write to the same memory location, I believe they are not that expensive. I have never really benchmarked it, though.

Scan is very fast btw. You probably have to try both implementations and benchmark them for your specific case.

MisterAnderson42 · February 1, 2009, 1:49pm

Someone benchmarked coalesced atomic operations at 1/4 the speed of normal coalesced writes, IIRC. I’m too lazy to run the benchmark myself right now. Though 98% of the time atomics are used in uncoalesced situations anyway, and who knows, the speed there may be little different.

Edit: I can confirm that atomics with few collisions are pretty efficient. I was just writing some code the other day for a “sort-of” histogram using atomics. With completely random data, the code ran in 0.5ms. With the data sorted so all ~20 items in each bin are in the same block, it ran in 4ms.
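The slowdown with sorted data is consistent with many threads of one block colliding on the same bin. One common way to reduce that contention on global memory is to build a per-block histogram in shared memory first and merge it once at the end. This is a sketch of that technique, not necessarily what the poster's code did. The names `histogram_privatized`, `run_histogram`, and `BINS` are hypothetical, and the input is assumed to already hold a bin index below `BINS` for each item.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BINS = 64;  // hypothetical bin count; must fit in shared memory

// Hypothetical kernel: each block counts into its own shared-memory histogram,
// then adds its non-zero bins to the global histogram.
__global__ void histogram_privatized(const unsigned int *bin_of, int n, unsigned int *hist)
{
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop; collisions inside a block stay in shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[bin_of[i]], 1u);
    __syncthreads();

    // One global atomic per non-empty bin per block.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        if (local[b] != 0) atomicAdd(&hist[b], local[b]);
}

// Hypothetical host helper: item i falls into bin i % BINS.
std::vector<unsigned int> run_histogram(int n)
{
    std::vector<unsigned int> h_bin(n);
    for (int i = 0; i < n; ++i) h_bin[i] = static_cast<unsigned int>(i % BINS);

    unsigned int *d_bin = nullptr, *d_hist = nullptr;
    cudaMalloc(&d_bin, n * sizeof(unsigned int));
    cudaMalloc(&d_hist, BINS * sizeof(unsigned int));
    cudaMemcpy(d_bin, h_bin.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, BINS * sizeof(unsigned int));

    histogram_privatized<<<8, 256>>>(d_bin, n, d_hist);

    std::vector<unsigned int> result(BINS);
    cudaMemcpy(result.data(), d_hist, BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_bin);
    cudaFree(d_hist);
    return result;
}
```

Whether privatization helps depends on the bin count and the collision pattern, so it is worth timing against the plain global-atomic version on the actual data.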

tmurray · February 2, 2009, 6:00pm

I seem to recall that atomics are always uncoalesced.