I tried to implement a semaphore in CUDA, in order to get a common global variable updated by all threads, with the following code:
unsigned int green = 1;
while (green == 1) {
    green = atomicExch(&SEMAPHORE, 1);
} // Wait for green light...
MAE += (float)fabs(Error);
RMSE += (float)(Error * Error);
ENF++;
atomicExch(&SEMAPHORE, 0); // Set semaphore back to GREEN
but I get a run-time error as soon as I run it with more than one thread.
In this code, the variable “green” is set to 0 only if SEMAPHORE was 0; all other threads, finding SEMAPHORE=1, keep polling.
Any ideas?
What is the right way to implement a semaphore (contended by all threads) in CUDA?
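One likely cause of the hang, for what it's worth: on GPUs older than Volta, the threads of a warp share a program counter, so the threads spinning in the `while` loop can keep the one thread that holds the lock from ever reaching the release. A commonly used workaround is to put the critical section inside the loop body, so the winner acquires, updates, and releases in a single pass. The sketch below assumes that pattern. The `Stats` struct, the `accumulate` kernel, and the test data are hypothetical names made up for illustration, and the pattern still depends on how the compiler lays out the branch, so treat it as something to try rather than a guarantee.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical container for the lock and the three accumulators.
struct Stats {
    int lock;            // 0 = free (green), 1 = held (red)
    float mae;           // running sum of |Error|
    float rmse;          // running sum of Error^2
    unsigned int enf;    // number of contributions
};

__global__ void accumulate(const float *error, int n, Stats *s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float e = error[i];
    volatile Stats *vs = s;  // volatile: force real loads/stores inside the lock
    bool done = false;

    while (!done) {
        // Try to take the lock; atomicCAS returns the previous value.
        if (atomicCAS(&s->lock, 0, 1) == 0) {
            // Critical section: only the lock holder gets here.
            vs->mae  = vs->mae + fabsf(e);
            vs->rmse = vs->rmse + e * e;
            vs->enf  = vs->enf + 1;
            __threadfence();           // publish the updates before unlocking
            atomicExch(&s->lock, 0);   // release the lock
            done = true;
        }
    }
}

int main()
{
    const int n = 1024;
    std::vector<float> h_error(n, 0.5f);   // made-up input data

    float *d_error = nullptr;
    Stats *d_stats = nullptr;
    cudaMalloc(&d_error, n * sizeof(float));
    cudaMalloc(&d_stats, sizeof(Stats));
    cudaMemcpy(d_error, h_error.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_stats, 0, sizeof(Stats));  // lock free, accumulators zero

    accumulate<<<(n + 255) / 256, 256>>>(d_error, n, d_stats);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    Stats h_stats;
    cudaMemcpy(&h_stats, d_stats, sizeof(Stats), cudaMemcpyDeviceToHost);
    printf("sum|e| = %f, sum e^2 = %f, count = %u\n",
           h_stats.mae, h_stats.rmse, h_stats.enf);

    cudaFree(d_error);
    cudaFree(d_stats);
    return 0;
}
```

A lock like this serializes every thread, so it is slow under heavy contention. For plain sums and counters such as these, calling `atomicAdd` on each variable directly avoids the lock entirely (the `float` overload needs compute capability 2.0 or later).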
I’d also be interested in how to properly implement a semaphore/mutex with atomic instructions in CUDA. I need to update a global data structure for something like 0.1% or less of the pixels in a megapixel frame, and I suspect atomics would be more efficient than collecting the results using a scan operation. How expensive are atomic operations?
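For a sparse update like the one described here, one lock-free option is to reserve output slots with `atomicAdd` on a global counter, so only the few qualifying threads ever touch the atomic. This is only a sketch: the `collect` kernel, the threshold test standing in for "interesting pixel", and the buffer sizes are all assumptions made for illustration. It says nothing about how the cost compares to a scan on any given GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each qualifying thread reserves one slot in `hits` via an atomic counter.
__global__ void collect(const float *pixels, int n, float threshold,
                        int *hits, unsigned int *hit_count, unsigned int capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && pixels[i] > threshold) {
        unsigned int slot = atomicAdd(hit_count, 1u);  // returns the old count
        if (slot < capacity) hits[slot] = i;           // drop overflow silently
    }
}

int main()
{
    const int n = 1 << 20;                       // roughly a megapixel frame
    const unsigned int capacity = n / 100;       // assumed upper bound on hits
    std::vector<float> h_pixels(n, 0.0f);
    for (int i = 0; i < n; i += 1000) h_pixels[i] = 1.0f;  // about 0.1% set

    float *d_pixels = nullptr;
    int *d_hits = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_pixels, n * sizeof(float));
    cudaMalloc(&d_hits, capacity * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_pixels, h_pixels.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(unsigned int));

    collect<<<(n + 255) / 256, 256>>>(d_pixels, n, 0.5f, d_hits, d_count, capacity);
    cudaDeviceSynchronize();

    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    if (h_count > capacity) h_count = capacity;  // counter may overshoot the buffer
    printf("collected %u pixel indices\n", h_count);

    cudaFree(d_pixels);
    cudaFree(d_hits);
    cudaFree(d_count);
    return 0;
}
```

The order of the entries in `hits` is not deterministic, since it depends on which thread reaches the counter first.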
Someone benchmarked coalesced atomic operations at 1/4 the speed of normal coalesced writes, IIRC. I’m too lazy to run the benchmark myself right now. Then again, 98% of the time atomics are used in uncoalesced situations anyway, and who knows, the speed there may be little different.
Edit: I can confirm that atomics with few collisions are pretty efficient. I was just writing some code the other day for a “sort-of” histogram using atomics. With completely random data, the code ran in 0.5 ms. With the data sorted so that all ~20 items in each bin are in the same block, it ran in 4 ms.
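For anyone who wants to try the same kind of experiment, here is a minimal histogram sketch built on `atomicAdd`. It is not the code from the post above; the `histogram` kernel, the bin count, and the pseudo-random key generator are assumptions for illustration. Feeding it sorted keys instead of random ones should reproduce the high-collision case.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per key; colliding threads are serialized by the hardware atomic.
__global__ void histogram(const unsigned int *keys, int n,
                          unsigned int *bins, unsigned int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&bins[keys[i] % num_bins], 1u);
}

int main()
{
    const int n = 1 << 20;
    const unsigned int num_bins = 1 << 16;   // assumed: about 16 keys per bin

    // Simple linear congruential generator for pseudo-random keys.
    std::vector<unsigned int> h_keys(n);
    unsigned int state = 12345u;
    for (int i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;
        h_keys[i] = state >> 8;
    }

    unsigned int *d_keys = nullptr, *d_bins = nullptr;
    cudaMalloc(&d_keys, n * sizeof(unsigned int));
    cudaMalloc(&d_bins, num_bins * sizeof(unsigned int));
    cudaMemcpy(d_keys, h_keys.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, num_bins * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(d_keys, n, d_bins, num_bins);
    cudaDeviceSynchronize();

    // Sanity check: the bin counts should add up to the number of keys.
    std::vector<unsigned int> h_bins(num_bins);
    cudaMemcpy(h_bins.data(), d_bins, num_bins * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    unsigned long long total = 0;
    for (unsigned int b = 0; b < num_bins; ++b) total += h_bins[b];
    printf("total counted = %llu of %d keys\n", total, n);

    cudaFree(d_keys);
    cudaFree(d_bins);
    return 0;
}
```

Timing the kernel with `cudaEvent_t` markers around the launch, once with random keys and once with sorted keys, would show how much collisions cost on a particular card.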