Is there any way to make a critical section in CUDA to synchronize concurrent global memory writes without having to use a reduction operation?
Imagine I want to find the greatest number in a 2M-number array… and I really, really don't want to perform a reduction >< I would need something like a critical section to update the greatest value found.
On compute 1.1 hardware you can use the atomic* operations, though only for integers. But in your case of finding the maximum, a reduction will be much faster than using only atomic operations. Combining the two, with each block doing a reduction and then a single atomic operation per block to merge its result, could potentially increase performance, since a second kernel pass would not be required.
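To make the "reduction plus atomics" idea concrete, here is a minimal sketch, assuming integer data, a block size of 256 and a grid of 64 blocks. The names `maxKernel`, `gpuMax` and `BLOCK` are made up for illustration and are not from any library. Each block reduces its share in shared memory, then one thread per block calls `atomicMax` on the global result.

```cuda
#include <cstdio>
#include <climits>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed; must be a power of two)

// Each block reduces its slice in shared memory, then thread 0
// merges the block maximum into *result with a single atomicMax.
__global__ void maxKernel(const int *in, int n, int *result)
{
    __shared__ int s[BLOCK];
    int tid = threadIdx.x;
    int best = INT_MIN;

    // Grid-stride loop so a fixed grid covers an array of any size.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        best = max(best, in[i]);

    s[tid] = best;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = max(s[tid], s[tid + stride]);
        __syncthreads();
    }

    // One atomic per block instead of one per element.
    if (tid == 0)
        atomicMax(result, s[0]);
}

// Hypothetical host helper: copies the data over, runs the kernel once,
// and returns the maximum. Error checking is omitted for brevity.
int gpuMax(const int *h, int n)
{
    int *dIn, *dOut, init = INT_MIN, out = INT_MIN;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, h, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, &init, sizeof(int), cudaMemcpyHostToDevice);

    maxKernel<<<64, BLOCK>>>(dIn, n, dOut);  // 64 blocks is an arbitrary choice

    cudaMemcpy(&out, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}

int main()
{
    const int n = 2 * 1024 * 1024;  // the "2M numbers" from the question
    int *h = new int[n];
    for (int i = 0; i < n; ++i) h[i] = i % 1000;
    h[1234567] = 424242;            // plant a known maximum

    printf("max = %d\n", gpuMax(h, n));
    delete[] h;
    return 0;
}
```

With this layout only 64 atomic operations hit global memory, so the contention that makes an atomics-only version slow mostly goes away. Note this only works for integers on this hardware; for floats you would need the multi-kernel approach instead.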
Many of us have been down this road before, and sadly the answer is no. But there is a very efficient reduction code on the gpgpu site for CUDA (which guys from NVIDIA wrote). A global version of __syncthreads() isn't possible because of hardware limitations: not all blocks are guaranteed to be resident on the GPU at the same time, so blocks waiting at the barrier can keep the remaining blocks from ever being scheduled, and the result is a deadlock.
Since it is possible to call kernels asynchronously (meaning fire and forget), the cost of calling more kernels is pretty low.
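As a rough sketch of what "just call another kernel" looks like for the max problem, assuming float data (where the integer atomics don't help), a block size of 256 and 64 blocks in the first pass. The names `blockMax` and `gpuMaxTwoPass` are invented for this example. The kernel boundary between the two launches acts as the global sync.

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed; must be a power of two)

// Writes one partial maximum per block to out[blockIdx.x].
__global__ void blockMax(const float *in, int n, float *out)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;
    float best = -FLT_MAX;

    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        best = fmaxf(best, in[i]);

    s[tid] = best;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];
}

// Hypothetical host helper. Error checking is omitted for brevity.
float gpuMaxTwoPass(const float *h, int n)
{
    const int blocks = 64;  // arbitrary first-pass grid size
    float *dIn, *dPartial, *dOut, out = -FLT_MAX;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Both launches return to the host right away; on the same stream
    // the second one does not start until the first has finished.
    blockMax<<<blocks, BLOCK>>>(dIn, n, dPartial);   // pass 1: per-block maxima
    blockMax<<<1, BLOCK>>>(dPartial, blocks, dOut);  // pass 2: reduce the partials

    // This copy waits for the kernels to complete.
    cudaMemcpy(&out, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dOut);
    return out;
}

int main()
{
    const int n = 2 * 1024 * 1024;
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (float)(i % 1000);
    h[777777] = 5000.0f;  // plant a known maximum

    printf("max = %f\n", gpuMaxTwoPass(h, n));
    delete[] h;
    return 0;
}
```

The second launch is tiny (one block over 64 values), so what you mostly pay for is the launch overhead itself.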
Maybe some day they will add a global sync … but don’t count on it ;)
Reduction is pretty cool, much cooler than I thought. It’s fast, you’d be surprised. I would like to point out the very nice CUDPP library – this will do reductions for you.
Even if you could do a global sync among all threads, it would in effect serialize all ~10,000 threads you are running and remove any benefit from the thread interleaving of the data-parallel architecture.
Yes, well, sometimes you have to do this. I have to do it 300 times in my solver. And since the alternative is launching 300 kernels for a total compute time of less than 8 ms, any alternative that would cost less would be welcome, even if it removes the thread interleaving at that specific point.