concurrent memory writes

Is there any way to make a critical section in CUDA to synchronize concurrent global memory writes without having to use a reduction operation?

Imagine I want to find the greatest number in a 2M-element array… and I don’t want to perform a reduction >< I would need something like a critical section to update the greatest value found.


On compute 1.1 hardware you can use the atomic* operations, though only for integers. But in your case of finding the maximum, a reduction will be much faster than using only atomic operations. Combining a reduction with an atomic operation after the block-level reduction could potentially increase performance, since another kernel pass would not be required.
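To illustrate the combination described above, here is a minimal sketch (not any particular forum poster's code): each block does a shared-memory tree reduction on its chunk, then thread 0 folds the block's result into the global maximum with a single atomicMax. It assumes compute capability 1.1 or later and 32-bit integer data; the kernel name and parameters are illustrative.

```cuda
#include <climits>

// One atomicMax per block instead of one per element:
// reduce within the block first, then update the global result.
__global__ void blockMaxThenAtomic(const int *in, unsigned int n, int *globalMax)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    // load one element per thread; pad out-of-range threads with INT_MIN
    sdata[tid] = (i < n) ? in[i] : INT_MIN;
    __syncthreads();

    // tree reduction in shared memory (blockDim.x must be a power of two)
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = max(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // fold this block's maximum into the global result atomically
    if (tid == 0)
        atomicMax(globalMax, sdata[0]);
}
```

With this scheme only gridDim.x atomics hit global memory, and no second kernel launch is needed to combine the per-block results.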

Well, to keep my app compatible with old G80 cards I really cannot use atomic operations.

About the reduction I just wanted to skip it if possible… to code less lines :D

I have a kernel done with reductions, but I was wondering if it could be done without them, just with __syncthreads() and some clever tricks.

As many of us have been down this road before: sadly, the answer is no. There is a very efficient reduction code for CUDA on the GPGPU site (written by the NVIDIA guys). A global thread sync isn’t possible because of hardware limitations and would result in a deadlock.
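The key point is that __syncthreads() only synchronizes threads within one block. A minimal sketch of the multi-pass approach (in the spirit of the SDK reduction sample, not a copy of it): each block reduces its chunk to a single value, and the kernel is relaunched on the per-block partials until one value remains. Kernel name and parameters are illustrative.

```cuda
#include <climits>

// Per-block max reduction: writes one partial maximum per block.
// Run repeatedly (input -> partials -> ... -> single value) since
// there is no global barrier across blocks within one launch.
__global__ void maxReduce(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : INT_MIN;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = max(sdata[tid], sdata[tid + s]);
        __syncthreads();   // legal: every thread in the block reaches it
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];   // one partial max per block
}
```

This works on compute 1.0 hardware (no atomics needed), which matters for the G80 compatibility mentioned above.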

Since it is possible to call kernels asynchronously (meaning fire and forget), the cost of launching additional kernels is pretty low.
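As a host-side sketch of what "fire and forget" buys you: all the passes of a multi-pass reduction can be queued back to back with no synchronization in between, so the extra launches cost little. This assumes a hypothetical kernel maxReduce that writes one partial maximum per block, and a scratch buffer for ping-ponging partial results.

```cuda
// d_bufA holds the n input values; d_bufB is scratch space of at
// least (n + threads - 1) / threads ints. After the loop, the
// maximum sits in src[0] (copy it back with cudaMemcpy).
extern __global__ void maxReduce(const int *in, int *out, unsigned int n);

void findMax(int *d_bufA, int *d_bufB, unsigned int n)
{
    const unsigned int threads = 256;          // power of two
    size_t smem = threads * sizeof(int);
    int *src = d_bufA, *dst = d_bufB;

    while (n > 1) {
        unsigned int blocks = (n + threads - 1) / threads;
        // each launch is queued asynchronously; the GPU runs them in order
        maxReduce<<<blocks, threads, smem>>>(src, dst, n);
        n = blocks;
        int *tmp = src; src = dst; dst = tmp;  // ping-pong buffers
    }
}
```

For a 2M-element array at 256 threads per block this takes only three launches (2M → 8192 → 32 → 1), all queued before the first pass even finishes.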

Maybe some day they will add a global sync … but don’t count on it ;)

Reduction is pretty cool, much cooler than I thought. It’s fast, you’d be surprised. I would like to point out the very nice CUDPP library – this will do reductions for you.

Even if you could do a global sync among all threads, it would in effect serialize all ~10,000 threads you are running and remove any benefit from the thread interleaving of the data-parallel architecture.

Actually, CUDPP does not yet support plain reductions; it only supports scans (prefix sums). That is a bit of an expensive way to do a reduction.

I’m currently working on adding fast reduction support to CUDPP.

Thanks for the compliment!


Oh, duh, yes, I was talking about compaction. Sorry for the misleading comment. The compliment stands :)

Yes, well, sometimes you have to do this. I have to do it 300 times in my solver, and since the alternative is launching 300 kernels for a total compute time of less than 8 ms, any alternative that would cost less would be welcome, even if it removes the thread interleaving at that specific point.