help! is there a mutex in cuda?

soloman · November 25, 2007, 2:32pm

Hello, I’m new to CUDA. I have an algorithm in which each thread calculates a cost value (a float), and I want to use a location in global memory to record the minimum value. What I want is something like:

__global__ void work(...)
{
    // get information and data from the block and grid coordinates...
    float cost = ......; // calculate the float cost
    lock(memory);
    if (cost < memory.min)
        memory.min = cost;
    unlock(memory);
}

Can anyone give me an idea of how to implement this in CUDA?

paulius · November 25, 2007, 9:48pm

If memory.min is in global memory and is an integer, and your GPU has compute capability 1.1, you can use atomics. Otherwise use a reduction approach: look at the scan sample in the SDK. Also, check out the Supercomputing 07 CUDA presentation; it discusses reduction optimization in great detail.

Paulius

asadafag · November 26, 2007, 2:51am

Excuse me, but isn’t the reduction sample more appropriate? Also, isn’t reduction faster than atomics?

soloman · November 26, 2007, 3:19am

Thanks for your suggestion, but the problem is that my memory.min is a float, and it may become a fixed-size array that records the 10 smallest values, so I cannot use atomics. In fact, my critical region does something more complex:

memory.size is float*, which represents an array of floats; the max size is 10.
In the critical region, I first want to check whether all 10 entries are filled; if the array is not full, I add the value to it and increase the size count.

So I don’t think I can count on atomics. I’ll look at reduction optimization. Thanks again.

paulius · November 26, 2007, 7:27pm

I haven’t compared the two, so I’m not sure. If you have perf results it’d be interesting to hear.

Paulius

asadafag · November 27, 2007, 2:51am

To soloman:
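An int min can be used to compute a float min. The two are even equivalent when the values are non-negative, because non-negative floats order the same way as their bit patterns read as integers.
Also, if you want the 10 smallest values, you can use the partition-based O(n) selection algorithm on the costs. It may be more efficient than maintaining a heap, even on the CPU.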

To paulius:
Sadly, I’m not using a 1.1-capable card. My colleague’s experience seemed to imply that reduction is faster, but he never had the time to do a decent perf comparison.
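To make the "int min for float min" remark above concrete, here is a minimal sketch, not a tested implementation. It assumes every cost is a non-negative, non-NaN float (and not -0.0f, whose bit pattern reads as a negative integer) and that the GPU supports integer atomicMin on global memory (compute capability 1.1 or later). The names atomicMinNonNegFloat, minCostKernel and gpuMinNonNeg are hypothetical and do not come from the thread.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Hypothetical helper: atomic min for NON-NEGATIVE floats only.
// For such values the IEEE 754 bit pattern, read as a signed int,
// orders the same way as the float itself, so integer atomicMin works.
__device__ void atomicMinNonNegFloat(float* addr, float value)
{
    atomicMin(reinterpret_cast<int*>(addr), __float_as_int(value));
}

// Each thread offers one cost. *result must start at a large value.
__global__ void minCostKernel(const float* costs, int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMinNonNegFloat(result, costs[i]);
}

// Hypothetical host wrapper: copies the costs over and returns the minimum.
// Error checking is omitted to keep the sketch short.
float gpuMinNonNeg(const float* hostCosts, int n)
{
    float* dCosts = nullptr;
    float* dResult = nullptr;
    float result = FLT_MAX;  // initial value, larger than any finite cost

    cudaMalloc(&dCosts, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dCosts, hostCosts, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, &result, sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    minCostKernel<<<blocks, threads>>>(dCosts, n, dResult);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dCosts);
    cudaFree(dResult);
    return result;
}
```

Every thread hits the same address here, so this shows only the bit trick, not a fast design. Negative costs would need a different mapping from float to int, which this sketch does not attempt.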

Mark_Harris · November 29, 2007, 10:25pm

I’m planning to add a new kernel to the “reduction” sample for CUDA 1.1 cards that will use atomics (one per thread block) rather than launching recursive blocks. I’m assuming that with only one atomic op per thread block it will be fairly efficient, but I haven’t tested it yet.

The drawback is of course that we don’t have floating point atomics in current hardware. (Floating point atomics raise interesting questions since floating point arithmetic is non-associative.)

As for whether atomics are “slower than reductions”: if every thread tries to atomically add to the same value, that would definitely be slow (you are basically serializing the computation at the memory interface). But if you do a parallel reduction in shared memory, combined with an atomic per block, results should be much better. The lowest cost algorithm will always be a combination of data-parallel and serial computation. Intuitively: data-parallel computation reduces the number of steps of computation, while serial computation reduces the number of processors required. Cost = steps * processors. Formally: this is Brent’s Theorem (or at least a corollary to it). :)

Mark
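The "parallel reduction in shared memory, combined with an atomic per block" idea from the post above might look roughly like the sketch below. It is not the kernel from the SDK "reduction" sample; it is an illustration that assumes integer data (matching the integer-only atomics described above), a power-of-two block size, and hypothetical names (BLOCK_SIZE, blockMinKernel, gpuBlockMin).

```cuda
#include <cuda_runtime.h>
#include <climits>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two

// Min reduction inside each block, then a single atomicMin per block.
__global__ void blockMinKernel(const int* data, int n, int* result)
{
    __shared__ int partial[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads contribute the identity element for min.
    partial[tid] = (i < n) ? data[i] : INT_MAX;
    __syncthreads();

    // Tree reduction: the number of active threads halves each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] = min(partial[tid], partial[tid + stride]);
        __syncthreads();
    }

    // Only one thread per block issues a global atomic.
    if (tid == 0)
        atomicMin(result, partial[0]);
}

// Hypothetical host wrapper; error checking omitted for brevity.
int gpuBlockMin(const int* hostData, int n)
{
    int* dData = nullptr;
    int* dResult = nullptr;
    int result = INT_MAX;

    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dResult, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dResult, &result, sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    blockMinKernel<<<blocks, BLOCK_SIZE>>>(dData, n, dResult);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dResult, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dResult);
    return result;
}
```

With this layout the contended global atomic runs once per block rather than once per thread, which is the serial part of the "data-parallel plus serial" split described above.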

seb · November 29, 2007, 10:53pm

The only thing you need to implement mutex functionality is atomic reads and writes. So you could try to implement it yourself.

However, I suspect that it will be awfully slow.
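As a rough illustration of building a mutex from atomics, the sketch below uses a spin lock on an integer in global memory. It rests on several assumptions: it uses atomicCAS, atomicExch and __threadfence from current CUDA toolkits (not necessarily available when this thread was written), and it lets only one thread per block take the lock, since several threads of the same warp spinning on one lock can hang on some GPUs. The names acquireLock, releaseLock, lockedMinKernel and gpuLockedMin are hypothetical, and each block's cost is simply read from an array.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Spin until the lock word changes from 0 (free) to 1 (held).
__device__ void acquireLock(int* mutex)
{
    while (atomicCAS(mutex, 0, 1) != 0) { /* busy-wait */ }
    __threadfence();  // order the critical section after the acquire
}

__device__ void releaseLock(int* mutex)
{
    __threadfence();  // make our writes visible before releasing
    atomicExch(mutex, 0);
}

// One cost per block; only thread 0 of each block enters the critical region.
// minCost is volatile so the value is re-read from memory under the lock.
__global__ void lockedMinKernel(const float* blockCosts, int* mutex,
                                volatile float* minCost)
{
    if (threadIdx.x == 0) {
        float cost = blockCosts[blockIdx.x];
        acquireLock(mutex);
        if (cost < *minCost)
            *minCost = cost;
        releaseLock(mutex);
    }
}

// Hypothetical host wrapper; error checking omitted for brevity.
float gpuLockedMin(const float* hostCosts, int numBlocks)
{
    float* dCosts = nullptr;
    float* dMin = nullptr;
    int* dMutex = nullptr;
    float result = FLT_MAX;

    cudaMalloc(&dCosts, numBlocks * sizeof(float));
    cudaMalloc(&dMin, sizeof(float));
    cudaMalloc(&dMutex, sizeof(int));
    cudaMemcpy(dCosts, hostCosts, numBlocks * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dMin, &result, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dMutex, 0, sizeof(int));  // lock starts out free

    lockedMinKernel<<<numBlocks, 32>>>(dCosts, dMutex, dMin);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dMin, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dCosts);
    cudaFree(dMin);
    cudaFree(dMutex);
    return result;
}
```

The critical region is fully serialized across blocks, which is consistent with the suspicion above that a hand-rolled mutex will be slow. The same lock could guard a more complex update such as a fixed-size array of smallest values.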
