help! is there a mutex in cuda?

hello, I’m new to cuda, now, I got an algorithm, which each thread calculate a cost value (float type), and I want to use a global memory to record the minimal value. so something I want is:

__gloabl__ void work(...)


     // get information and data by the block and grid coordingates...

    float cost = ...... // calculate the float cost;


     if (cost < memory.min)

         memory.min = cost;



Does anyone who can give me any idea on how to implement this in cuda?

If memory.min is in global memory and and an integer, and your GPU has compute capability 1.1, you can use atomics. Otherwise use a reduction approach - look at the scan sample in the SDK. Also, check out the Supercomputing 07 CUDA presentation, it discusses reduction optimization in greate detail.


Excuse me, but isn’t the reduction sample more appropriate? Also, isn’t reduction faster than atomics?

thanks for your suggest, but question is, my memory.min is float , and maybe an fixed sized array, to record the min 10 value, so I cannot use atomics. In fact, my critical region has more complex operation:

  1. memory.size is float*, which represent an array of float, maxsize is 10

  2. in critical region, I want to first check if 10 entries is filled, then if not full, add value into that array, and increase size count.

so I don’t think I can count on atomics. well, I’ll look at reduction optimization, thanks again.

I haven’t compared the two, so I’m not sure. If you have perf results it’d be interesting to hear.


To soloman:
Int min can be used to compute float min. They’re even equivalent when non-negative.
Also, if you want the 10 smallest value, you can use the partition based O(n) algorithm on the costs. It may be more efficient than maintaining a heap, even on CPU.

To paulius:
Sadly, I’m not using a 1.1 capable card. My colleague’s experience seemed to imply reduction to be faster, but he never had the time to do a descent perf.

I’m planning to add a new kernel to the “reduction” sample for CUDA 1.1 cards that will use atomics (one per thread block) rather than launching recursive blocks. I’m assuming that with only one atomic op per thread block it will be fairly efficient, but I haven’t tested it yet.

The drawback is of course that we don’t have floating point atomics in current hardware. (Floating point atomics raise interesting questions since floating point arithmetic is non-associative.)

As for whether atomics are “slower than reductions”: if every thread tries to atomically add to the same value, that would definitely be slow (you are basically serializing the computation at the memory interface). But if you do a parallel reduction in shared memory, combined with an atomic per block, results should be much better. The lowest cost algorithm will always be a combination of data-parallel and serial computation. Intuitively: data-parallel computation reduces the number of steps of computation, while serial computation reduces the number of processors required. Cost = steps * processors. Formally: this is Brent’s Theorem (or at least a corollary to it). :)


The only things you need to program mutex functionality is atomic reads and writes. So you could try to implement it yourself.

However I suspect that it will be awfully slow.