Mutual exclusion or Reduction on global memory?

Hi guys,

I need ideas to resolve this problem

The thread’s writes sparsely to its pixel and nearby pixels with radius of 1. The issue is I need to ensure that no write conflicts exist.

Nearby pixels would do simultaneous writes to their neighbours, e.g. the blue area in between the 2 dark blue pixels.

I added a diagram for visualising my problem


  1. One thread has its own submatrix and do reduction
    Issue: Matrix size is 400x400 pixels
    So each thread would have its own 400x400
    That means the total size would be 400x400x400x400 which is ridiculous

  2. Each block writes to a shared sub matrix of size say 8x8 pixels
    Issue: How do we write to neighboring pixels in another block?

    Then each block copy its shared submatrix to the one in global memory

Help please

1.1 hardware and later supports atomics in global memory. This means 8800 GTX and Tesla S870 do not work, but many 8-series and all 9-series do. Consult the table here:…rocessing_units

Read the Programming Guide about the atomic operations and come here if you have more questions. (Section 4.4.4 and Appendix C)

Also read general programming stuff about how to use atomic operations to implement mutexs. You would probably use atomicCAS() (aka “compare and exchange”).

P.S. The new G200 can do atomics on shared memory, which may be useful to you since that will be much better performance.

Does each thread have its own unique center pixel, and will read and write the 3x3 pixels around it? Or can two threads have the SAME center pixel?

If it’s the first case, you can solve the problem by just doing a syncthreads() after each of the 9 read-update-writes. Since each thread has its own unique pixel, then none of the 9 writes will collide if they’re done in sync. So for example if all threads write to relative (-1, -1) at the same time, THOSE will be unique pixels as well. So no need for atomics or reduction.

If you have a chance for some threads to have the same center pixel as others, it’s trickier. I have a solution for that too, using a hash table and readback to do “massive parallel sparse atomics”… basically hash the pixel ID into an index. Write the thread ID into global memory with that index. Syncthreads. Read back the written data. If your thread ID is still there, you “won” the write collision battle, and it’s safe to do the first method above using all of the “winners”. If a thread “lost” the atomic battle, you could do those with another round (if there’s a lot) or with global atomics (if there’s only a few.)

Thanks for the replies guys

Yes, it has its own unique pixel and writes to 3x3 around it.

No, two threads can not have the same center pixel.

This is great, however, if the thread’s center pixel is at the edge of the image. Only the center of the [3,3] would be written to it. Am I right thinking that I can’t use __syncthreads() in this condition?

I’ll look into this as a possible synchronization method for blocks that contain edges that can’t syncthreads

Another issue I just thought of is what happens if Block A writes to a pixel in Block B AND at the same time that pixel is being written by a thread within the same Block B.

That means __syncthreads would fail to eliminate write conflicts right?

A solution that appeals to me is to have a NxN mutex where each location represents each pixel. What do you guys think of that?

You can test those edge conditions per-update. That’s going to be a very small cost. You do have to watch out for __syncthreads() inside a test, though… you can get undefined behavior (usually hangs) unless all threads of a block hit the same __syncthreads(). So what you do is test for the write bounds, then do the write, and OUTSIDE that test scope do your syncthreads().

So it looks like coordinating things so that threads IN A BLOCK don’t interfere is easy.

The problem of keeping different BLOCKS from interfering remains, but that has several solutions too.

Here you may know something about each block’s work (if each block is a region of the screen, maybe you know that they won’t interfere!).

If the blocks all have random pixels, you have to make sure they don’t step on each other. Perhaps you could solve it with a global mutex… using global atomics not per-thread, but per-block, which would have pretty low overhead. You can make a global mutex by intializing one global value to “1”, perhaps on the host before your kernel launch.

Your “update pixels” code will just do something like this:

/* Wait for global mutex. Only thread 0 of the block needs to grab it. */

if (0==thread.x)  while (0==AtomicExch(&global_mutex, 0));  

/* execution that reaches here is now in a block-level mutex! */

/* safely update pixel + neighborhood for this block */

for (int dx=-1; dx<2; ++dx) 

  for (int dy=-1; dy<2; ++dy) {

    if (x+dx>=0 && x+dx<maxX && y+dy>=0 && y+dy<maxY) {

       /* update pixel  (x+dx, y+dy) here. This may be a read + write, or just a write. */

      } // update scope

     } // inrange check scope



 /* block has now updated all of its pixels. Release the mutex. */

if (0==thread.x) AtomicExch(&global_mutex, 1);

You can do it without any global locking, too. Just have a separate memory area where you store the lower and right border pixels of each block (one row/column inside the block and one outside) instead of the real image. Well, and another two areas for the corner pixels.

Then start a second kernel that merges those border values together with the existing value there. No locking, completely parallelizable, but one more kernel call overhead.

Or your idea, just done in the “not insane” way:

Why each thread? You can never have more than 9 threads trying to write the same location, so you need only 9x400x400, one each for each neighboring pixel a thread might write (i.e. writes to the upper left go to memory area 0, writes to directly above to area 1 etc.). Second step is a reduction of those 9 areas.

Probably slower than the first suggestion but much, much simpler: no locking/conflicts at all except for the generic reduction in the second step.

I solved the issue by using a global mutex with very low performance hit (an extra 0.01s) because my writes are very sparse.

Why I like this method? It keeps the code looking simple.

The mutex is atomicFloatAdd from Atomic float operations

Thanks guys

This seems to be a surprisingly effective method of optimizing CUDA code :)

Thanks for sharing the solution.

/* Wait for global mutex. Only thread 0 of the block needs to grab it. */

if (0==thread.x) Â while (0==AtomicExch(&global_mutex, 0)); Â 

/* execution that reaches here is now in a block-level mutex! */

/* safely update pixel + neighborhood for this block */

How would you know that the warp containing thread 0 is first to be scheduled? Maybe a warp containing threads 32-63 comes first - hence it would not try to acquire this lock and not continue safely…

I think a synchtreads() after the if (0 == threadIdx.) is in order here

Oh by the way I am currently facing a quite similar problem. I am trying to reduce memory usage of my Hough transform. I am trying to come up with mutexes to lock read-modify-write access to specific global memory regions such that I can safely add intermediate results in there. So far my attempts at implementing those Mutexes have been a disaster.

The atomic float add worked though, but it made kernel execution time skyrocket from 36ms to 150ms… Ugh. ;)


Yep! You’re right, you need the __syncthreads() after the test. That’s what I get for writing code in a reply without actually compiling or testing it. :-)

Yeap, well, it works well on with algorithms that have sparse writes.

Since now I know my code works, thanks to the mutex, I’m going to create shared mem for every block with needed pixels + 1 extra pixel radius. So 3x3 + 1 radius = 4x4. And then, I’ll the kernels write to it. Once computation is over, ill use atomicFloatAdds for areas that overlap. That should reduce times.

EDIT: Thinking again, should I even do it? I’ll enlarge my kernel.

Currently it is:

ptxas info �   : Used 31 registers, 36+0 bytes lmem, 60+56 bytes smem, 88 bytes cmem[1]

Bad local memory. Should I even give it a shot? How do I tell which variables are lmem?


Found out:

nvcc -ptx -I /usr/local/cuda/include/ -L /usr/local/cuda/lib -I /home/zakim/NVIDIA_CUDA_SDK/common/inc/ -L /home/zakim/NVIDIA_CUDA_SDK/lib/ -arch sm_11

What is this?

.local .align 4 .b8 __cuda___cuda_dataCS3244[36];

Is it possible that you express each output pixel as the weighted sum of several input pixels? (That method is sometimes called the read paradigm). For linear kernels (e.g. a convolution) there should be a simple mathematical equivalence. Then each thread can compute exactly one output pixel, for example performing texture reads from the required input pixels.

What you described is sometimes referenced as the “write paradigm”, where for each input pixel you explicitly contribute results to several output pixels.

Pros and cons of Read Paradigm

  • very practical for computing sub-sets of the output image only
  • better suited for CUDA because this allows for a block-wise output
  • may require a lot of memory bandwidth for reading the input data

Pros and cons of the Write paradigm

  • very suitable for sparsely populated input images (may only generate writes for every non-zero input pixel that is set)
  • creates race conditions when multiple writes collide: this can be solved either by mutexes or by having many intermedate result buffers that get merged in a second pass
  • requires lots of memory bandwidth for writing the output data in case of dense input data

I’ve posted a Hough transform kernel to this forum that implements the Write paradigm in two variants. One with high memory requirements and one with mutexes instead.