Shared memory write conflicts Looking for a little help...

I was wondering if anyone could help me solve a problem I’ve encountered in my CUDA code with regards to shared memory write conflicts.

Basic problem is that writes to shared memory aren’t atomic. Because of this, I sometimes get unpredictable values in my “results” array. I added the “new_i” variable so that threads potentially would never write to the same location at the same time. However, this would only be assured to work if all threads were executed 100% in parallel - which I know isn’t the actual case.

I’m thinking I can solve this more understand of warps? I’m fairly new to CUDA programming and haven’t had much experience with the concept.

Here some pseudo-code to explain my issue. It’s being launched with 256 threads per block.

__shared__ float results[256]

// Zero out results[]


for (j=0; j<100; j++)


 // Calculate valid

 if (valid)


    for(i=0; i < 256; i++)


       new_i = (i + threadIdx.x) % 256; 

      // Calculate "value"

      results [new_i] += value;





// Write results to global memory

Any help would be much appreciated. At first glance, the histogram64 SDK example might be useful.

I believe the unexpected results you’re seeing are due to race conditions in your code. As you mentioned, shared memory accesses are not atomic, so it is possible that another thread modified a particular location between reading it and writing of the incremented value (due to dynamic scheduling of warps).

The best thing would be to rewrite the program to where it doesn’t have such critical sections. If that’s not viable, you’ll have to go with one of the solutions used in non-atomic histogram samples.


Are there plans to add atomic writes to shared memory in future compute versions?

We don’t comment on unreleased hardware/software :)


I guess that is the policy :( , but I have the same issue and hope you can have some atomicAdd for float point.

I read the histogram project. For a BIN_COUNT histogram, it requires BIN_COUNT * THREAD_PER_BLOCK storage. And if you have more than 1 block per multiprocessor (which is recommended to have at least 2), that put more strain to the 16KB share memory. For my problem, my “histogram” is in 3-D and it has to be in the global memory. If my problem is a true histogram, I could use atomicAdd, however, the value increment is floating point. So I need atomicAdd() in floating point.

BTW, I understand that for 8800, and your tesla line, it currently on 1.0 compute capability. I assume that is limited by hardware design and can’t be changed, right?

My application has a 3D “Histogram” also. I’ve got an 8800 GTX, so no atomicAdd for me. Tests showed that the scatter memory access pattern of the atomicAdd “histogram” would kill my overall performance anyways. I tried another version of the program that had each block determine the value of a single “bin” of the histogram, but this has O(N^2) memory reads, so it is painfully slow for large datasets.

In the end, I found it faster to copy data back from the card histogram it on the CPU, then copy the histogram back to the card and continue the rest of the application on the GPU.