Global memory access: how to access the same location from different threads sequentially


I have a piece of code like below in a kernel function

A[i] = A[i] + b;

“i” is the array index, which is calculated in each thread. The problem is that “i” can be the same in several threads, so when the kernel executes, multiple threads may access the same memory location at the same time.
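For concreteness, the setting looks roughly like this (the kernel name and the source of the index are made up for illustration):

```cuda
// Hypothetical sketch: several threads can compute the same i,
// so the read-modify-write on A[i] races and updates can be lost.
__global__ void accumulate(double *A, const int *idx, double b, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        int i = idx[t];    // "i" may coincide across threads
        A[i] = A[i] + b;   // not atomic: concurrent updates can be lost
    }
}
```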

As the size of the array A is large, I have to put it into the device global memory, and I also declared it as “volatile”. But the results were not as expected.

Is there some trick that I am missing here? Thanks!



Thanks, but my array is a double-precision one and I am using a GTX 260, which has compute capability 1.3.

According to the CUDA Programming Guide, atomicAdd does not support floating-point operands except on compute capability 2.0 or above.

Is there any other way?

Hi Athlonshi,

Have you seen this thread? They come up with a replacement for atomic add operations on floats. I can hardly say anything about its correctness because I didn’t try it myself, as it seemed too expensive.
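For double precision on your card, the usual replacement is an atomicCAS loop on the value’s bit pattern (64-bit atomicCAS is available from compute capability 1.2, so a GTX 260 qualifies). A sketch along the lines of the one in the CUDA Programming Guide:

```cuda
// Emulated atomicAdd for double via 64-bit atomicCAS.
// Works on compute capability >= 1.2.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add, and try to swap in the new value;
        // retry if another thread changed the location in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

Note that this serializes colliding updates, so it can get slow when many threads hit the same address.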

If threads only need to be synchronized within a warp (maybe you are lucky and this is true for you), the tagging mechanism in the Histogram example will work for you.
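The tagging mechanism looks roughly like this (a sketch of the scheme in the SDK histogram256 sample): each thread stamps its lane ID into the top bits of a shared-memory counter and retries until its own write survives, which serializes colliding writes within a warp:

```cuda
// Per-warp shared-memory counter update with a 5-bit thread tag
// in the upper bits and a 27-bit count in the lower bits.
__device__ void warpIncrement(volatile unsigned int *s_WarpHist,
                              unsigned int bin, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_WarpHist[bin] & 0x07FFFFFFU;  // strip the old tag
        count = threadTag | (count + 1);        // bump and stamp our tag
        s_WarpHist[bin] = count;
    } while (s_WarpHist[bin] != count);         // another lane won? retry
}
// threadTag would be e.g. (threadIdx.x & 31) << 27
```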

Otherwise, to avoid the atomic operations completely, you might write the result of each thread to a temporary buffer B with the size of the thread grid:

1. Suppose you have 5 threads and each thread writes its value b; let’s say B is equal to {7, 15, 3, 3, 5}. Each thread has written to its designated location without memory collisions. Additionally, store the write address in a separate address buffer: e.g. the value b=7 of thread 0 should be added to address ‘5’, the value b=15 which comes from thread 1 should be added to address ‘1’, and so on, which results in A = {5, 1, 2, 1, 1}.
2. Sort both buffers by their write address with the CUDPP radix sort, which results in the two sorted buffers A’ = {1, 1, 1, 2, 5} and B’ = {15, 3, 5, 3, 7}.
3. Now compact both buffers. Compute a marker buffer M = {1, 0, 0, 1, 1}, set according to the following rule: if( A’(i) != A’(i-1) ) M(i) = 1 else M(i) = 0. During this operation, write the current thread index into a separate buffer I = {0, 1, 2, 3, 4}.
4. A CUDPP compaction of I with M results in I’ = {0, 3, 4}.
5. Now, if you run one thread for each element of I’, you can add up the addressed values from I’(i) to I’(i+1) without any write collisions; the values of I’ address the segment starts of B’ accordingly.

See the CUDPP documentation for the compact and sort primitives. I first saw this mechanism in the paper of Zhou et al. on their Data-Parallel Octrees (steps 3 to 5).

I hope this will help you.

Hi, Jens:

Thanks very much for addressing my problem in detail.

I will try to compare them; getting a card with compute capability 2.0 is also a plan.