Global memory access: how to access the same location sequentially from different threads

athlonshi · July 28, 2010, 6:08pm

Hi,

I have a line of code like the one below in a kernel function:

A[i] = A[i] + b;

“i” is the array index, which is calculated in each thread. The problem is that “i” can be the same in several threads, so when the kernel executes, several threads may access the same memory location at the same time.

Because the array A is large, I have to keep it in device global memory, and I also declared it “volatile”. But the results are not what I expected.

Is there something tricky that I am missing here? Thanks!

Yu

tera · July 28, 2010, 7:00pm

atomicAdd()

athlonshi · July 28, 2010, 7:52pm

Thanks, but my array is double precision and I am using a GTX 260, which has compute capability 1.3.

According to the CUDA Programming Guide, atomicAdd() does not support floating-point operands below compute capability 2.0.

Is there any other way?

Jens_2 · July 28, 2010, 9:55pm

Hi Athlonshi,

Have you seen this thread? They came up with a replacement for atomic add operations on floats. I can hardly say anything about its correctness because I didn’t try it myself, as it seems too expensive.
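For illustration, here is a minimal sketch of one such replacement for double precision. It is not taken from the linked thread and may differ from it. It assumes that 64-bit atomicCAS() on global memory is available, which holds from compute capability 1.2 onward and so covers the GTX 260. The names atomicAddDouble, accumulate and demoCollidingAdds are hypothetical.

```cuda
#include <cuda_runtime.h>

// Emulates a double-precision atomic add with a 64-bit compare-and-swap loop.
// The double is reinterpreted as a 64-bit integer so that atomicCAS can be used.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Store (assumed + val) only if nobody changed the location meanwhile.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // integer comparison, so a NaN cannot loop forever
    return __longlong_as_double(old);
}

// Hypothetical kernel: thread t adds b to A[idx[t]]; indices may collide.
__global__ void accumulate(double* A, const int* idx, double b, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) atomicAddDouble(&A[idx[t]], b);
}

// Hypothetical host helper: n threads all add b to A[0]; returns the final A[0].
double demoCollidingAdds(int n, double b)
{
    double* dA = 0;
    int* dIdx = 0;
    cudaMalloc((void**)&dA, sizeof(double));
    cudaMalloc((void**)&dIdx, n * sizeof(int));
    cudaMemset(dA, 0, sizeof(double));    // all-zero bits is 0.0
    cudaMemset(dIdx, 0, n * sizeof(int)); // every thread targets index 0

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    accumulate<<<blocks, threads>>>(dA, dIdx, b, n);

    double result = 0.0;
    cudaMemcpy(&result, dA, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dIdx);
    return result;
}
```

Under heavy contention every failed compare-and-swap is retried, so the cost grows with the number of colliding threads. Native atomicAdd() arrived for float with compute capability 2.0 and for double only with compute capability 6.0, so a loop like this is still needed for doubles on older cards.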

If threads only need to be synchronized within a warp (maybe you are lucky and this is true for you), the tagging mechanism in the Histogram example will work for you.

Otherwise, to avoid atomic operations completely, you might write the result of each thread to a temporary buffer B with the size of the thread grid. For example, suppose you have 5 threads and each thread writes its value b. Let’s say B is equal to {7,15,3,3,5}. Each thread has then written to its own designated location without memory collisions. Additionally, store the write address in a separate address buffer (called A here, not to be confused with your array): the value b=7 of thread 0 should be added to address ‘5’, the value b=15 from thread 1 should be added to address ‘1’, and so on, which results in A = {5,1,2,1,1}.

Sort both buffers by write address with the CUDPP radix sort, which gives the two sorted buffers A’ = {1,1,1,2,5} and B’ = {15,3,5,3,7}. Now you have to compact both buffers. To do so, compute a marker buffer M = {1,0,0,1,1} according to the following rule: if( A’(i) != A’(i-1) ) M(i) = 1 else M(i) = 0, with M(0) = 1. During this operation, write the current thread index into a separate buffer I = {0,1,2,3,4}. A CUDPP compaction of I with M results in I’ = {0,3,4}.

Now, if you run one thread for each element of I’, thread i can add up the values of B’ from position I’(i) up to, but not including, position I’(i+1) without any write collisions, and add the sum to the address A’(I’(i)). See the CUDPP documentation for the compact and sort primitives. I first saw this mechanism in the paper of Zhou et al. on Data-Parallel Octrees (steps 3 to 5).
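The sort-then-sum idea can be sketched as follows. This is an assumption-laden illustration, not the CUDPP-based version described above: it swaps in Thrust's stable_sort_by_key and reduce_by_key, and reduce_by_key stands in for the marker, compaction and per-segment summation steps. The names applySums and scatterAddSorted are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>

// One thread per unique address, so no two threads write the same element of A.
__global__ void applySums(double* A, const int* uniqueAddr, const double* sums, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) A[uniqueAddr[t]] += sums[t];
}

// Hypothetical helper: adds val[t] to A[addr[t]] for all t without atomics.
// A starts as sizeA zeros; the final contents of A are returned to the host.
std::vector<double> scatterAddSorted(const std::vector<int>& addrHost,
                                     const std::vector<double>& valHost,
                                     int sizeA)
{
    thrust::device_vector<int>    addr(addrHost.begin(), addrHost.end());
    thrust::device_vector<double> val(valHost.begin(), valHost.end());
    thrust::device_vector<double> A(sizeA, 0.0);

    // Sort the values by their write address (the A' and B' buffers of the post).
    thrust::stable_sort_by_key(addr.begin(), addr.end(), val.begin());

    // Sum each run of equal addresses; yields one (address, sum) pair per run.
    thrust::device_vector<int>    uniqueAddr(addr.size());
    thrust::device_vector<double> sums(addr.size());
    int n = static_cast<int>(
        thrust::reduce_by_key(addr.begin(), addr.end(), val.begin(),
                              uniqueAddr.begin(), sums.begin()).first
        - uniqueAddr.begin());

    if (n > 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        applySums<<<blocks, threads>>>(thrust::raw_pointer_cast(A.data()),
                                       thrust::raw_pointer_cast(uniqueAddr.data()),
                                       thrust::raw_pointer_cast(sums.data()), n);
        cudaDeviceSynchronize();
    }

    std::vector<double> result(sizeA);
    thrust::copy(A.begin(), A.end(), result.begin());
    return result;
}
```

With the example above (addresses {5,1,2,1,1}, values {7,15,3,3,5}, and an array of 6 zeros), address 1 receives 15+3+5 = 23, address 2 receives 3, and address 5 receives 7. The sort dominates the cost, so this tends to pay off only when collisions are frequent or atomics are unavailable.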

I hope this will help you.

athlonshi · July 29, 2010, 2:20pm

Hi, Jens:

Thanks very much for addressing my problem in detail.

I will try both and compare them. Getting a card with compute capability 2.0 is also an option.

Yu
