Parallel += on argument array: can this work?

Hello all,

I am passing a large float array to the device, where the array is defined in global memory. I would like many threads to += certain values in the array at random indices, in parallel, along these lines:

__global__ void
myfunction(float* d_values)
{
    // value and random_index_not_to_exceed_size stand in for whatever
    // each thread computes.
    d_values[random_index_not_to_exceed_size] += value;
    __syncthreads();
}

Here value is an arbitrary float. My output is not the sum of all the values, but rather one value, and sometimes two. Am I kidding myself thinking the device will know that it needs to add up all the values going to the same random index, or must that explicitly be done one after another? If this issue is already discussed in another thread or in the manual, a reference or an example would be appreciated.

Thanks

Operations on global memory are not atomic unless you use the special atomic functions. See Appendix C in the programming guide for these functions. Note that only devices with compute capability 1.1 (everything but the 8800 GTX, 8800 GTS 320/640 MB, and Tesla) support atomic operations on global memory.
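For example, the += line rewritten with an atomic read-modify-write might look like this (a minimal sketch; the index and value names are the placeholders from the original post, and note that atomicAdd on floats only arrived with compute capability 2.0, while 1.1 devices support integer atomics only):

__global__ void
myfunction(float* d_values)
{
    // The read-modify-write happens as one indivisible operation,
    // so concurrent updates hitting the same index are not lost.
    atomicAdd(&d_values[random_index_not_to_exceed_size], value);
}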

Sweet, so if I have the 8800 GTX and the Tesla, I should start thinking about another algorithm.

Yes. Atomic access to global memory also hinders parallelism, so a different algorithm is likely to be faster. (Assuming there is another algorithm…)

Also, general point: the __syncthreads() call in your example doesn’t do anything. __syncthreads() synchronizes all the threads within a block, which is needed if you wrote something to shared memory in one thread that another thread needs to use. It does not synchronize threads in different blocks. And there is no need to call __syncthreads() at the end of a kernel.
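For contrast, here is a typical legitimate use (a hypothetical kernel, assuming the array length is a multiple of the block size): one thread writes to shared memory, __syncthreads() guarantees the write is visible, and then a different thread of the same block reads it.

#define BLOCK_SIZE 256

__global__ void
reverseWithinBlock(float* d_out, const float* d_in)
{
    __shared__ float s_tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s_tile[threadIdx.x] = d_in[i];   // each thread fills one slot
    __syncthreads();                 // wait until the whole tile is filled

    // Now it is safe to read an element written by a *different*
    // thread of the same block.
    d_out[i] = s_tile[blockDim.x - 1 - threadIdx.x];
}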

Thanks for the help.

Also note that even with atomic operations, it is only guaranteed that all writes will happen; the order in which the writes happen is NOT deterministic. You might want to implement some kind of lock in global memory to enforce synchronisation. At the end of it, what you will most likely find is that you are better off doing the computation on the CPU (with due respect to the GPU). There are lots of threads in this forum that talk about "block synchronization" with atomic operations; please run a search if you are interested.
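If you do go down the lock route, the usual sketch looks something like this (hypothetical helper names; beware that several threads of the same warp spinning on one lock can deadlock, so take the lock once per block, and everything inside the lock runs serialized):

__device__ int d_lock = 0;        // 0 = free, 1 = taken

__device__ void acquire(int* lock)
{
    // Spin until we atomically flip the lock from 0 to 1.
    while (atomicCAS(lock, 0, 1) != 0)
        ;
}

__device__ void release(int* lock)
{
    __threadfence();              // make protected writes visible first
    atomicExch(lock, 0);          // then mark the lock free
}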

The bottom line is that you need to work out which computations are actually best done on the GPU.

For example: you could generate the random numbers on the CPU, sort them so that you can take advantage of global memory coalescing, and then pass both arrays to the GPU, as sketched below. The bigger the array gets, the better the GPU's relative performance becomes.
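A rough host-side sketch of that idea (the names are made up and error checking is omitted; myupdatekernel stands in for whatever per-thread update kernel you would write):

#include <cstdlib>
#include <algorithm>
#include <cuda_runtime.h>

void launch_sorted_updates(float* d_values, int n_updates, int array_size)
{
    // Generate the random indices on the CPU...
    int* h_idx = (int*)malloc(n_updates * sizeof(int));
    for (int i = 0; i < n_updates; ++i)
        h_idx[i] = rand() % array_size;

    // ...sort them so that neighbouring threads touch neighbouring
    // addresses, which helps global-memory coalescing...
    std::sort(h_idx, h_idx + n_updates);

    // ...and ship them to the device alongside the value array.
    int* d_idx;
    cudaMalloc((void**)&d_idx, n_updates * sizeof(int));
    cudaMemcpy(d_idx, h_idx, n_updates * sizeof(int), cudaMemcpyHostToDevice);

    // Hypothetical kernel applying one update per thread:
    // myupdatekernel<<<grid, block>>>(d_values, d_idx, n_updates);

    cudaFree(d_idx);
    free(h_idx);
}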

I recently ran some simple benchmarks, like transforming an array of floats (just multiplying by 2), and I found that the time taken by the GPU did NOT increase linearly as the data set grew, as it would with a CPU. Moreover, to hide the overhead of the driver calls (cudaMalloc, cudaMemcpy, etc.), you should make sure that your data set is quite large.
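For reference, the transform in that benchmark was nothing more than a kernel along these lines (a sketch; the grid is assumed to cover the whole array):

__global__ void
scaleByTwo(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= 2.0f;   // one element per thread, fully parallel
}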