Parallel += on argument array: can this work?

Hello all,

I am passing a large float array to the device, where the array is defined in global memory on the device. I would like to += certain randomly chosen values in the array in parallel, as follows:

__global__ void
myfunction(float* d_values)
{
    d_values[random_index_not_to_exceed_size] += value;
}

Where value is any value. My output is not the sum of all values, but rather one value and sometimes two. Am I kidding myself thinking the device will know that it needs to add all values going to some random index, or must it explicitly be done one after another? If there is already discussion of this issue in another thread or in the manual, a reference or an example would be appreciated.


Operations on global memory are not atomic unless you use the special atomic functions. See Appendix C in the programming guide for these functions. Note that only devices with compute capability 1.1 (everything but the 8800 GTX, 8800 GTS 320/640 MB, and Tesla) support atomic operations on global memory.
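As a rough sketch, the kernel from the original post could be rewritten with an atomic read-modify-write like this. The names (d_indices, d_contribs, n) are illustrative, not from the post, and note the caveat in the comments: atomicAdd on float only appeared on later hardware (compute capability 2.0); compute 1.1 devices only support integer atomics in global memory.

```cuda
// Sketch: each thread adds its contribution atomically, so colliding
// indices no longer lose updates. Names here are made up for illustration.
// Caveat: atomicAdd on float requires compute capability 2.0 or later;
// compute 1.1 devices only provide integer atomics in global memory.
__global__ void scatterAdd(float* d_values, const int* d_indices,
                           const float* d_contribs, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        atomicAdd(&d_values[d_indices[tid]], d_contribs[tid]);
}
```

Each colliding += is serialized by the hardware, which is exactly why heavy contention on a few indices can be slow.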

Sweet, so if I have the 8800 GTX and the Tesla, I should start thinking of another algorithm.

Yes. Atomic access to global memory also hinders parallelism, so a different algorithm is likely to be faster. (Assuming there is another algorithm…)

Also, general point: the __syncthreads() call in your example doesn’t do anything. __syncthreads() synchronizes all the threads within a block, which is needed if you wrote something to shared memory in one thread that another thread needs to use. It does not synchronize threads in different blocks. And there is no need to call __syncthreads() at the end of a kernel.
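For contrast, here is the kind of situation where __syncthreads() actually is needed: one thread writes to shared memory and another thread of the same block reads that location. The kernel and names below are illustrative, not from the original post, and it assumes a launch with blockDim.x == 256.

```cuda
// Illustrative only: __syncthreads() guards a shared-memory handoff
// between threads of the SAME block (it never synchronizes across blocks).
__global__ void reverseInBlock(float* d_data)
{
    __shared__ float tile[256];               // assumes blockDim.x == 256
    int i = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[i] = d_data[base + i];
    __syncthreads();                          // all writes to tile[] finish
                                              // before any thread reads it
    d_data[base + i] = tile[blockDim.x - 1 - i];
}
```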

Thanks for the help.

Also note that even with atomic operations, it is only guaranteed that all writes will happen. The order in which the writes happen is NOT deterministic. You might want to implement some kind of lock in global memory to enforce synchronization. At the end of it, what you will most likely find is that you are better off doing the computation on the CPU — with due respect to the GPU. There are lots of threads in this forum that talk about "block synchronization" with atomic operations. Please run a search if you are interested.

The bottom line is that you need to know whether a given computation is actually a good fit for the GPU.

For example: you could generate the random numbers on the CPU, sort them in order so that you can take advantage of global memory coalescing, and then pass both arrays to the GPU. The larger the array gets, the better the GPU's relative performance.
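A host-side sketch of that suggestion, assuming the hypothetical names h_indices/d_indices (none of these come from the thread): generate the random target indices on the CPU, sort them so neighbouring threads touch neighbouring memory, then copy them to the device.

```cuda
// Host-side sketch: generate and sort random indices before upload.
// All names are illustrative. Sorting improves coalescing of the index
// reads; colliding += targets still need atomics or another strategy.
#include <cstdlib>

static int cmpInt(const void* a, const void* b)
{
    return *(const int*)a - *(const int*)b;
}

void prepareIndices(int* h_indices, int n, int* d_indices)
{
    for (int i = 0; i < n; ++i)
        h_indices[i] = rand() % n;            // random target index in [0, n)
    qsort(h_indices, n, sizeof(int), cmpInt); // ascending order for locality
    cudaMemcpy(d_indices, h_indices, n * sizeof(int),
               cudaMemcpyHostToDevice);
}
```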

I recently ran some simple benchmarks — like transforming an array of floats (just multiplying by 2) — and I found that the time taken by the GPU did NOT increase linearly as the data set increased, as it would on the CPU. Moreover, to hide the driver overhead (cudaMalloc, cudaMemcpy, etc.), you should make sure that your data set is quite large.
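The benchmark kernel described above is about as simple as CUDA kernels get; a sketch might look like this (launch configuration and names are my own, not from the benchmark):

```cuda
// Multiply each element of a float array by 2 in place.
// Launched e.g. as scaleByTwo<<<(n + 255) / 256, 256>>>(d_data, n);
__global__ void scaleByTwo(float* d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against the last partial block
        d_data[i] *= 2.0f;
}
```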