Making changes to Global Memory visible

Is there a CUDA function call or an nvcc intrinsic that forces a modification to global memory on the device to become visible to the other threads?

The situation of my program is as follows:

Many threads cooperate to generate the cells of a 2D array; every thread adds a fraction to the final value of a cell.
The threads may take different execution paths to generate their fractions for the same cell, say [100, 120], and some threads may skip contributing to certain cells altogether.

Doing the reduction in shared memory requires __syncthreads() between steps. But because of the different execution paths, the reduction is not trivial: after a sync point, thread 1 may have computed its contribution to cell [10, 20] while thread 2 has computed its contribution to cell [15, 22].

If updates to global memory were instantly visible to all other threads, correctness could be preserved, even if performance suffers.

Thank you for any help

I don’t think this will work. Do you want multiple threads to add their computed values to (potentially) the same memory space? Global memory updates alone won’t help you with this.

Performance degradation aside, if global memory updates are visible to all threads, then retrieving the current value from memory, updating it, and storing it back will preserve correctness, assuming the read/modify/write can be done as one atomic operation.

Usually, when many threads collaborate on a single computation, a reduction is used, but the divergence in the code makes a reduction non-trivial.

Atomic operations are indeed supported in CUDA on cards of Compute Capability 1.1 and newer. Look up __threadfence() in the new programming guide (Appendix B.5).
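
To make the atomic approach concrete, here is a minimal sketch of divergent threads accumulating into cells in global memory with atomicAdd(). The kernel name, the helper runAccumulate(), the skip rule and the cell mapping are all made up for illustration; they are not from the original poster's program. It also assumes float contributions, which need Compute Capability 2.0 or newer for atomicAdd() on floats (the 1.1 atomics cover 32-bit integers). Note that __threadfence() only orders a thread's writes as seen by other threads; it does not by itself make a read/modify/write indivisible, which is what atomicAdd() is for.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread may add a contribution to one cell of a
// flattened 2D array in global memory. Threads diverge, and some skip.
__global__ void accumulate(float *cells, int numCells, int numThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;

    // Illustrative divergence: every third thread contributes nothing.
    if (tid % 3 == 0) return;

    // Illustrative mapping: many threads land on the same cell.
    int cell = (tid * 7) % numCells;

    // The read/modify/write happens as one indivisible operation,
    // so concurrent updates to the same cell are not lost.
    atomicAdd(&cells[cell], 1.0f);
}

// Hypothetical host helper: runs the kernel and returns the sum over all
// cells, or -1.0f if a CUDA call fails.
float runAccumulate(int numCells, int numThreads)
{
    float *dCells = nullptr;
    size_t bytes = numCells * sizeof(float);
    if (cudaMalloc(&dCells, bytes) != cudaSuccess) return -1.0f;
    if (cudaMemset(dCells, 0, bytes) != cudaSuccess) {
        cudaFree(dCells);
        return -1.0f;
    }

    int block = 128;
    int grid = (numThreads + block - 1) / block;
    accumulate<<<grid, block>>>(dCells, numCells, numThreads);

    // The copy back runs after the kernel has finished on the default stream.
    std::vector<float> hCells(numCells);
    cudaError_t err = cudaMemcpy(hCells.data(), dCells, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(dCells);
    if (err != cudaSuccess) return -1.0f;

    float total = 0.0f;
    for (float v : hCells) total += v;
    return total;
}
```

Calling runAccumulate(64, 1000) launches 1000 threads, of which 334 skip (tid divisible by 3), so the returned total should be 666.0f on a working device. With a plain `cells[cell] += 1.0f` instead of atomicAdd(), some of those updates could be lost whenever two threads hit the same cell at the same time.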