Is there a CUDA function call or an nvcc intrinsic that forces a modification to global memory on the device to become visible to the other threads?
The situation of my program is as follows:
Many threads cooperate to generate the cells of a 2D array; every thread adds a fraction to the final value of a cell.
The threads may go through different execution paths to generate the fraction for the same cell, say [100, 120], and some threads may skip contributing to certain cells altogether.
Doing the reduction in shared memory requires calling __syncthreads() between steps. But because of the different execution paths, the reduction is not trivial: after a sync point, thread 1 may have computed its contribution to cell [10, 20] while thread 2 has computed its contribution to cell [15, 22].
If updates to global memory were instantly visible to all other threads, correctness could be preserved, even if performance suffers.
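To make the access pattern concrete, here is a minimal sketch of the kind of kernel I mean. Everything specific in it is an assumption made up for illustration: the 128x128 array size, the contribution() rule, the fraction values, and the function names. It uses atomicAdd on float as one possible way to make each update a single indivisible read-modify-write on global memory; I am not sure it is the best fit for my real code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int ROWS = 128;  // assumed array size, for illustration only
constexpr int COLS = 128;

// Hypothetical rule deciding what a thread contributes (made up for this sketch).
// Returns false when the thread skips; otherwise fills in the flat cell index
// and the fraction to add.
__host__ __device__ inline bool contribution(int tid, int *cell, float *fraction)
{
    if (tid % 3 == 0) return false;        // this thread contributes nothing
    if (tid % 2 == 0) {                    // path A: many threads hit cell [100, 120]
        *cell = 100 * COLS + 120;
        *fraction = 0.5f;
    } else {                               // path B: contributions scattered over cells
        *cell = (tid % ROWS) * COLS + (tid * 7) % COLS;
        *fraction = 0.25f;
    }
    return true;
}

__global__ void accumulate(float *cells, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;

    int cell;
    float fraction;
    if (contribution(tid, &cell, &fraction))
        atomicAdd(&cells[cell], fraction); // read-modify-write as one indivisible step
}

// Runs the kernel and compares the result with a sequential sum on the host.
// The fractions are exact in binary, so the comparison does not depend on
// the order in which the additions happen.
bool accumulate_matches_expected()
{
    const int nThreads = 4096;
    const size_t bytes = ROWS * COLS * sizeof(float);
    std::vector<float> expected(ROWS * COLS, 0.0f), result(ROWS * COLS);

    for (int tid = 0; tid < nThreads; ++tid) {
        int cell;
        float fraction;
        if (contribution(tid, &cell, &fraction)) expected[cell] += fraction;
    }

    float *dCells = nullptr;
    if (cudaMalloc(&dCells, bytes) != cudaSuccess) return false;
    cudaMemset(dCells, 0, bytes);
    accumulate<<<(nThreads + 255) / 256, 256>>>(dCells, nThreads);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(result.data(), dCells, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dCells);
    return ok && result == expected;
}

int main()
{
    std::printf("%s\n", accumulate_matches_expected() ? "match" : "mismatch");
    return 0;
}
```

This should build with nvcc as a single .cu file. In the sketch no thread reads a cell until the kernel has finished, so it only covers the accumulation itself, not threads consuming each other's partial results mid-kernel.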
Thank you for any help.