Is it helpful to process data in sequential non-overlapping groups?

I am very much a novice CUDA programmer and am trying to understand the best way to deal with situations where different threads need access to overlapping global memory. Specifically, I have large arrays z[k], k=1,…,K and x[j], j=1,…,J. One set of operations that I want to parallelize is

xf = floor(x[j]);

p[j]=w*z[xf]+(1-w)*z[xf+1];  //result

This is simply a collection of interpolations of z at different locations x[j]. However, I am also interested in the transpose operation

xf = floor(x[j]);

z[xf]   += w*p[j];
z[xf+1] += (1-w)*p[j];

So clearly, if each j is processed by its own thread, then whenever |x[j0]-x[j1]| < 1 (and possibly whenever it is < 2, depending on where the values fall relative to the integer grid), the threads for j0 and j1 will be trying to read from and write to a common global memory location z[xf] or z[xf+1].
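For concreteness, here is a minimal sketch of both operations as kernels. The weight convention w = 1 - (x[j] - xf) (the usual linear-interpolation weight) and all names are my assumptions, not a definitive implementation:

```cuda
#include <math.h>
#include <cuda_runtime.h>

// Forward (gather): each thread j reads two neighboring entries of z.
// Overlapping reads between threads are safe.
__global__ void interp_forward(const float *z, const float *x,
                               float *p, int J)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= J) return;
    int xf = (int)floorf(x[j]);
    float w = 1.0f - (x[j] - xf);     // assumed weight convention
    p[j] = w * z[xf] + (1.0f - w) * z[xf + 1];
}

// Transpose (scatter): each thread j writes two neighboring entries of z.
// These plain += updates race whenever two threads share an xf.
__global__ void interp_transpose(float *z, const float *x,
                                 const float *p, int J)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= J) return;
    int xf = (int)floorf(x[j]);
    float w = 1.0f - (x[j] - xf);
    z[xf]     += w * p[j];            // unsafe read-modify-write
    z[xf + 1] += (1.0f - w) * p[j];   // unsafe read-modify-write
}
```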

I know that one strategy for optimizing overlapping read/write operations is to group threads that require common data into blocks and to read the common data once into shared memory. However, in my case the x[j] are not pre-sorted or regularly spaced in any particular way, so there doesn’t appear to be a naturally parallel way for blocks to find the x[j] that belong to them.

Instead, I was wondering whether it would be worthwhile to pre-group the x[j] into subsets such that all x[j] in a subset are at least 2 units apart, and then launch the kernel on one subset at a time. Within such a subset, it is guaranteed that no two threads will read from or write to overlapping locations. However, this pre-processing would add significant coding complexity, so before I undertake it, I wanted to know whether I can expect it to be worthwhile.
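One way to sketch that pre-grouping is a greedy pass over the sorted x[j] (host-side; all names are hypothetical, and whether the extra passes pay off is exactly the open question):

```cuda
#include <algorithm>
#include <vector>

// Greedy pre-grouping sketch: sort the x[j] and assign each one to the
// first subset whose current maximum is at least 2 away.  Within each
// returned subset all x values are >= 2 apart, so the corresponding
// z[xf], z[xf+1] accesses never overlap between threads.
std::vector<std::vector<int>> group_indices(const std::vector<float> &x)
{
    std::vector<int> order(x.size());
    for (size_t j = 0; j < x.size(); ++j) order[j] = (int)j;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return x[a] < x[b]; });

    std::vector<std::vector<int>> groups;
    std::vector<float> last;               // largest x placed in each group
    for (int j : order) {
        size_t g = 0;
        while (g < groups.size() && x[j] - last[g] < 2.0f) ++g;
        if (g == groups.size()) { groups.push_back({}); last.push_back(0.0f); }
        groups[g].push_back(j);
        last[g] = x[j];
    }
    return groups;                         // launch one kernel per group
}
```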

Assuming single precision arithmetic is acceptable, you can also handle this situation using atomic addition. That might not perform as well as the grouping technique you describe, but it is very easy to code up.
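A sketch of such an atomic scatter kernel, assuming the weight w = 1 - (x[j] - xf) (your convention may differ; the kernel name is hypothetical):

```cuda
#include <math.h>
#include <cuda_runtime.h>

// Race-free transpose using single-precision atomicAdd, which is
// hardware-supported on global memory for compute capability >= 2.0.
__global__ void interp_transpose_atomic(float *z, const float *x,
                                        const float *p, int J)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= J) return;
    int xf = (int)floorf(x[j]);
    float w = 1.0f - (x[j] - xf);          // assumed weight convention
    atomicAdd(&z[xf],     w * p[j]);       // safe under collisions
    atomicAdd(&z[xf + 1], (1.0f - w) * p[j]);
}
```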

Your code and problem look very similar to some CUDA implementations of discrete-probability dynamic programming problems. I posted at least 5 to GitHub.

I did indeed use atomics, but the hack version for 64-bit doubles.
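That "hack" is the standard CAS-based workaround from the CUDA C Programming Guide for devices without a native double-precision atomicAdd; shown here with a small hypothetical demo kernel:

```cuda
#include <cuda_runtime.h>

// CAS-based double-precision atomic add (the CUDA C Programming Guide
// workaround for devices lacking native double atomicAdd).
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry if another thread intervened
    return __longlong_as_double(old);
}

// Demo kernel: accumulate n values into *sum using the workaround.
__global__ void sum_kernel(double *sum, const double *vals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAddDouble(sum, vals[i]);
}
```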

Note: that code was optimized for the Tesla line and uses double precision. If you are on a gaming GPU, cast to floats. Also, those are fairly basic implementations, and I have more optimized code, but they will serve as a good starting point.

Note that if you go with the atomics approach, you will definitely want a compute capability 3.x card. The performance of atomic operations was significantly improved (7x or more) in that architecture.

Thanks for the responses. I assume atomic additions were suggested because, in the transpose interpolation operation, multiple threads write to the same memory location, and this is a thread-safety issue? That is good to know.

However, my question was mainly about speed optimization. Take the forward interpolation operation, where, as far as I can see, every access to z is a read and there is no thread-safety issue. Would the code run faster if multiple threads were not trying to read from the same locations? Could breaking the forward interpolation into separate kernel launches with non-overlapping memory accesses be faster than processing the entire data set in a single run? The total number of memory accesses is the same in either scheme, so I’m wondering whether I can expect a speed impact.