I am a novice CUDA programmer and am trying to understand the best way to handle situations where different threads need access to overlapping global memory. Specifically, I have large arrays z[k], k = 1…K, and x[j], j = 1…J. One set of operations I want to parallelize is
int xf = (int)floor(x[j]); float w = x[j] - xf; p[j] = (1-w)*z[xf] + w*z[xf+1]; // result
This is simply a collection of linear interpolations of z at the locations x[j] (with the fractional part w weighting the upper neighbor z[xf+1]). However, I am also interested in the transpose (scatter) operation
int xf = (int)floor(x[j]); float w = x[j] - xf; z[xf] += (1-w)*p[j]; z[xf+1] += w*p[j]; // result
So clearly, if each j is processed by its own thread, then whenever |x[j0]-x[j1]| < 1 the threads j0 and j1 touch a common global memory location z[xf]. For the interpolation this only means concurrent reads, which are harmless; but in the transpose the two threads perform conflicting read-modify-write updates on that location, which is a race condition.
I know that one strategy for optimizing overlapping read/write operations is to group threads requiring common data into blocks and read the common data once into shared memory. However, in my case the x[j] are neither pre-sorted nor regularly spaced, so there doesn’t appear to be a naturally parallel way for blocks to find the x[j] that belong to them.
Instead, I was wondering whether it would be worthwhile to pre-group the x[j] into subsets such that all x[j] within a subset are at least 2 apart, and then launch the kernel on one subset at a time. Restricted to such a subset, no two threads can read from or write to overlapping locations. However, this pre-processing would add significant coding complexity, so before undertaking it I wanted to know whether I can expect it to pay off.
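The pre-grouping I have in mind could be sketched on the host (serially; the function name is mine, and it assumes x has already been sorted, which is itself part of the preprocessing cost) as a greedy pass: repeatedly sweep the sorted values, pulling into the current subset every value at least 2 greater than the last one taken:

```c
#include <stdlib.h>

/* Greedy grouping (host side): assign each element of the ascending-sorted
   array x_sorted a subset id such that, within a subset, consecutive chosen
   values differ by >= 2 (so their {xf, xf+1} index pairs cannot overlap).
   Writes the subset id of element i into group[i]; returns the number of
   subsets created. */
int group_by_min_gap(const float *x_sorted, int n, int *group) {
    for (int i = 0; i < n; i++) group[i] = -1;   /* -1 = not yet assigned */
    int ngroups = 0, remaining = n;
    while (remaining > 0) {
        float last = -1e30f;                     /* effectively -infinity */
        for (int i = 0; i < n; i++) {
            if (group[i] == -1 && x_sorted[i] - last >= 2.0f) {
                group[i] = ngroups;
                last = x_sorted[i];
                remaining--;
            }
        }
        ngroups++;
    }
    return ngroups;
}
```

Each sweep assigns at least one element, so this terminates; the number of kernel launches then equals the number of subsets, which is what worries me, since densely clustered x[j] could produce many small subsets and correspondingly many low-occupancy launches.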