Strided memory access help

I am implementing a delay-and-sum beamformer in CUDA and taking a large performance hit when I read out the delayed values, because the memory accesses are not coalesced: beam one wants sample 8, beam two wants sample 15, beam three wants sample 31 of channel one, then something similar happens for channel two, and so on. Suddenly I have barely any parallelism, the beamformer grinds to a halt, and it can't run in real time. The amount of data I have for each channel exceeds 1024 samples, and the maximum delays for the beams can also exceed 1024.

Are there any sources of information that show an efficient way to deal with this kind of heavily strided memory access in CUDA?

There is no magic bullet in CUDA to speed up scattered access to global memory. The usual suggestion is to rearrange your data storage scheme so that adjacent threads access adjacent locations in memory. Barring that, some benefit may be gained through careful use of the caches. (For the purposes of this discussion, the word "random" can be read equivalently as "structured, but scattered".)
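One common way to apply that idea to delay-and-sum is to stage each channel's samples in shared memory with a cooperative, coalesced load, and then do the per-beam scattered reads out of shared memory, where there is no coalescing penalty (only potential bank conflicts). Here is a hedged sketch, not your code: it assumes a channel-major sample layout (`samples[c * nSamples + s]`), integer delays in `delays[b * nChannels + c]`, one thread per beam, and that the spread of delays across the beams of one block fits inside a single shared-memory tile starting at `t0` (with your >1024 delays you would need to tile or window this):

```cuda
#define TILE 2048   // shared-memory window, in samples (assumption)

__global__ void delaySum(const float *samples,  // [nChannels * nSamples], channel-major
                         const int   *delays,   // [nBeams * nChannels]
                         float       *out,      // [nBeams]
                         int nChannels, int nSamples, int nBeams, int t0)
{
    __shared__ float tile[TILE];
    int beam = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per beam
    bool active = beam < nBeams;
    float acc = 0.0f;

    for (int c = 0; c < nChannels; ++c) {
        // Cooperative, coalesced load of a window of channel c into shared memory:
        // adjacent threads read adjacent global addresses.
        const float *chan = samples + c * nSamples;
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            tile[i] = chan[t0 + i];         // assumes t0 + TILE <= nSamples
        __syncthreads();

        if (active) {
            // Scattered read now hits shared memory instead of global memory.
            int d = delays[beam * nChannels + c];
            acc += tile[d];                 // assumes 0 <= d < TILE
        }
        __syncthreads();                    // before the next channel overwrites tile
    }
    if (active)
        out[beam] = acc;
}
```

The key point is that every global-memory transaction in this kernel is coalesced; the irregular, delay-dependent indexing has been moved onto shared memory, which tolerates it far better.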

When all of the data must be accessed anyway, with some temporal locality, and it is merely the alignment of data to threads that presents issues (not entirely evident from your description), a library like trove might be useful.
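For illustration, trove's documented usage pattern looks roughly like the following sketch (assumptions: the trove headers are on the include path, and the element type is a trivially copyable struct; the `Sample` struct and `shift` kernel here are hypothetical). `trove::coalesced_ptr` transposes the warp's accesses so that loads and stores stay coalesced even though each thread consumes a whole structure:

```cuda
#include <trove/ptr.h>

struct Sample { float re, im, weight; };   // hypothetical array-of-structures element

__global__ void shift(trove::coalesced_ptr<Sample> in,
                      trove::coalesced_ptr<Sample> out, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        Sample s = in[i];     // coalesced load of a full struct per thread
        s.re *= s.weight;     // arbitrary per-element work
        s.im *= s.weight;
        out[i] = s;           // coalesced store
    }
}
```

Note this helps with array-of-structures alignment problems, not with data-dependent indexing like your per-beam delays.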