Simple problem, but how to do it fast? Suggestions welcome

First off, I’m new to GPU/CUDA computing. I’ve spent a few weeks reading up and trying out a few ideas, but I could do with the collective advice of the community. I have a solution, but I’m sure I (we) could do better.

I have a 2D float array of data, say 1024 x 512. I need to perform many small (say 32-element) summations over this array, each of which follows a curve through the array, and store the results in another array (intensity). A separate array, 1024 x 32, represents the curves by providing the column offsets into the data array. I perform the 1024 32-element sums, then move down the data array by 1 sample and repeat all the sums.
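To make the description concrete, here is a plain host-side reference loop. It is only a sketch: the function name, the parameter names and the tiny test values are made up for illustration. It assumes that point k of a curve lies k rows below the starting sample, and that the curves array is stored point-major (all offsets for point 0, then all offsets for point 1, and so on).

```cuda
#include <cstdio>
#include <vector>

// Hypothetical layout (assumptions, not confirmed by the post):
//   data      : row-major, row_stride floats per row, n_samples + n_points - 1 rows
//   curves    : curves[k * n_curves + c] = column offset of point k on curve c
//   intensity : intensity[s * n_curves + c] = sum along curve c starting at sample s
void curve_sums_reference(std::vector<float>& intensity,
                          const std::vector<float>& data,
                          const std::vector<int>& curves,
                          int n_curves, int n_points, int n_samples, int row_stride)
{
    for (int s = 0; s < n_samples; ++s) {
        for (int c = 0; c < n_curves; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < n_points; ++k)
                sum += data[(s + k) * row_stride + curves[k * n_curves + c]];
            intensity[s * n_curves + c] = sum;
        }
    }
}

int main()
{
    // Tiny example: 2 curves, 3 points per curve, 2 samples, 4 floats per row.
    const int n_curves = 2, n_points = 3, n_samples = 2, row_stride = 4;
    std::vector<float> data(16);
    for (int i = 0; i < 16; ++i) data[i] = static_cast<float>(i);
    // Curve 0 stays in column 0; curve 1 moves through columns 1, 2, 3.
    std::vector<int> curves = {0, 1, 0, 2, 0, 3};
    std::vector<float> intensity(n_samples * n_curves);

    curve_sums_reference(intensity, data, curves, n_curves, n_points, n_samples, row_stride);
    std::printf("%g %g %g %g\n", intensity[0], intensity[1], intensity[2], intensity[3]);
    // prints "12 18 24 30"
    return 0;
}
```

With the real sizes this would be called with n_curves = 1024, n_points = 32, n_samples = 512 and row_stride = 1024 + 16.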

That is my attempt to describe it in words. Below is the simple kernel I have used to do this. For the example dimensions suggested above, my threadblock size is [1024,1] and my gridsize is [32, 512]. As you will see, I use atomic adds.

__global__ void array_sums4(float *intensity, const float *data, const int *curves)
int idx1 = threadIdx.x + blockDim.x*blockIdx.y;
int idx2 = threadIdx.x + blockDim.x
int idx3 = curves[idx2] + (blockIdx.x + blockIdx.y)*(blockDim.x+16); //+16 as data array is actually 16 elements longer

Any suggestions on a better approach to this problem?

Use shared memory to perform atomic add inside a block, and then do the atomic add to the global memory.
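A minimal sketch of that idea follows. It is not the original kernel: the kernel name, the tile sizes, the 3D grid and the test data are all assumptions made for illustration. It also assumes the layout that the question seems to describe: curves stored point-major, point k of a curve lying k rows below the starting sample, and data rows that are 1024 + 16 floats long. Each block covers 64 curves and 8 curve points for one sample, accumulates with atomicAdd in shared memory, and then issues one global atomicAdd per curve.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE_C = 64;  // curves per block (assumed tile size)
constexpr int TILE_K = 8;   // curve points per block (assumed tile size)

// intensity must be zeroed before launch, because blocks add into it.
__global__ void curve_sums_shared(float* intensity, const float* data, const int* curves,
                                  int n_curves, int n_points, int row_stride)
{
    __shared__ float partial[TILE_C];
    const int c = blockIdx.x * TILE_C + threadIdx.x;  // curve index
    const int k = blockIdx.y * TILE_K + threadIdx.y;  // point along the curve
    const int s = blockIdx.z;                         // starting sample (row)

    if (threadIdx.y == 0) partial[threadIdx.x] = 0.0f;
    __syncthreads();

    // Step 1: atomic adds within the block, in shared memory.
    if (c < n_curves && k < n_points)
        atomicAdd(&partial[threadIdx.x],
                  data[(s + k) * row_stride + curves[k * n_curves + c]]);
    __syncthreads();

    // Step 2: one global atomic add per curve per block.
    if (threadIdx.y == 0 && c < n_curves)
        atomicAdd(&intensity[s * n_curves + c], partial[threadIdx.x]);
}

// Runs the kernel on synthetic data and compares with a host loop.
// Returns the number of mismatching outputs, or -1 on a CUDA error.
int count_mismatches()
{
    const int n_curves = 1024, n_points = 32, n_samples = 512, row_stride = 1024 + 16;
    const int n_rows = n_samples + n_points - 1;

    // Small integer values keep the float sums exact in any summation order.
    std::vector<float> data(static_cast<size_t>(n_rows) * row_stride);
    for (size_t i = 0; i < data.size(); ++i) data[i] = static_cast<float>(i % 7);
    std::vector<int> curves(static_cast<size_t>(n_points) * n_curves);
    for (int k = 0; k < n_points; ++k)
        for (int c = 0; c < n_curves; ++c)
            curves[k * n_curves + c] = c + (k % 16);  // stays below row_stride

    const size_t out_count = static_cast<size_t>(n_samples) * n_curves;
    float *d_intensity = nullptr, *d_data = nullptr;
    int* d_curves = nullptr;
    cudaMalloc(&d_intensity, out_count * sizeof(float));
    cudaMalloc(&d_data, data.size() * sizeof(float));
    cudaMalloc(&d_curves, curves.size() * sizeof(int));
    cudaMemcpy(d_data, data.data(), data.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_curves, curves.data(), curves.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_intensity, 0, out_count * sizeof(float));

    dim3 block(TILE_C, TILE_K);
    dim3 grid((n_curves + TILE_C - 1) / TILE_C, (n_points + TILE_K - 1) / TILE_K, n_samples);
    curve_sums_shared<<<grid, block>>>(d_intensity, d_data, d_curves,
                                       n_curves, n_points, row_stride);

    std::vector<float> result(out_count);
    const cudaError_t err = cudaMemcpy(result.data(), d_intensity,
                                       out_count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_intensity);
    cudaFree(d_data);
    cudaFree(d_curves);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int s = 0; s < n_samples; ++s)
        for (int c = 0; c < n_curves; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < n_points; ++k)
                sum += data[(s + k) * row_stride + curves[k * n_curves + c]];
            if (sum != result[static_cast<size_t>(s) * n_curves + c]) ++mismatches;
        }
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

Note that if a block covered all 32 points of a curve (TILE_K equal to the curve length), the final global atomicAdd could become a plain store. Going one step further, a single thread per output element could loop over the 32 points in a register and would need no atomics at all, which may well be faster for sums this short.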