First off - I’m new to GPU/CUDA computing. I’ve spent a few weeks reading up and trying out a few ideas, but I could do with the collective advice of the community. I have a solution, but I’m sure I (we) could do better.

I have a 2D float array of data, say 1024 x 512. I need to perform many small (say 32-element) summations from this array, which represent curves in the array, and store the results in another array (intensity). A separate 1024 x 32 array represents the curves, i.e. it provides the column offsets into the data array. I perform the 1024 32-element sums, then move down the data array one row and repeat all the sums.
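To pin down the computation, here is a plain-C CPU reference of what I believe the sums amount to, based on the indexing in the kernel below. All names and the parameterisation are mine, not anything official; `pitch` is the data row width (16 elements wider than the 1024 outputs, as noted in the kernel):

```c
#include <stddef.h>

/* CPU reference of the intended computation (my reading of the kernel):
 *   intensity[row][c] = sum over e of data[row + e][curves[e][c]]
 * data is stored row-major with row pitch `pitch` (= n_curves + 16 here),
 * curves is stored as curves[e * n_curves + c]. */
void array_sums_ref(float *intensity, const float *data, const int *curves,
                    size_t n_rows, size_t n_curves, size_t n_elems, size_t pitch)
{
    for (size_t row = 0; row < n_rows; ++row)
        for (size_t c = 0; c < n_curves; ++c) {
            float acc = 0.0f;
            for (size_t e = 0; e < n_elems; ++e)
                acc += data[(row + e) * pitch + curves[e * n_curves + c]];
            intensity[row * n_curves + c] = acc;
        }
}
```

For the sizes above this would be called with `n_rows = 512`, `n_curves = 1024`, `n_elems = 32`, `pitch = 1040`.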

That is my attempt to describe it in words. Below is the simple kernel I have used to do this. For the example dimensions suggested above, my thread-block size is [1024, 1] and my grid size is [32, 512]. As you will see, I use atomic adds.

```cuda
__global__ void array_sums4(float *intensity, const float *data, const int *curves)
{
    int idx1 = threadIdx.x + blockDim.x * blockIdx.y;  // output index into intensity
    int idx2 = threadIdx.x + blockDim.x * blockIdx.x;  // index into curves

    // +16 as the data array is actually 16 elements wider per row
    int idx3 = curves[idx2] + (blockIdx.x + blockIdx.y) * (blockDim.x + 16);

    atomicAdd(&intensity[idx1], data[idx3]);
}
```
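One alternative I have been looking at (untested sketch below, dimensions hard-coded for the example sizes, names are my own) is to give each output element to a single thread, so the 32-element sum accumulates in a register and is written once, removing the atomics entirely:

```cuda
// Sketch only: one thread per intensity element; each thread loops over the
// 32 samples of its curve and writes a single result, so no atomicAdd.
// Launch with e.g. block [256, 1] and grid [1024 / 256, 512].
__global__ void array_sums_noatomic(float *intensity, const float *data,
                                    const int *curves)
{
    int c   = threadIdx.x + blockDim.x * blockIdx.x;   // curve index, 0..1023
    int row = blockIdx.y;                              // starting row, 0..511
    const int n_curves = 1024, pitch = 1024 + 16, n_elems = 32;

    float acc = 0.0f;
    for (int e = 0; e < n_elems; ++e)
        acc += data[(row + e) * pitch + curves[e * n_curves + c]];

    intensity[row * n_curves + c] = acc;  // one store per output, no contention
}
```

I have not profiled this against the atomic version, so I don't know how the reads behave cache-wise.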

Any suggestions on a better approach to this problem?