Hello everyone

I am extremely new to CUDA, and I am not the greatest programmer to say the least, but I am hoping someone out there can help me out for improving memory access. I am guessing this has already been discussed somewhere, but I am such a low-level programmer that I wouldn’t know what to call this problem to start searching:

Suppose I have an array of length N (we will call it “arr”), where N can be quite large. The values in this array can take on integers from 0,…p,…,P, where P is quite small (<10). Say I have some 2D array called “value” which is size N in length and P in width (2D is easiest for description, but this can be easily transformed into a 1D array of length N*P).

What I would normally do for unoptimized (and 2D array for simplification) is:

Sum of i from 1 to N ( value[ i ][ arr[ i ] ] )

I know that this sum can be done via reduction, but is there some sort of structure that I can use differently that would improve memory access? When this 2D array “value” is transformed into 1D, it is going to take on length N*P, and there are 2 different ways to set up this array. The first way is to break it down is to have P groups of length N, and the other is to have N groups of length P (I hope this makes sense). Is there a structure that is better to access memory-wise?

For example, for each “i” in that sum, I have to perform 2 calls to memory, 1 for p=arr[i] and another for value[i][p]. Right now, both of these are in global memory, and I am interested if anyone has a better/faster way of solving this.

I greatly (GREATLY) appreciate any insight you can offer!!!