Better/Faster way to access memory for problem

Hello everyone
I am extremely new to CUDA, and I am not the greatest programmer to say the least, but I am hoping someone out there can help me improve my memory access pattern. I am guessing this has already been discussed somewhere, but I am enough of a novice that I wouldn’t know what to call this problem to start searching:

Suppose I have an array of length N (we will call it “arr”), where N can be quite large. The values in this array are integers 0, 1, …, P−1, where P is quite small (P < 10). I also have a 2D array called “value” of size N rows by P columns (2D is easiest for description, but it can easily be flattened into a 1D array of length N*P).

What I would normally do, unoptimized (and with the 2D array for simplicity), is:
sum over i = 0 .. N−1 of value[ i ][ arr[ i ] ]

I know that this sum can be done via reduction, but is there some structure I could use that would improve the memory accesses? When the 2D array “value” is flattened to 1D, it has length N*P, and there are two ways to lay it out: P groups of length N, or N groups of length P (I hope this makes sense). Is one of these layouts better memory-wise?

For example, for each “i” in that sum I have to perform two memory reads: one for p = arr[ i ] and another for value[ i ][ p ]. Right now both of these are in global memory, and I am interested in whether anyone has a better/faster way of doing this.

I greatly (GREATLY) appreciate any insight you can offer!!!

Unfortunately, since the problem guarantees that the memory accesses to value have little locality, there is not much you can do about it, unless you have additional information about the problem that would allow the data in value to be rearranged in a way that provides more locality of access.

Many thanks for your time and help!