Fastest way to access precomputable arrays?

I’m writing a kernel which operates on input data via, among other things, a four-level nested for loop:

for (j = 0; j <= P; j++) {
    for (k = -j; k <= j; k++) {
        for (n = 0; n <= j; n++) {
            for (m = -n; m <= n; m++) {
                ...
            }
        }
    }
}

Within the loop nest, I need to access a 2D array, A, as a function of j, k, m, and n; its values do not depend on which thread/block is executing. All the elements of A can be precomputed, but what’s the fastest way to access them? After all, I have to do so about P^4 times. Practical values of P are between 5 and 20, and the size of A is O(P^2).

  1. I could precompute the values of A at the beginning of each kernel call.

1.a. I could have each thread compute one element of A into shared memory, provided the number of threads per block is P^2 or more.

  2. I could precompute A on the CPU and pass it as one of the data inputs to the kernel — but would reading it then count as an access to global memory?

2.a. I could copy the data passed to the kernel into memory shared between all threads of a block.

2.b. I could copy it into per-thread local memory.
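To illustrate option 2.a, here is a minimal sketch (names and sizes are placeholders, not from the original post) of a kernel that stages a global-memory copy of A into shared memory once per block before the loops run:

```cuda
// ASIZE is the flattened element count of A, assumed known at
// compile time; 441 = (P+1)*(P+1) for P = 20 is just a placeholder.
#define ASIZE 441

__global__ void kernelWithSharedA(const float* gA, float* out)
{
    __shared__ float sA[ASIZE];

    // Each thread copies a strided subset of A, so this works for
    // any block size, not just blockDim.x >= ASIZE.
    for (int i = threadIdx.x; i < ASIZE; i += blockDim.x)
        sA[i] = gA[i];
    __syncthreads();   // all of sA must be populated before use

    // ... the four nested loops would read sA[...] here ...
}
```

The global-memory read cost is then paid once per block rather than once per access inside the P^4 loop body.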

Which of the above might be the best option, or something else?

Hmm… I would precompute this on the host CPU and bind it to a texture. Then the texture can stay unchanged over lots of kernel calls.
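Something along these lines, using the texture-reference API — a sketch only, with placeholder names and dimensions:

```cuda
// Host-side setup: precompute A, copy it into a cudaArray, and bind
// that to a 2D texture reference once. The binding then persists
// across kernel launches.
texture<float, 2, cudaReadModeElementType> texA;

void setupTexture(const float* h_A, int width, int height)
{
    cudaArray* d_arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&d_arr, &desc, width, height);
    cudaMemcpyToArray(d_arr, 0, 0, h_A,
                      width * height * sizeof(float),
                      cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texA, d_arr, desc);
}

// Inside a kernel, elements are then read with tex2D:
//   float a = tex2D(texA, col + 0.5f, row + 0.5f);
```

Texture reads go through the texture cache, so repeated reads of the same small table by many threads are cheap.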

Does the A array consist of integer or floating point values?

This seems like it may be a job for constant memory.

If your four for loops are in the kernel itself, then constant memory would definitely be a good option. It works best when all threads access the same elements in constant memory at the same time.

Okay, I’ll have a look at using constant memory.

A contains floats for now, but will switch to doubles when the final version gets run on a GTX 260.

There’s a code example in the programming guide:

__constant__ float constData[256];

float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));

Do I need to do anything different if constData is a 2D array of floats/doubles?
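As far as I know, nothing different is needed: cudaMemcpyToSymbol copies raw bytes, and sizeof on a 2D array covers the whole thing. A minimal sketch (the 16x16 dimensions are placeholders):

```cuda
// A 2D constant array is declared and filled the same way as a 1D one;
// cudaMemcpyToSymbol just copies sizeof(data2D) bytes to the symbol.
__constant__ float constData2D[16][16];

float data2D[16][16];   // filled on the host with the precomputed A
cudaMemcpyToSymbol(constData2D, data2D, sizeof(data2D));

// In the kernel it is indexed like any 2D array:
//   float a = constData2D[row][col];
```

For doubles, the same pattern applies — just change the element type on both the host and device arrays.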