I’m writing a kernel which operates on input data via, among other things, a four-level nested for loop:
for (j = 0; j <= P; j++) {
    for (k = -j; k <= j; k++) {
        for (n = 0; n <= j; n++) {
            for (m = -n; m <= n; m++) {
                ...
            }
        }
    }
}
Within the innermost loop, I need to read from a 2D array, A. Which element I read is a function of j, k, m, and n only; it does not depend on which thread or block is executing. All the elements of A can be precomputed, but what’s the fastest way to access them? After all, I have to do so on the order of P^4 times per thread. Practical values of P are between 5 and 20. The size of A is O(P^2).
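For reference, the exact number of times the innermost body runs follows directly from the loop bounds. For a given j there are 2j+1 values of k, and the n and m loops together contribute 1 + 3 + ... + (2j+1) = (j+1)^2 iterations, so:

$$
N(P) \;=\; \sum_{j=0}^{P} (2j+1)(j+1)^2 \;=\; \frac{(P+1)^2 (P+2)^2}{2} \;-\; \frac{(P+1)(P+2)(2P+3)}{6}
$$

This gives N(5) = 791 and N(20) = 103411, so the count grows like P^4/2 for large P.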
1. I could precompute the values of A at the beginning of each kernel call.
1.a. I could have each thread compute one element of A and store it in shared memory, provided the number of threads per block is P^2 or more.
2. I could precompute A on the CPU and pass it as one of the data inputs to the kernel, but would reading it then count as an access to global memory?
2.a. I could copy that input into shared memory, which is visible to all threads of a block.
2.b. I could copy it into each thread's local memory.
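To make option 2.a concrete, here is a rough sketch of what I have in mind. The names `loop_kernel`, `a_index`, `a_size` and `a_cols`, the (2P+1) by (4P+1) table layout, the all-ones dummy table, and the summing loop body are all placeholders made up for illustration; they stand in for the real indexing and the real work inside the loop.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s\n",                       \
                         cudaGetErrorString(err_));                        \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// ASSUMPTION (illustration only): A has (2P+1) rows and (4P+1) columns,
// stored row-major, and the loop body reads A[j + n][(m - k) + 2P].
__host__ __device__ inline int a_cols(int P) { return 4 * P + 1; }
__host__ __device__ inline int a_size(int P) { return (2 * P + 1) * a_cols(P); }
__host__ __device__ inline int a_index(int j, int k, int n, int m, int P) {
    return (j + n) * a_cols(P) + (m - k) + 2 * P;
}

__global__ void loop_kernel(const float* A_global, float* out, int P) {
    // Dynamically sized shared memory; the byte count is given at launch.
    extern __shared__ float A_shared[];

    // All threads of the block cooperate to copy A from global memory once.
    for (int i = threadIdx.x; i < a_size(P); i += blockDim.x)
        A_shared[i] = A_global[i];
    __syncthreads();  // A_shared must be complete before anyone reads it

    // Placeholder body: just sum the table entries that the loops touch.
    float acc = 0.0f;
    for (int j = 0; j <= P; j++)
        for (int k = -j; k <= j; k++)
            for (int n = 0; n <= j; n++)
                for (int m = -n; m <= n; m++)
                    acc += A_shared[a_index(j, k, n, m, P)];

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main() {
    const int P = 5, threads = 64, blocks = 2;
    const size_t a_bytes = a_size(P) * sizeof(float);

    std::vector<float> h_A(a_size(P), 1.0f);  // dummy table "precomputed" on the CPU
    std::vector<float> h_out(threads * blocks);
    const size_t out_bytes = h_out.size() * sizeof(float);

    float *d_A = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_A, a_bytes));
    CUDA_CHECK(cudaMalloc(&d_out, out_bytes));
    CUDA_CHECK(cudaMemcpy(d_A, h_A.data(), a_bytes, cudaMemcpyHostToDevice));

    // Third launch parameter: bytes of dynamic shared memory per block.
    loop_kernel<<<blocks, threads, a_bytes>>>(d_A, d_out, P);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, out_bytes, cudaMemcpyDeviceToHost));

    // With an all-ones table, each thread's sum equals the loop trip count.
    std::printf("out[0] = %.0f\n", h_out[0]);

    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

With this made-up layout, A for P = 20 would be 41 x 81 floats, about 13 KB, which fits within the default 48 KB of shared memory per block. The copy from global memory then happens once per block rather than once per loop iteration.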
Which of the above is likely to be the best option, or is there something better?