I’m writing a kernel that does some arithmetic operations on large arrays. However, each thread needs to know the values of 5 integer arrays of size 10.
// b0 is one integer array of size 10
// b1 is another integer array of size 10
// b4 is the final integer array of size 10
// all threads know const1, const2 etc…
int const1 = b0 + …some operation
int const2 = b1 + …some operation
//Assume _u and _v are in global memory…
deviceFunction( _u[ threadIdx.x + const1], _v[threadIdx.x + const2] );
deviceFunction operates on const1, const2, … constN, _u and _v.
The question is: if I leave the arrays b0 to b4 in global memory, I get the expected output, but every thread then makes multiple reads from global memory (for const1, const2, etc.), which costs a lot of clock cycles. How can I use shared memory for the b arrays, given that there will be 512 threads per block but each b array will always be of size 10?
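One common way to do this is to let the first few threads of each block copy the small arrays into shared memory, synchronize, and then have all 512 threads read from there. The following is only a minimal sketch under assumptions that are not in the post: b0 to b4 are stored back to back as one 50-int array `b`, "some operation" is replaced by simply taking the first element of b0 and b1, `deviceFunction` is a hypothetical stand-in that adds its two arguments, and a single block is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_ARRAYS 5    // b0 .. b4
#define ARRAY_LEN  10   // each b array holds 10 ints
#define THREADS    512  // threads per block, as in the question
#define PAD        16   // extra room so _u[tid + const] stays in bounds

// Hypothetical stand-in for the question's deviceFunction.
__device__ int deviceFunction(int u, int v) { return u + v; }

// Assumption: b holds b0..b4 back to back (50 ints) in global memory.
__global__ void kernel(const int *b, const int *_u, const int *_v, int *out)
{
    __shared__ int sb[NUM_ARRAYS * ARRAY_LEN];

    // Strided copy: with 512 threads only the first 50 do any work here.
    for (int i = threadIdx.x; i < NUM_ARRAYS * ARRAY_LEN; i += blockDim.x)
        sb[i] = b[i];
    __syncthreads();  // make the copy visible to every thread in the block

    // Placeholder for "some operation": first element of b0 and of b1.
    int const1 = sb[0 * ARRAY_LEN + 0];
    int const2 = sb[1 * ARRAY_LEN + 0];

    out[threadIdx.x] = deviceFunction(_u[threadIdx.x + const1],
                                      _v[threadIdx.x + const2]);
}

// Runs the kernel once and returns the number of mismatches against a
// CPU reference, or -1 if a CUDA call failed.
int run_example()
{
    const int nb = NUM_ARRAYS * ARRAY_LEN;
    const int n  = THREADS + PAD;
    int h_b[nb], h_u[n], h_v[n], h_out[THREADS];

    for (int i = 0; i < nb; ++i) h_b[i] = i / ARRAY_LEN + 1;  // b0 = 1s, b1 = 2s, ...
    for (int i = 0; i < n; ++i) { h_u[i] = i; h_v[i] = 2 * i; }

    int *d_b = nullptr, *d_u = nullptr, *d_v = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_b, nb * sizeof(int));
    cudaMalloc((void **)&d_u, n * sizeof(int));
    cudaMalloc((void **)&d_v, n * sizeof(int));
    cudaMalloc((void **)&d_out, THREADS * sizeof(int));
    cudaMemcpy(d_b, h_b, nb * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_u, h_u, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, h_v, n * sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<1, THREADS>>>(d_b, d_u, d_v, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, THREADS * sizeof(int),
                         cudaMemcpyDeviceToHost);

    cudaFree(d_b); cudaFree(d_u); cudaFree(d_v); cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int t = 0; t < THREADS; ++t) {
        int expected = h_u[t + h_b[0]] + h_v[t + h_b[ARRAY_LEN]];
        if (h_out[t] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    int bad = run_example();
    printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Because every thread reads the same 50 values and never writes them, declaring the arrays in `__constant__` memory and filling them with `cudaMemcpyToSymbol` may be worth trying as well; that would avoid the per-block copy and the `__syncthreads()` entirely.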