shared memory computation

Dominic_Chandar · September 30, 2010, 3:06pm

I’m writing a kernel to do some arithmetic operations involving large arrays. However, each thread needs to know the values of 5 integer arrays, each of size 10.
E.g.:

__global__ void kernel()
{

// b0[0] to b0[9] is one set of integer arrays
// b1[0] to b1[9] is another set of integer arrays
…
…
// b4[0] to b4[9] is the final set of integer arrays

// all threads know const1, const2 etc…
int const1 = b0[0] + …some operation
int const2 = b1[0] + …some operation

//Assume _u and _v are in global memory…

deviceFunction( _u[ threadIdx.x + const1], _v[threadIdx.x + const2] );

}

deviceFunction operates on const1, const2, … constN, _u and _v.

The question is: if I leave the arrays b0 to b4 in global memory, I get the expected output, but every thread makes multiple reads from global memory (to compute const1, const2, etc.), which costs a lot of clock cycles. How can I use shared memory for the b arrays, given that there will be 512 threads per block but each b array will always have only 10 elements?

-Dominic
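
One common way to handle this is to let the threads of each block copy the small arrays into shared memory once, synchronize, and then have every thread read from the shared copy. The sketch below is only an illustration under several assumptions that are not in the post: the five arrays are stored back to back in a single buffer `b` (so `b[k * ARRAY_LEN + i]` stands for `bk[i]`), `deviceFunction` simply adds its two arguments, `const1` and `const2` are just `b0[0]` and `b1[0]` in place of the elided "some operation", and `out`, `PAD` and the test data in `main` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_ARRAYS = 5;                  // b0 .. b4
constexpr int ARRAY_LEN  = 10;                 // each array has 10 elements
constexpr int B_SIZE     = NUM_ARRAYS * ARRAY_LEN;
constexpr int THREADS    = 512;                // threads per block, as in the post
constexpr int PAD        = 16;                 // hypothetical slack so u[tid + const] stays in bounds

// Hypothetical stand-in for the poster's deviceFunction.
__device__ int deviceFunction(int u, int v) { return u + v; }

// Assumption: b holds b0..b4 back to back, so b[k * ARRAY_LEN + i] is bk[i].
__global__ void kernel(const int *b, const int *u, const int *v, int *out)
{
    __shared__ int sb[B_SIZE];

    // Strided copy: with 512 threads only the first 50 do any work,
    // but the loop stays correct for any block size.
    for (int i = threadIdx.x; i < B_SIZE; i += blockDim.x)
        sb[i] = b[i];
    __syncthreads();  // every thread must reach this before reading sb

    // Placeholders for the elided "some operation" in the post.
    int const1 = sb[0 * ARRAY_LEN];  // b0[0]
    int const2 = sb[1 * ARRAY_LEN];  // b1[0]

    out[threadIdx.x] = deviceFunction(u[threadIdx.x + const1],
                                      v[threadIdx.x + const2]);
}

int main()
{
    // Made-up test data, chosen small enough that PAD covers the offsets used.
    std::vector<int> hb(B_SIZE), hu(THREADS + PAD), hv(THREADS + PAD), hout(THREADS);
    for (int k = 0; k < NUM_ARRAYS; ++k)
        for (int i = 0; i < ARRAY_LEN; ++i)
            hb[k * ARRAY_LEN + i] = k + i;
    for (int i = 0; i < THREADS + PAD; ++i) { hu[i] = i; hv[i] = 2 * i; }

    int *db, *du, *dv, *dout;
    cudaMalloc(&db, B_SIZE * sizeof(int));
    cudaMalloc(&du, (THREADS + PAD) * sizeof(int));
    cudaMalloc(&dv, (THREADS + PAD) * sizeof(int));
    cudaMalloc(&dout, THREADS * sizeof(int));
    cudaMemcpy(db, hb.data(), B_SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(du, hu.data(), (THREADS + PAD) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dv, hv.data(), (THREADS + PAD) * sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<1, THREADS>>>(db, du, dv, dout);

    // This blocking copy also surfaces errors from the kernel launch.
    cudaError_t err = cudaMemcpy(hout.data(), dout, THREADS * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Recompute on the host with the same placeholder constants and compare.
    int mismatches = 0;
    for (int t = 0; t < THREADS; ++t) {
        int expected = hu[t + hb[0]] + hv[t + hb[ARRAY_LEN]];
        if (hout[t] != expected) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(db); cudaFree(du); cudaFree(dv); cudaFree(dout);
    return mismatches != 0;
}
```

The mismatch between 512 threads and 50 values is not a problem in itself: the extra threads skip the copy and wait at `__syncthreads()`. Since every thread reads the same small, read-only values, `__constant__` memory may also be worth measuring as an alternative.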