How to use shared memory that is smaller than the problem size?

Hello,

Consider a problem where the same variables are read repeatedly. Time can be saved by copying those variables into shared memory and reading them there, rather than repeatedly reading from global memory.

Question:

Is it possible to benefit from shared memory when the problem size is larger than the array being copied? If so, how can one copy the smaller array into each block's shared memory so that it can be read across many blocks?

i.e.:

int n = 64;
int blockdim = 32;   // threads per block
int g_size = 12;     // number of floats in the small, repeatedly-read array

test<<<n/blockdim, blockdim, g_size*sizeof(float)>>>(global_var, g_size, n);

__global__ void test(float *global_var, int g_size, int n)
{

int tx = threadIdx.x;

int i = threadIdx.x + blockIdx.x * blockDim.x; // global index, 0..n-1

extern __shared__ float sdata[];

// fill shared up to var size limit rather than problem size limit

if(i < g_size)
{
    sdata[tx] = global_var[i];

    printf("share fill %i %f \n", blockIdx.x, sdata[0]);
}

__syncthreads();

// sdata is uninitialized here for every block with blockIdx.x > 0

printf("share check %i %f \n", blockIdx.x, sdata[0]);

}

The general technique you are looking for is called tiling. Load a tile into shared memory, work on it, then load the next tile into the same shared-memory buffer, and repeat until all tiles combined have covered the entire data set.
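A minimal sketch of that pattern, using a grid-stride loop so each block processes several tiles in turn (the kernel name process_tiled and the doubling operation are placeholders of my own, not from your code):

__global__ void process_tiled(const float *in, float *out, int n)
{
    extern __shared__ float tile[];   // blockDim.x floats, sized at launch
    int tx = threadIdx.x;

    // each block walks through the data one tile at a time,
    // reusing the same shared-memory buffer for every tile
    for (int base = blockIdx.x * blockDim.x; base < n; base += gridDim.x * blockDim.x)
    {
        if (base + tx < n)
            tile[tx] = in[base + tx];          // load the current tile
        __syncthreads();                       // tile fully loaded

        if (base + tx < n)
            out[base + tx] = 2.0f * tile[tx];  // placeholder work on the tile
        __syncthreads();                       // finish before the tile is overwritten
    }
}

A launch along the lines of process_tiled<<<numBlocks, blockdim, blockdim*sizeof(float)>>>(in, out, n) would size the shared buffer to one tile per block.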

In the context of CUDA, it is frequently the case that each tile is worked on by one thread block. The exact arrangement, that is, the mapping of threads to data, is task-specific and one of the primary design decisions when writing parallel code. You might want to experiment with a few arrangements to get a feel for it.
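Applied to the code in your post: shared memory is private to each block, so every block has to load its own copy of the small array. Because g_size (12) is smaller than the block size (32), index the load with the thread index tx rather than the global index i; that way every block, not just block 0, fills sdata. A minimal sketch of the corrected kernel (the launch line stays the same):

__global__ void test(float *global_var, int g_size, int n)
{
    int tx = threadIdx.x;
    extern __shared__ float sdata[];

    if(tx < g_size)                  // every block loads its own copy,
        sdata[tx] = global_var[tx];  // indexed by tx, not the global index i
    __syncthreads();                 // copy complete before any thread reads

    printf("share check %i %f \n", blockIdx.x, sdata[0]);  // now valid in all blocks
}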