how to use shared memory

I am trying to use CUDA to speed up my program, but I am not sure how to use shared memory. I bought the book "Programming Massively Parallel Processors", which has some samples, but I feel the sample (matrix multiplication) is way too simple, and a real-world application is not so easy to map onto it.

In my case, I want to use a 1d array of float values (suppose it has m elements, where m is about 1000) to interpolate a 2d array (suppose it has n * k elements, where n and k are each about 500). Calculating any one element of the 2d array needs two elements of the 1d array, but which two elements are used depends on some condition. I am sure it is a very simple program and a good candidate for CUDA.

It was very easy to write a simple version that runs on the GPU, but it is actually about 5x slower than the regular CPU code. So I started looking into reducing the number of global memory accesses. Of course the first step is to try to put the 1d array (about 4 KB in size) into each block's shared memory, which should be much faster than leaving it in global memory. But I am not sure how to put the 1d array into shared memory.

The sample code in the book looks like this:

__shared__ float Mds[TILE_WIDTH][TILE_WIDTH]; // TILE_WIDTH is a constant with value 2.

But when the code accesses it, the indices look like they can be greater than TILE_WIDTH.

Here is my simplified kernel code (my 2d array is still represented as a 1d array, stored row by row):

__global__ void MyKernel(float* array1d, float* array2d, int m, int n, int k, /* some other args */)
{
    // obviously in this case, both array1d and array2d are in global memory.
    // the question is: how to put array1d into shared memory?
    int i = blockIdx.x;
    int j = threadIdx.x;
    int index = j * k + i;
    int i0 = x;      // x is actually calculated from i, j and other args.
    int i1 = x + 1;
    array2d[index] = (array1d[i0] + array1d[i1]) / 2.0f;  // actually a more complicated and time-consuming computation involving the other args.
}

Here is how MyKernel is called:

dim3 dimBlock(n, 1);
dim3 dimGrid(k, 1);
MyKernel<<<dimGrid, dimBlock>>>(array1d, array2d, m, n, k, /* ... other args ... */);

Can anybody give me some advice? Thanks in advance.

The easiest way to use shared memory is to just declare the shared array in the CUDA code, but not inside a function:

#define SHALINE 1056

__shared__ short shLine[SHALINE];  // can be multidimensional

__device__ void populateShLine(char* d_line, long len, char* d_unProcd)
{
    // ...
    shLine[threadIdx.x] = d_unProcd[someIndex];  // d_unProcd is a global array, i.e. in the GPU's main RAM
    // ...
    for (...)
    {
        // do something using shLine[]
    }
}

(shLine might be your array1d.)

Shared arrays are in on-chip memory and have a latency of about 6 cycles.

Global arrays are in the GPU's main RAM and have a latency of hundreds of cycles (though if another block can run during that time, it may not matter).

The advantage of using shared memory comes when the shared array is accessed many times, e.g. inside a loop.
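
For example, applied to the kernel in the question it might look roughly like this (a minimal sketch, untested; MAX_M and computeX are assumptions: MAX_M is just an upper bound on m, and computeX stands in for however x is derived from i, j and the other args):

#define MAX_M 1024  // assumed upper bound on m (1024 floats = 4 KB of shared memory)

__device__ int computeX(int i, int j);  // hypothetical: however x is computed from i, j and other args

__global__ void MyKernelShared(float* array1d, float* array2d, int m, int n, int k)
{
    __shared__ float s_array1d[MAX_M];  // one copy per block, in on-chip memory

    // cooperative load: with blockDim.x threads, each thread copies every
    // blockDim.x-th element, so all m elements are covered even when m > blockDim.x
    for (int t = threadIdx.x; t < m; t += blockDim.x)
        s_array1d[t] = array1d[t];

    __syncthreads();  // wait until the whole block has finished loading

    int i = blockIdx.x;
    int j = threadIdx.x;
    int index = j * k + i;
    int i0 = computeX(i, j);
    int i1 = i0 + 1;
    array2d[index] = (s_array1d[i0] + s_array1d[i1]) / 2.0f;
}

After that, the load cost is paid once per block, and every subsequent read of s_array1d hits the fast on-chip memory instead of global memory.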

Thanks for the reply. I got shared memory working and saw some speed improvement, but I ran into a strange problem. If I do not use shared memory, all of my array2d values are calculated correctly. But if I use shared memory, only part of the values are correct. And I see a pattern: when the array2d indices satisfy sqrt(i*i + j*j) >= 256, the array2d element is wrong (e.g., array2d[160][199], array2d[161][198]). I wonder if it is caused by a shared memory size limit, but my grid is not big at all. I have not figured out the shared memory trick yet.

Note: only the threads within the same block can use a given shared memory array…
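
That also means every block has to load its own copy of the data, and the threads in the block must synchronize between the load and the first read. A missing __syncthreads() there is one common cause of "only some values are wrong" symptoms like yours. A minimal sketch of the required pattern (names and sizes are placeholders, not your actual code):

__global__ void Kernel(const float* g_in, float* g_out, int m)
{
    __shared__ float s_in[1024];          // each block gets its OWN private copy; assumes m <= 1024

    // each block must populate its copy itself; it cannot see another block's
    for (int t = threadIdx.x; t < m; t += blockDim.x)
        s_in[t] = g_in[t];

    __syncthreads();                      // without this barrier, some threads may
                                          // read s_in before other threads have
                                          // finished writing it, corrupting only
                                          // part of the output

    // ... compute with s_in[] and write to g_out[] ...
}

It is also worth verifying that the shared array is declared with at least m elements and that the indices into it never go past what was loaded; out-of-range shared reads would likewise corrupt only part of the results.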
