how to use shared memory

I am trying to use CUDA to speed up my program, but I am not sure how to use shared memory. I bought the book "Programming Massively Parallel Processors", which has some samples, but I feel the sample (matrix multiplication) is way too simple, and a real-world application is not so easy to map onto it.

In my case, I want to use a 1d array of float values (suppose it has m elements, where m is about 1000) to interpolate a 2d array (suppose it has n * k elements, where n and k are each about 500). Calculating any one element of the 2d array needs two elements of the 1d array, but which two elements are used depends on some condition. I am sure it is a very simple program and a good candidate for CUDA.

It was very easy to write a simple version that runs on the GPU, but it is actually about 5x slower than the regular CPU code. So I started looking into reducing the number of global memory accesses. Of course the first step is to try to put the 1d array (about 4 KB in size) into each block's shared memory, which should be much faster than leaving it in global memory. But I am not sure how to put the 1d array into shared memory.

The sample code in the book looks like this:

__shared__ float Mds[TILE_WIDTH][TILE_WIDTH]; // TILE_WIDTH is a constant with value 2.

But when the code accesses it, the indices look like they can be greater than TILE_WIDTH.

Here is my simplified kernel code (my 2d array is still represented as a 1d array, stored row by row):

__global__ void MyKernel(float* array1d, float* array2d, int m, int n, int k, /* some other args */)
{
    // obviously in this case, both array1d and array2d are in global memory.
    // the question is: how to put array1d into shared memory?
    int i = blockIdx.x;
    int j = threadIdx.x;
    int index = j * k + i;
    int i0 = x;      // x is actually calculated from i, j and other args.
    int i1 = x + 1;
    array2d[index] = (array1d[i0] + array1d[i1]) / 2.0f;  // actually a more complicated and time-consuming computation involving the other args.
}

Here is how MyKernel is called:

dim3 dimBlock(n, 1);
dim3 dimGrid(k, 1);
MyKernel<<<dimGrid, dimBlock>>>(array1d, array2d, m, n, k, /* ... other args ... */);

Can anybody give me some advice? Thanks in advance.

The easiest way to use shared memory is to just declare the shared array in the CUDA code, but not inside a function:

#define SHALINE 1056

__shared__ short shLine[SHALINE];  // can be multidimensional

__device__ void populateShLine(char* d_line, long len, char* d_unProcd)
{
    // ...
    shLine[threadIdx.x] = d_unProcd[someIndex];  // d_unProcd is a global array, i.e. in the GPU's main RAM
    // ...
    for (...)
    {
        // do something using shLine[]
    }
}

(shLine might be your array1d.)

Shared arrays are in on-chip memory and have a latency of about 6 cycles.

Global arrays are in the GPU's main RAM and have a latency of hundreds of cycles (though if another block can run during that time, it may not matter).

The advantage of using shared memory comes when the shared array is accessed many times, e.g. inside a loop.
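
For example, applied to the kernel in the question it might look roughly like this (a minimal sketch, untested; MAX_M and computeX are assumptions: MAX_M is just an upper bound on m, and computeX stands in for however x is derived from i, j and the other args):

#define MAX_M 1024  // assumed upper bound on m (1024 floats = 4 KB of shared memory)

__device__ int computeX(int i, int j);  // hypothetical: however x is computed from i, j and other args

__global__ void MyKernelShared(float* array1d, float* array2d, int m, int n, int k)
{
    __shared__ float s_array1d[MAX_M];  // one copy per block, in on-chip memory

    // cooperative load: with blockDim.x threads, each thread copies every
    // blockDim.x-th element, so all m elements are covered even when m > blockDim.x
    for (int t = threadIdx.x; t < m; t += blockDim.x)
        s_array1d[t] = array1d[t];

    __syncthreads();  // wait until the whole block has finished loading

    int i = blockIdx.x;
    int j = threadIdx.x;
    int index = j * k + i;
    int i0 = computeX(i, j);
    int i1 = i0 + 1;
    array2d[index] = (s_array1d[i0] + s_array1d[i1]) / 2.0f;
}

After that, the load cost is paid once per block, and every subsequent read of s_array1d hits the fast on-chip memory instead of global memory.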

Thanks for the reply. I got shared memory working and saw some speed improvement, but I ran into a strange problem. If I do not use shared memory, all of my array2d values are calculated correctly. But if I use shared memory, only part of the values are correct. And I see a pattern: when the array2d indices satisfy sqrt(i*i + j*j) >= 256, the array2d element is wrong (e.g., array2d[160][199], array2d[161][198]). I wonder if it is caused by a shared memory size limit, but my grid is not big at all. I have not figured out the shared memory trick yet.

Note: only the threads within the same block can use a given shared memory array…
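
That also means every block has to load its own copy of the data, and the threads in the block must synchronize between the load and the first read. A missing __syncthreads() there is one common cause of "only some values are wrong" symptoms like yours. A minimal sketch of the required pattern (names and sizes are placeholders, not your actual code):

__global__ void Kernel(const float* g_in, float* g_out, int m)
{
    __shared__ float s_in[1024];          // each block gets its OWN private copy; assumes m <= 1024

    // each block must populate its copy itself; it cannot see another block's
    for (int t = threadIdx.x; t < m; t += blockDim.x)
        s_in[t] = g_in[t];

    __syncthreads();                      // without this barrier, some threads may
                                          // read s_in before other threads have
                                          // finished writing it, corrupting only
                                          // part of the output

    // ... compute with s_in[] and write to g_out[] ...
}

It is also worth verifying that the shared array is declared with at least m elements and that the indices into it never go past what was loaded; out-of-range shared reads would likewise corrupt only part of the results.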
