Memory management issues: global and shared memory management

Hi all,

I have a program of this sort:

int main()
{
    // These two arrays are used by all threads
    float src_array1[256] = { /* ... */ };
    float src_array2[256] = { /* ... */ };

    // Data is of moderate size; each thread operates on 16 bytes of it
    float src_data[DATA_LEN] = { /* ... */ };   // ~1 MB
    float src_result[DATA_LEN];                 // ~1 MB

    // Device pointers
    float *array1, *array2, *data, *result;

    // Allocate device memory for the arrays
    cudaMalloc((void**)&array1, array1Size);
    cudaMalloc((void**)&array2, array2Size);

    // Allocate device memory for data and result
    cudaMalloc((void**)&data, dataSize);
    cudaMalloc((void**)&result, resultSize);

    // Copy the arrays and the data to device memory
    cudaMemcpy(data, src_data, dataSize, cudaMemcpyHostToDevice);
    cudaMemcpy(array1, src_array1, array1Size, cudaMemcpyHostToDevice);
    cudaMemcpy(array2, src_array2, array2Size, cudaMemcpyHostToDevice);

    // Launch the kernel (3000 blocks, 16 threads per block)
    dim3 dimGrid(3000, 1, 1);
    dim3 dimBlock(16, 1, 1);
    my_kernel<<<dimGrid, dimBlock>>>(result, data, array1, array2);
    cudaThreadSynchronize();

    // Copy the processed result from global memory back to the host
    cudaMemcpy(src_result, result, resultSize, cudaMemcpyDeviceToHost);
}
__global__ void my_kernel(float *result, float *data, float *array1, float *array2)
{
    // Processing of data using array1 and array2
    // ...
    // Problem 1: reading array1 and array2 from global memory here takes a lot of time
    // Problem 2: array1 and array2 are essentially lookup tables, so I need to access them quite often
    // ...
    // Copying the processed data to result in global memory
    // ...
    // Problem 3: writing the results back to global memory also takes a lot of time
}

I need suggestions regarding memory management. Most of my time is spent accessing array1 and array2 and writing the processed data back to global memory.

Thanks in advance!

Read the sections in the programming guide describing the concept of coalesced memory access. Only with coalesced access do you get peak memory bandwidth. It is often necessary to first stage data into shared memory using coalesced access, because random access to shared memory is much faster than random access to global memory.
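As a minimal sketch of what coalesced access looks like (the kernel name is made up for illustration, not from the post above): consecutive threads touch consecutive addresses, so the hardware can combine the transactions.

__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    // Consecutive threads read/write consecutive addresses -> coalesced
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}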

Also identify your hardware: devices with compute capability 1.2 and greater have more relaxed coalescing rules.

Random access to global memory is an order of magnitude slower than coalesced access. If you can (and the size allows it), put the lookup tables into the 64 KB of constant memory.
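A hedged sketch of what that could look like for one of the 256-entry tables (the c_array1 name is made up for illustration):

__constant__ float c_array1[256];  // lives in the 64 KB constant memory space

// On the host, before the kernel launch:
cudaMemcpyToSymbol(c_array1, src_array1, 256 * sizeof(float));

// In the kernel, read c_array1[i] directly instead of the global-memory table.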

Christian

Take a look at the programming guide, especially the sections about shared memory, constant memory, texture memory, and coalescing in global memory. Those are exactly the things you need. ;-)

A small 256-entry lookup table like yours can be copied into shared memory for best speed, or into constant memory for “OK” speed. Worst is to leave it in global memory, which is miserable for random lookups.

As Christian and Ocire said, read the programming guide's coverage of the different memory types and their properties… understanding them is the key to performance in almost any GPU app.

@SPWorley, @Ocire and @Christian

Thanks for your speedy replies. I followed your suggestions, used constant memory, and got a 5x speedup. Initially the kernel execution time was 30 ms; with the lookup tables in constant memory it is now 6 ms. Thanks once again!

Now I observe that the time-consuming parts of my kernel are reading the data from global memory and writing the result back to global memory. Any suggestions to improve this?

Secondly, I would appreciate a simple code example that uses shared memory.

Thanks in advance!

If each of your threads only reads one value from global memory and writes back another one, shared memory won't bring you any benefit.

You have to make sure your reads and writes are coalesced (see the programming guide).

And use a multiple of 64 threads per block to get the best performance (if your register usage allows for it).
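As a hedged illustration with the launch configuration from the original post: 3000 blocks of 16 threads is 48000 threads in total, which could instead be launched as 750 blocks of 64 threads:

dim3 dimGrid(750, 1, 1);   // 750 * 64 = 48000 threads, same total as 3000 * 16
dim3 dimBlock(64, 1, 1);
my_kernel<<<dimGrid, dimBlock>>>(result, data, array1, array2);

(The kernel's indexing would of course have to be written in terms of blockIdx.x * blockDim.x + threadIdx.x so it is independent of the block size.)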

You could use shared memory for your lookup tables; your kernel would then look something like this:

__global__ static void kernel(float *result, float *data, float *lut1, float *lut2)
{
    __shared__ float lut1_s[256];  // shared between all threads of a block
    __shared__ float lut2_s[256];

    // The first 256 threads of each block stage the tables into shared memory
    if (threadIdx.x < 256) {
        lut1_s[threadIdx.x] = lut1[threadIdx.x];
        lut2_s[threadIdx.x] = lut2[threadIdx.x];
    }
    __syncthreads();  // wait until both tables are fully loaded

    // Your normal calculations, just using lut1_s instead of lut1, etc.
}

This will work if you specify at least 256 threads per block. With fewer threads, each thread has to load more than one value, as sketched below.
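A minimal sketch of that loading loop (my addition, assuming the same 256-entry tables as above):

// Each thread loads table entries with a stride of blockDim.x,
// so this also works when blockDim.x < 256
for (int i = threadIdx.x; i < 256; i += blockDim.x) {
    lut1_s[i] = lut1[i];
    lut2_s[i] = lut2[i];
}
__syncthreads();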

Using shared memory here pays off as long as each block actually performs at least a few of those lookups.

@Ocire

Awesome! I shaved another 1.2 ms off my kernel execution time! Thanks a lot! Now I'm trying out a change in my logic that should save another 2 ms. I'll get back to you if I get stuck. Thanks once again, and have a great weekend!

kn: is your application really an arbitrary lookup table?

If you’re trying to optimize some function by precomputing values, you may find that a GPU’s math ops are much faster than lookup tables.

So in The Good Old Days you might apply a gamma ramp to some pixel values using a lookup table to avoid the slow powf() function, but the New Cool Way is just to use math and avoid memory accesses whenever possible.
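A hedged sketch of that idea (the kernel and parameter names are made up for illustration):

__global__ void apply_gamma(float *pixels, int n, float gamma)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pixels[i] = __powf(pixels[i], gamma);  // fast intrinsic instead of a table lookup
}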

This may not apply to your problem at all, but it’s something to keep in mind in general.

My lookup table is not arbitrary; it is generated from a mathematical expression. I tried computing the values directly as you suggested, but it degrades performance because evaluating the expression takes longer than the lookup.

As you said, though, in other cases math expressions are much faster than memory accesses.

Good point, SPWorley!

Mahesh

Have you tried the fast versions of the functions, i.e. __powf() instead of powf(), the reciprocal square root intrinsic, etc.?

Unfortunately, this lookup table only involves XOR and shift operations, so none of those functions apply, and I didn't try them out. I shall be using all of them in my next project, however.

Mahesh

if (threadIdx.x < 256) {
    lut1_s[threadIdx.x] = lut1[threadIdx.x];
    lut2_s[threadIdx.x] = lut2[threadIdx.x];
}

The above code causes 256 accesses to global memory per table. I would prefer a memcpy instead of the per-thread assignments.

Why would you?