Memory management issues: global and shared memory management

Hi all,

I have a program of this sort:

int main()
{
    // These two arrays are used by all threads
    float src_array1[256] = { /* ... */ };
    float src_array2[256] = { /* ... */ };

    // Data is of moderate size; each thread operates on 16 bytes of it
    float src_data[DATA_LEN] = { /* ... */ };   // ~1 MB
    float src_result[DATA_LEN];                 // ~1 MB

    // Device pointers
    float *array1, *array2, *data, *result;

    // Allocate device memory for the arrays
    cudaMalloc((void**)&array1, array1Size);
    cudaMalloc((void**)&array2, array2Size);

    // Allocate device memory for data and result
    cudaMalloc((void**)&data, dataSize);
    cudaMalloc((void**)&result, resultSize);

    // Copy the arrays and the data to device memory
    cudaMemcpy(data, src_data, dataSize, cudaMemcpyHostToDevice);
    cudaMemcpy(array1, src_array1, array1Size, cudaMemcpyHostToDevice);
    cudaMemcpy(array2, src_array2, array2Size, cudaMemcpyHostToDevice);

    // Launch the kernel (3000 blocks, 16 threads per block)
    dim3 dimGrid(3000, 1, 1);
    dim3 dimBlock(16, 1, 1);
    my_kernel<<<dimGrid, dimBlock>>>(result, data, array1, array2);
    cudaThreadSynchronize();

    // Copy the processed result from global memory back to the host
    cudaMemcpy(src_result, result, resultSize, cudaMemcpyDeviceToHost);
}
__global__ void my_kernel(float *result, float *data, float *array1, float *array2)
{
    // Processing of data using array1 and array2
    // ...
    // Problem 1: reading array1 and array2 from global memory here takes a lot of time
    // Problem 2: array1 and array2 are essentially lookup tables, so I need to access them quite often
    // ...
    // Copying the processed data to result in global memory
    // ...
    // Problem 3: writing the results back to global memory also takes a lot of time
}

I need suggestions regarding memory management. Most of my time is spent accessing array1 and array2 and writing the processed data back to global memory.

Thanks in advance!

Read the sections in the programming guide describing the concept of coalesced memory access. Only with coalesced access do you get peak memory bandwidth. It is often necessary to first stage data into shared memory using coalesced access, because random access to shared memory is much faster than random access to global memory.
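As a minimal sketch of what coalesced access looks like (the kernel name is made up for illustration, not from the post above): consecutive threads touch consecutive addresses, so the hardware can combine the transactions.

__global__ void copy_coalesced(float *dst, const float *src, int n)
{
    // Consecutive threads read/write consecutive addresses -> coalesced
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}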

Also identify your hardware: devices with compute capability 1.2 and greater have more relaxed coalescing rules.

Random access to global memory is an order of magnitude slower than coalesced access. If you can (and the size allows it), put the lookup tables into the 64 KB of constant memory.
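A hedged sketch of what that could look like for one of the 256-entry tables (the c_array1 name is made up for illustration):

__constant__ float c_array1[256];  // lives in the 64 KB constant memory space

// On the host, before the kernel launch:
cudaMemcpyToSymbol(c_array1, src_array1, 256 * sizeof(float));

// In the kernel, read c_array1[i] directly instead of the global-memory table.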

Christian

Take a look at the programming guide, especially the sections about shared memory, constant memory, texture memory, and coalescing in global memory. Those are exactly the things you need. ;-)

A small 256-entry lookup table like yours can be copied into shared memory for best speed, or into constant memory for “OK” speed. Worst is to leave it in global memory, which is miserable for random lookups.

As Christian and Ocire said, read the programming guide's coverage of the different memory types and their properties… understanding them is the key to performance in almost any GPU app.

@SPWorley, @Ocire and @Christian

Thanks for your speedy replies. I followed your suggestions, used constant memory, and got a 5x speedup. Initially the kernel execution time was 30 ms; with the lookup tables in constant memory it is now 6 ms. Thanks once again!

Now I observe that the time-consuming parts of my kernel are reading the data from global memory and writing the result back to global memory. Any suggestions to improve this?

Secondly, I would appreciate a simple code example that uses shared memory.

Thanks in advance!

If each of your threads only reads one value from global memory and writes back another one, shared memory won't bring you any benefit.

You have to make sure your reads and writes are coalesced (see the programming guide).

And use a multiple of 64 threads per block to get the best performance (if your register usage allows for it).
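As a hedged illustration with the launch configuration from the original post: 3000 blocks of 16 threads is 48000 threads in total, which could instead be launched as 750 blocks of 64 threads:

dim3 dimGrid(750, 1, 1);   // 750 * 64 = 48000 threads, same total as 3000 * 16
dim3 dimBlock(64, 1, 1);
my_kernel<<<dimGrid, dimBlock>>>(result, data, array1, array2);

(The kernel's indexing would of course have to be written in terms of blockIdx.x * blockDim.x + threadIdx.x so it is independent of the block size.)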

You could use shared memory for your lookup tables; your kernel would then look something like this:

__global__ static void kernel(float *result, float *data, float *lut1, float *lut2)
{
    __shared__ float lut1_s[256];  // shared between all threads of a block
    __shared__ float lut2_s[256];

    // The first 256 threads of each block stage the tables into shared memory
    if (threadIdx.x < 256) {
        lut1_s[threadIdx.x] = lut1[threadIdx.x];
        lut2_s[threadIdx.x] = lut2[threadIdx.x];
    }
    __syncthreads();  // wait until both tables are fully loaded

    // Your normal calculations, just using lut1_s instead of lut1, etc.
}

This will work if you specify at least 256 threads per block. With fewer threads, each thread has to load more than one value, as sketched below.
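A minimal sketch of that loading loop (my addition, assuming the same 256-entry tables as above):

// Each thread loads table entries with a stride of blockDim.x,
// so this also works when blockDim.x < 256
for (int i = threadIdx.x; i < 256; i += blockDim.x) {
    lut1_s[i] = lut1[i];
    lut2_s[i] = lut2[i];
}
__syncthreads();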

Using shared memory here pays off as long as each block actually performs at least a few of those lookups.

@Ocire

Awesome! I shaved another 1.2 ms off my kernel execution time! Thanks a lot! Now I'm trying out a change in my logic that should save another 2 ms. I'll get back to you if I get stuck. Thanks once again, and have a great weekend!

kn: is your application really an arbitrary lookup table?

If you’re trying to optimize some function by precomputing values, you may find that a GPU’s math ops are much faster than lookup tables.

So in The Good Old Days you might apply a gamma ramp to some pixel values using a lookup table to avoid the slow powf() function, but the New Cool Way is just to use math and avoid memory accesses whenever possible.
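A hedged sketch of that idea (the kernel and parameter names are made up for illustration):

__global__ void apply_gamma(float *pixels, int n, float gamma)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        pixels[i] = __powf(pixels[i], gamma);  // fast intrinsic instead of a table lookup
}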

This may not apply to your problem at all, but it’s something to keep in mind in general.

My lookup table is not arbitrary; it is generated from a mathematical expression. I tried computing the values directly as you suggested, but it degrades performance because evaluating the expression takes longer than the lookup.

As you said, though, in other cases math expressions are much faster than memory accesses.

Good point, SPWorley!

Mahesh

Have you tried the fast versions of the functions, i.e. __powf() instead of powf(), the reciprocal square root intrinsic, etc.?

Unfortunately, this lookup table only involves XOR and shift operations, so none of those functions apply, and I didn't try them out. I shall be using all of them in my next project, however.

Mahesh

if (threadIdx.x < 256) {
    lut1_s[threadIdx.x] = lut1[threadIdx.x];
    lut2_s[threadIdx.x] = lut2[threadIdx.x];
}

The above code causes 256 accesses to global memory per table. I would prefer a memcpy instead of the per-thread assignments.

Why would you?