Some confusion about using shared memory.

Hi everyone,

I want to make full use of my 16 KB of shared memory, but I have some confusion about how to use it.

In my code, the kernel structure is:

__device__ void device_1(unsigned int *Array_1)
{
	__shared__ unsigned int Sh_Array[256*12];	// size 256*12 is what this device function requires

	// Some lines of code here.
}

__device__ void device_2(unsigned int *Array_1, signed short int *Array_2)
{
	__shared__ unsigned int Sh_Array[256*16];	// size 256*16 is what this device function requires

	// Some lines of code here.
}

__global__ void kernel_foo(unsigned int *Array_1, signed short int *Array_2)
{
	__shared__ unsigned int share_array_1[256*10];

	// Some lines of code here.

	// Calling the device functions:
	device_1(Array_1);
	device_2(Array_1, Array_2);
}

The kernel launch configuration is:

kernel_foo<<<1000,256>>>( Array_1, Array_2 );

My confusion is: must the total shared memory (the __global__ function's plus the __device__ functions') be less than or equal to 16 KB, or is device_1's shared memory reused by device_2?

During execution of device_1, other threads can already be in device_2, so the shared memory can't be reused: share_array_1 and both Sh_Array arrays will all be allocated, giving about 38 KB of shared memory used in total.

But I could be wrong. ;)

You could allocate your shared memory at global scope and use __syncthreads() to ensure that all threads use this memory for the same purpose.

Device functions are always inlined, so the shared memory declarations add up across all of them, __device__ and __global__ alike. You can't use all 16 KB of it either; some of the shared memory is used by the system. Read the nvcc manual, last page. See this thread:
http://forums.nvidia.com/index.php?showtop…l=shared+memory

KK,

You ask a very valid question that was discussed in these forums before, but no one actually bothered to find the answer (or maybe I just did not read it).

Use the "-keep" option of NVCC, compile, and check the ".cubin" file; it lists the resources consumed by your kernel. Check it out. Resources include shared memory, registers per thread, local memory, and constant memory, among other info.

Also, do an "err = cudaThreadSynchronize()" after the kernel call to find errors, and use the "cudaGetLastError" and "cudaGetErrorString" APIs to dump errors from kernel launches.
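For example, a host-side fragment (the kernel and argument names are taken from the code above; the rest is the usual boilerplate):

```cuda
// Launch, then check both launch errors and execution errors.
kernel_foo<<<1000, 256>>>(Array_1, Array_2);

cudaError_t err = cudaGetLastError();          // invalid-configuration errors show up here
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // waits for the kernel; execution errors show up here
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));
```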

Good luck

I have read the forum thread you linked, in which Sarnath wrote:

But in my code:

__global__ void kernel_foo(unsigned int *Array_1, unsigned int *Array_2, unsigned int *d_y)
{
	unsigned int Array_3[4][8][8];
	int i, l;	// x and y_ are set elsewhere (not shown here)

	long idx = blockIdx.x * blockDim.x + threadIdx.x;
	long end = 1200 * 1000;

	if (idx < end)
	{
		for (i = 0; i < 4; i++)
		{
			for (l = x; l < 3; l++)
			{
				Array_3[i][0][l-x] = Array_1[d_y[y_+0]+i];
				Array_3[i][1][l-x] = Array_1[d_y[y_+1]+i];
				Array_3[i][2][l-x] = Array_1[d_y[y_+2]+i];
				Array_3[i][3][l-x] = Array_1[d_y[y_+3]+i];
				Array_3[i][4][l-x] = Array_1[d_y[y_+4]+i];
				Array_3[i][5][l-x] = Array_1[d_y[y_+5]+i];
				Array_3[i][6][l-x] = Array_1[d_y[y_+6]+i];
				Array_3[i][7][l-x] = Array_1[d_y[y_+7]+i];
			}

			for (l = 3 - x; l < 8; l++)
			{
				int dl = idx + (l << 2) + i;
				Array_3[i][0][l] = Array_1[d_y[y_+0]+dl];
				Array_3[i][1][l] = Array_1[d_y[y_+1]+dl];
				Array_3[i][2][l] = Array_1[d_y[y_+2]+dl];
				Array_3[i][3][l] = Array_1[d_y[y_+3]+dl];
				Array_3[i][4][l] = Array_1[d_y[y_+4]+dl];
				Array_3[i][5][l] = Array_1[d_y[y_+5]+dl];
				Array_3[i][6][l] = Array_1[d_y[y_+6]+dl];
				Array_3[i][7][l] = Array_1[d_y[y_+7]+dl];
			}
		}

		device_1(Array_1, Array_3);
		device_2(Array_1, Array_2);	// device_1 and device_2 are device functions
	}
}

My launch configuration is:

kernel_foo<<<1000,256>>>( Array_1, Array_2, d_y );

But this kernel takes a long time to run, so I want to use shared memory. However, I face two problems:

  1. Array_3 takes 4*8*8*256*4 = 256 KB of memory per block; even with a block size of 128 threads it needs far too much shared memory.

  2. Accesses to Array_1 are not sequential; they depend on the values stored in d_y.

So, should I use shared memory for my kernel?

If yes, then HOW?

By the way, whatever I quoted in that thread was for hacking; none of that is normally needed for CUDA computation. Please mark these words: don't go hacking while you are still learning.

Coming back,

Think of shared memory as a conscious cache.

Unlike a CPU cache, which transparently caches whatever you access without you thinking about it, the GPU's shared memory is a conscious cache that the programmer must manage deliberately.

That is: whatever data you expect to need frequently should be staged in shared memory explicitly. After computing, store the results back to global memory, then fetch the next set from global memory into shared memory and repeat.

This assumes there is always a working set whose elements are close together, i.e. you can bring a contiguous chunk of data into shared memory and compute partial results from it.

If your working set is so huge that it can't fit in shared memory, then you need to keep your data in global memory (perhaps with a partial set in shared memory). For that kind of random access pattern, you can try the texture path to global memory, which is an unconscious cache (just like a CPU cache).

But for learning purposes, first code an algorithm using shared memory in the normal way and learn how to use it. Then you can turn to textures.
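As a concrete sketch of that staging pattern (the kernel name and the trivial neighbour-sum computation are made up purely for illustration):

```cuda
__global__ void stagedSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // the consciously managed "cache"
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Stage a contiguous chunk from global memory into shared memory.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // 2. Wait until the whole tile is loaded.

    // 3. Compute on the staged data (here: a trivial sum with the left neighbour).
    float result = tile[threadIdx.x];
    if (threadIdx.x > 0)
        result += tile[threadIdx.x - 1];

    // 4. Store the partial result back to global memory.
    if (i < n)
        out[i] = result;
}
```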

I just pass

--ptxas-options=-v

as an additional parameter to NVCC. During compilation this prints the amount of resources consumed by each kernel, without any need to play with .cubin files.

The output may be somewhat confusing, though. Here is my example:

Used 23 registers, 24+0 bytes lmem, 10804+10800 bytes smem, 132 bytes cmem[0], 100 bytes cmem[1], 12 bytes cmem[14]

This means my kernel uses:

  • 23 registers

  • 24 bytes of local memory

  • 10800 bytes of shared memory, though 10804 bytes have been allocated

  • 244 (132+100+12) bytes of constant memory

But suppose a __global__ function calls 4 device functions. If I use a shared-memory array inside the first device function, can I reuse that shared memory inside the second device function (and so on) after the first one has finished executing?

As we know, device functions are inlined. So how does one efficiently use shared memory across different device functions?

I think someone already answered this. While some threads in a block are executing the device_1 function, other threads may be executing device_2/device_3/device_4, precluding any possibility of sharing the shared memory.

However, introducing a __syncthreads() can prevent this overlap. But I am not sure what the compiler does.

That's why we are advising you to check the cubin (or the ptxas -v output) to see how much shared memory your kernel occupies with and without __syncthreads() between all the device functions. If you find something, please update us.

My guess is that the shared memory will simply be ADDed, no matter what.

I am sure this was discussed before, but I don't remember the results of that discussion.

I believe shared memory inside a device function is declared similarly to a 'static' variable in a normal C function: if you call the same function several times, you reuse the same variable. However, I think this variable cannot overlap with a variable from a different function.

I have to check it, though!

PDan appears to be right. (Read on…)

Here is a small test I wrote:

__device__ void clear(float *c, int n)
{
	__shared__ float hello[512];
	int i;

	for (i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x)
	{
		hello[threadIdx.x] = c[i];
		hello[threadIdx.x] += i;
		c[i] = hello[threadIdx.x];
	}
}

__global__ void doSomething(float *c, int n)
{
	clear(c, n);
	clear(c, n);
	clear(c, n);
	clear(c, n);
	clear(c, n);
}

The kernel is launched with 512 threads per block. The amount of shared memory was always 2072 bytes (reported as 2072+24), irrespective of how many times I call this "clear".

Even without __syncthreads() it behaves this way.

This is clearly dangerous and possibly a bug. I use CUDA 2.2.

–edit–

However, when I added another device function, the amount of shared memory doubled.

So, the moral of the story is:

Shared memory is NOT shared across different device functions.

However, multiple instantiations of the same device function do share the same memory.

So one needs a __syncthreads() between the device function calls to avoid the instantiations stepping on each other.

However, there might be occasions when one wants to call device functions inside CONDITIONALs (where a barrier is unsafe).

So it is up to the programmer to use the different instantiations of device functions judiciously.
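To illustrate with a made-up example (not your code), the barrier between two instantiations of the same device function looks like this:

```cuda
__device__ void step(float *c, int n)
{
    __shared__ float buf[512];               // every call site shares this one buffer
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = c[i] + 1.0f;
        c[i] = buf[threadIdx.x];
    }
}

__global__ void twoSteps(float *c, int n)
{
    step(c, n);
    __syncthreads();   // all threads leave the first instantiation before buf is reused
    step(c, n);
}
```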

Hello Sarnath,

These points clear up more of my earlier confusion about using shared memory; thanks for that.

In your previous post you wrote:

So I am listing my shared memory occupancies:

How did you get 36944 bytes of shared memory when your 4 individual functions each use around 2K? You should get about 8K, shouldn't you?

Do you understand how that 36944 comes about? If you understand it, fine.

What I actually need is a little different from your code:

__device__ void clear_1(float *c, int n)
{
	__shared__ float hello[512];
	int i;

	for (i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x)
	{
		hello[threadIdx.x] = c[i];
		hello[threadIdx.x] += i;
		c[i] = hello[threadIdx.x];
	}
}

__device__ void clear_2(int *c, int n)
{
	__shared__ int hello[512];
	int i;

	for (i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x)
	{
		hello[threadIdx.x] = c[i];
		hello[threadIdx.x] /= i;	// some manipulation different from clear_1()
		c[i] = hello[threadIdx.x];
	}
}

__global__ void doSomething(float *c, int *e, int n)
{
	clear_1(c, n);
	clear_2(e, n);
}

Actually, I want to know: is Total shared memory(doSomething()) == Total shared memory(clear_1()) + Total shared memory(clear_2())?

From what I understand, you already mentioned that:

If this is the actual picture, then how does one handle this shortage of shared memory?

Actually, one of my device functions that takes a lot of shared memory is:

#define coeff 512

__device__ void device_function(unsigned int Array[8][8],
		signed short int a0, signed short int a1, signed short int a2, signed short int a3,
		signed short int a4, signed short int a5, signed short int a6, signed short int a7,
		int *Out_array)
{
	Out_array[0] = (__mul24(a0, Array[0][0]) + __mul24(a1, Array[0][1]) + __mul24(a2, Array[0][2]) + __mul24(a3, Array[0][3]) +
			__mul24(a4, Array[0][4]) + __mul24(a5, Array[0][5]) + __mul24(a6, Array[0][6]) + __mul24(a7, Array[0][7]) + coeff);
	Out_array[1] = (__mul24(a0, Array[1][0]) + __mul24(a1, Array[1][1]) + __mul24(a2, Array[1][2]) + __mul24(a3, Array[1][3]) +
			__mul24(a4, Array[1][4]) + __mul24(a5, Array[1][5]) + __mul24(a6, Array[1][6]) + __mul24(a7, Array[1][7]) + coeff);
	Out_array[2] = (__mul24(a0, Array[2][0]) + __mul24(a1, Array[2][1]) + __mul24(a2, Array[2][2]) + __mul24(a3, Array[2][3]) +
			__mul24(a4, Array[2][4]) + __mul24(a5, Array[2][5]) + __mul24(a6, Array[2][6]) + __mul24(a7, Array[2][7]) + coeff);
	Out_array[3] = (__mul24(a0, Array[3][0]) + __mul24(a1, Array[3][1]) + __mul24(a2, Array[3][2]) + __mul24(a3, Array[3][3]) +
			__mul24(a4, Array[3][4]) + __mul24(a5, Array[3][5]) + __mul24(a6, Array[3][6]) + __mul24(a7, Array[3][7]) + coeff);
	Out_array[4] = (__mul24(a0, Array[4][0]) + __mul24(a1, Array[4][1]) + __mul24(a2, Array[4][2]) + __mul24(a3, Array[4][3]) +
			__mul24(a4, Array[4][4]) + __mul24(a5, Array[4][5]) + __mul24(a6, Array[4][6]) + __mul24(a7, Array[4][7]) + coeff);
	Out_array[5] = (__mul24(a0, Array[5][0]) + __mul24(a1, Array[5][1]) + __mul24(a2, Array[5][2]) + __mul24(a3, Array[5][3]) +
			__mul24(a4, Array[5][4]) + __mul24(a5, Array[5][5]) + __mul24(a6, Array[5][6]) + __mul24(a7, Array[5][7]) + coeff);
	Out_array[6] = (__mul24(a0, Array[6][0]) + __mul24(a1, Array[6][1]) + __mul24(a2, Array[6][2]) + __mul24(a3, Array[6][3]) +
			__mul24(a4, Array[6][4]) + __mul24(a5, Array[6][5]) + __mul24(a6, Array[6][6]) + __mul24(a7, Array[6][7]) + coeff);
	Out_array[7] = (__mul24(a0, Array[7][0]) + __mul24(a1, Array[7][1]) + __mul24(a2, Array[7][2]) + __mul24(a3, Array[7][3]) +
			__mul24(a4, Array[7][4]) + __mul24(a5, Array[7][5]) + __mul24(a6, Array[7][6]) + __mul24(a7, Array[7][7]) + coeff);
}

If I take 256 threads per block, then each thread requires 64 elements for Array plus 8 elements for Out_array (I want to keep both arrays in shared memory).

So the total shared memory required per block to execute this device function is (64 + 8) elements * 4 bytes (the element type is int) * 256 threads = 73728 bytes. Even with 64 threads per block, this function still requires 18432 bytes of shared memory, which exceeds the 16 KB available.

The moral is:

How to handle this problem?


I am just bewildered by this code.

Why can't you just put it in some kind of FOR loop with a neat expression?

Readability is the single most important and desired property of a programmer's work.

Also, if you are short of shared memory because the smem of the device functions adds up, declare one shared array in your kernel and pass a pointer to it as an argument to your device functions.

The compiler should be smart enough to inline it correctly (although there are some compiler quirks; I don't want to confuse you now. You will know when you hit the advisory warning).
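A minimal sketch of that suggestion (hypothetical names; one shared array declared in the kernel and borrowed by the device function through a pointer argument):

```cuda
__device__ void stage_and_scale(float *buf, const float *src, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = src[i] * s;           // works on whatever buffer the caller passes
}

__global__ void kernel_shared_arg(const float *a, const float *b, float *out)
{
    __shared__ float scratch[256];           // the only shared allocation in the kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    stage_and_scale(scratch, a, 2.0f);       // first device-function call uses scratch
    __syncthreads();
    out[i] = scratch[threadIdx.x];

    __syncthreads();                         // everyone finishes reading before reuse
    stage_and_scale(scratch, b, 0.5f);       // second call reuses the same memory
    __syncthreads();
    out[i] += scratch[threadIdx.x];
}
```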

I am trying that; I will report the results of that implementation in a few moments…

I have implemented my device function in two ways.

FIRST way:

extern __shared__ char sh_array[];

__device__ void device_function(unsigned int Array[8],
		signed short int a0, signed short int a1, signed short int a2, signed short int a3,
		signed short int a4, signed short int a5, signed short int a6, signed short int a7,
		int *Out_array)
{
	int tid = threadIdx.x;
	unsigned char *RA_array = (unsigned char *)sh_array;

	for (int i = 0; i < 8; ++i)
	{
		RA_array[8*tid+i] = Array[i];
	}

	(*Out_array) = (__mul24(a0, RA_array[8*tid+0]) + __mul24(a1, RA_array[8*tid+1]) + __mul24(a2, RA_array[8*tid+2]) + __mul24(a3, RA_array[8*tid+3]) +
			__mul24(a4, RA_array[8*tid+4]) + __mul24(a5, RA_array[8*tid+5]) + __mul24(a6, RA_array[8*tid+6]) + __mul24(a7, RA_array[8*tid+7]) + coeff);
}

But this does not give any improvement in execution time. Also, the output is corrupted.

SECOND way, the same device function:

__device__ void device_function(unsigned int Array[8],
		signed short int a0, signed short int a1, signed short int a2, signed short int a3,
		signed short int a4, signed short int a5, signed short int a6, signed short int a7,
		int *Out_array, int tid)	// tid is threadIdx.x
{
	__shared__ unsigned char RA_array[256*8];	// 256 is the number of threads per block
	__shared__ signed short int RA_Out_array[256];

	for (int i = 0; i < 8; ++i)
	{
		RA_array[8*tid+i] = Array[i];
	}

	RA_Out_array[tid] = (__mul24(a0, RA_array[8*tid+0]) + __mul24(a1, RA_array[8*tid+1]) + __mul24(a2, RA_array[8*tid+2]) + __mul24(a3, RA_array[8*tid+3]) +
			__mul24(a4, RA_array[8*tid+4]) + __mul24(a5, RA_array[8*tid+5]) + __mul24(a6, RA_array[8*tid+6]) + __mul24(a7, RA_array[8*tid+7]) + coeff);

	(*Out_array) = RA_Out_array[tid];
}

This gives correct output, but the execution time is the same as before.

I cannot see any improvement in execution time from using shared memory here. WHY?

Please help with the above problem. :mellow: