Some confusion about using shared memory.

Hi everyone,

I want to make full use of my 16 KB of shared memory, but I have some confusion about how to use it.

In my code, the kernel structure is:

__device__ void device_1(unsigned int *Array_1)
{
	__shared__ unsigned int Sh_Array[256*12];	// size 256*12 is what this device function requires

	// Some lines of code here.
}

__device__ void device_2(unsigned int *Array_1, signed short int *Array_2)
{
	__shared__ unsigned int Sh_Array[256*16];	// size 256*16 is what this device function requires

	// Some lines of code here.
}

__global__ void kernel_foo(unsigned int *Array_1, signed short int *Array_2)
{
	__shared__ unsigned int share_array_1[256*10];

	// Some lines of code here.

	// Calling the device functions:
	device_1(Array_1);
	device_2(Array_1, Array_2);
}

The kernel launch configuration is:

kernel_foo<<<1000,256>>>( Array_1, Array_2 );

My confusion is: must the total shared memory (the __global__ function's plus the __device__ functions') be less than or equal to 16 KB, or is device_1's shared memory reused by device_2?

During execution of device_1, other threads can already be in device_2, so the shared memory can't be reused: share_array_1 and both Sh_Array arrays will all be allocated, giving about 38 KB of shared memory used in total.

But I could be wrong. ;)

You could allocate your shared memory at global scope and use __syncthreads() to ensure that all threads use this memory for the same purpose.

Device functions are always inlined, so the shared memory declarations add up across all of them, __device__ and __global__ alike. You can't use all 16 KB of it either; some of the shared memory is used by the system. Read the nvcc manual, last page. See this thread:
http://forums.nvidia.com/index.php?showtop…l=shared+memory

KK,

You ask a very valid question that was discussed in these forums before, but no one actually bothered to find the answer (or maybe I just did not read it).

Use the "-keep" option of NVCC, compile, and check the ".cubin" file; it lists the resources consumed by your kernel. Check it out. Resources include shared memory, registers per thread, local memory, and constant memory, among other info.

Also, do an "err = cudaThreadSynchronize()" after the kernel call to find errors, and use the "cudaGetLastError" and "cudaGetErrorString" APIs to dump errors from kernel launches.
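For example, a host-side fragment (the kernel and argument names are taken from the code above; the rest is the usual boilerplate):

```cuda
// Launch, then check both launch errors and execution errors.
kernel_foo<<<1000, 256>>>(Array_1, Array_2);

cudaError_t err = cudaGetLastError();          // invalid-configuration errors show up here
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // waits for the kernel; execution errors show up here
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));
```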

Good luck

I have read the forum thread you linked, in which Sarnath wrote:

But in my code:

__global__ void kernel_foo(unsigned int *Array_1, unsigned int *Array_2, unsigned int *d_y)
{
	unsigned int Array_3[4][8][8];
	int i, l;	// x and y_ are set elsewhere (not shown here)

	long idx = blockIdx.x * blockDim.x + threadIdx.x;
	long end = 1200 * 1000;

	if (idx < end)
	{
		for (i = 0; i < 4; i++)
		{
			for (l = x; l < 3; l++)
			{
				Array_3[i][0][l-x] = Array_1[d_y[y_+0]+i];
				Array_3[i][1][l-x] = Array_1[d_y[y_+1]+i];
				Array_3[i][2][l-x] = Array_1[d_y[y_+2]+i];
				Array_3[i][3][l-x] = Array_1[d_y[y_+3]+i];
				Array_3[i][4][l-x] = Array_1[d_y[y_+4]+i];
				Array_3[i][5][l-x] = Array_1[d_y[y_+5]+i];
				Array_3[i][6][l-x] = Array_1[d_y[y_+6]+i];
				Array_3[i][7][l-x] = Array_1[d_y[y_+7]+i];
			}

			for (l = 3 - x; l < 8; l++)
			{
				int dl = idx + (l << 2) + i;
				Array_3[i][0][l] = Array_1[d_y[y_+0]+dl];
				Array_3[i][1][l] = Array_1[d_y[y_+1]+dl];
				Array_3[i][2][l] = Array_1[d_y[y_+2]+dl];
				Array_3[i][3][l] = Array_1[d_y[y_+3]+dl];
				Array_3[i][4][l] = Array_1[d_y[y_+4]+dl];
				Array_3[i][5][l] = Array_1[d_y[y_+5]+dl];
				Array_3[i][6][l] = Array_1[d_y[y_+6]+dl];
				Array_3[i][7][l] = Array_1[d_y[y_+7]+dl];
			}
		}

		device_1(Array_1, Array_3);
		device_2(Array_1, Array_2);	// device_1 and device_2 are device functions
	}
}

My launch configuration is:

kernel_foo<<<1000,256>>>( Array_1, Array_2, d_y );

But this kernel takes a long time to run, so I want to use shared memory. However, I face two problems:

  1. Array_3 takes 4*8*8*256*4 = 256 KB of memory per block; even with a block size of 128 threads it needs far too much shared memory.

  2. Accesses to Array_1 are not sequential; they depend on the values stored in d_y.

So, should I use shared memory for my kernel?

If yes, then HOW?

By the way, whatever I quoted in that thread was for hacking; none of that is normally needed for CUDA computation. Please mark these words: don't go hacking while you are still learning.

Coming back,

Think of shared memory as a conscious cache.

Unlike a CPU cache, which transparently caches whatever you access without you thinking about it, the GPU's shared memory is a conscious cache that the programmer must manage deliberately.

That is: whatever data you expect to need frequently should be staged in shared memory explicitly. After computing, store the results back to global memory, then fetch the next set from global memory into shared memory and repeat.

This assumes there is always a working set whose elements are close together, i.e. you can bring a contiguous chunk of data into shared memory and compute partial results from it.

If your working set is so huge that it can't fit in shared memory, then you need to keep your data in global memory (perhaps with a partial set in shared memory). For that kind of random access pattern, you can try the texture path to global memory, which is an unconscious cache (just like a CPU cache).

But for learning purposes, first code an algorithm using shared memory in the normal way and learn how to use it. Then you can turn to textures.
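As a concrete sketch of that staging pattern (the kernel name and the trivial neighbour-sum computation are made up purely for illustration):

```cuda
__global__ void stagedSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // the consciously managed "cache"
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Stage a contiguous chunk from global memory into shared memory.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // 2. Wait until the whole tile is loaded.

    // 3. Compute on the staged data (here: a trivial sum with the left neighbour).
    float result = tile[threadIdx.x];
    if (threadIdx.x > 0)
        result += tile[threadIdx.x - 1];

    // 4. Store the partial result back to global memory.
    if (i < n)
        out[i] = result;
}
```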

I just pass

--ptxas-options=-v

as an additional parameter to NVCC. During compilation this prints the amount of resources consumed by each kernel, without any need to play with .cubin files.

The output may be somewhat confusing, though. Here is my example:

Used 23 registers, 24+0 bytes lmem, 10804+10800 bytes smem, 132 bytes cmem[0], 100 bytes cmem[1], 12 bytes cmem[14]

This means my kernel uses:

  • 23 registers

  • 24 bytes of local memory

  • 10800 bytes of shared memory, though 10804 bytes have been allocated

  • 244 (132+100+12) bytes of constant memory

But suppose a __global__ function calls 4 device functions. If I use a shared-memory array inside the first device function, can I reuse that shared memory inside the second device function (and so on) after the first one has finished executing?

As we know, device functions are inlined. So how does one efficiently use shared memory across different device functions?

I think someone already answered this. While some threads in a block are executing the device_1 function, other threads may be executing device_2/device_3/device_4, precluding any possibility of sharing the shared memory.

However, introducing a __syncthreads() can prevent this overlap. But I am not sure what the compiler does.

That's why we are advising you to check the cubin (or the ptxas -v output) to see how much shared memory your kernel occupies with and without __syncthreads() between all the device functions. If you find something, please update us.

My guess is that the shared memory will simply be ADDed, no matter what.

I am sure this was discussed before, but I don't remember the results of that discussion.

I believe shared memory inside a device function is declared similarly to a 'static' variable in a normal C function: if you call the same function several times, you reuse the same variable. However, I think this variable cannot overlap with a variable from a different function.

I have to check it, though!

PDan appears to be right. (Read on…)

Here is a small test I wrote:

__device__ void clear(float *c, int n)
{
	__shared__ float hello[512];
	int i;

	for (i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x)
	{
		hello[threadIdx.x] = c[i];
		hello[threadIdx.x] += i;
		c[i] = hello[threadIdx.x];
	}
}

__global__ void doSomething(float *c, int n)
{
	clear(c, n);
	clear(c, n);
	clear(c, n);
	clear(c, n);
	clear(c, n);
}

The kernel is launched with 512 threads per block. The amount of shared memory was always 2072 bytes (reported as 2072+24), irrespective of how many times I call this "clear".

Even without __syncthreads() it behaves this way.

This is clearly dangerous and possibly a bug. I use CUDA 2.2.

–edit–

However, when I added another device function, the amount of shared memory doubled.

So, the moral of the story is:

Shared memory is NOT shared across different device functions.

However, multiple instantiations of the same device function do share the same memory.

So one needs a __syncthreads() between the device function calls to avoid the instantiations stepping on each other.

However, there might be occasions when one wants to call device functions inside CONDITIONALs (where a barrier is unsafe).

So it is up to the programmer to use the different instantiations of device functions judiciously.
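To illustrate with a made-up example (not your code), the barrier between two instantiations of the same device function looks like this:

```cuda
__device__ void step(float *c, int n)
{
    __shared__ float buf[512];               // every call site shares this one buffer
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = c[i] + 1.0f;
        c[i] = buf[threadIdx.x];
    }
}

__global__ void twoSteps(float *c, int n)
{
    step(c, n);
    __syncthreads();   // all threads leave the first instantiation before buf is reused
    step(c, n);
}
```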

Hello Sarnath,

These points clear up more of my earlier confusion about using shared memory; thanks for that.

In your previous post you wrote:

So I am listing my shared memory occupancies:

How did you get 36944 bytes of shared memory when your 4 individual functions each use around 2K? You should get about 8K, shouldn't you?

Do you understand how that 36944 comes about? If you understand it, fine.

What I actually need is a little different from your code:

__device__ void clear_1(float *c, int n)
{
	__shared__ float hello[512];
	int i;

	for (i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x)
	{
		hello[threadIdx.x] = c[i];
		hello[threadIdx.x] += i;
		c[i] = hello[threadIdx.x];
	}
}

__device__ void clear_2(int *c, int n)
{
	__shared__ int hello[512];
	int i;

	for (i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += blockDim.x*gridDim.x)
	{
		hello[threadIdx.x] = c[i];
		hello[threadIdx.x] /= i;	// some manipulation different from clear_1()
		c[i] = hello[threadIdx.x];
	}
}

__global__ void doSomething(float *c, int *e, int n)
{
	clear_1(c, n);
	clear_2(e, n);
}

Actually, I want to know: is Total shared memory(doSomething()) == Total shared memory(clear_1()) + Total shared memory(clear_2())?

From what I understand, you already mentioned that:

If this is the actual picture, then how does one handle this shortage of shared memory?

Actually, one of my device functions that takes a lot of shared memory is:

#define coeff 512

__device__ void device_function(unsigned int Array[8][8],
		signed short int a0, signed short int a1, signed short int a2, signed short int a3,
		signed short int a4, signed short int a5, signed short int a6, signed short int a7,
		int *Out_array)
{
	Out_array[0] = (__mul24(a0, Array[0][0]) + __mul24(a1, Array[0][1]) + __mul24(a2, Array[0][2]) + __mul24(a3, Array[0][3]) +
			__mul24(a4, Array[0][4]) + __mul24(a5, Array[0][5]) + __mul24(a6, Array[0][6]) + __mul24(a7, Array[0][7]) + coeff);
	Out_array[1] = (__mul24(a0, Array[1][0]) + __mul24(a1, Array[1][1]) + __mul24(a2, Array[1][2]) + __mul24(a3, Array[1][3]) +
			__mul24(a4, Array[1][4]) + __mul24(a5, Array[1][5]) + __mul24(a6, Array[1][6]) + __mul24(a7, Array[1][7]) + coeff);
	Out_array[2] = (__mul24(a0, Array[2][0]) + __mul24(a1, Array[2][1]) + __mul24(a2, Array[2][2]) + __mul24(a3, Array[2][3]) +
			__mul24(a4, Array[2][4]) + __mul24(a5, Array[2][5]) + __mul24(a6, Array[2][6]) + __mul24(a7, Array[2][7]) + coeff);
	Out_array[3] = (__mul24(a0, Array[3][0]) + __mul24(a1, Array[3][1]) + __mul24(a2, Array[3][2]) + __mul24(a3, Array[3][3]) +
			__mul24(a4, Array[3][4]) + __mul24(a5, Array[3][5]) + __mul24(a6, Array[3][6]) + __mul24(a7, Array[3][7]) + coeff);
	Out_array[4] = (__mul24(a0, Array[4][0]) + __mul24(a1, Array[4][1]) + __mul24(a2, Array[4][2]) + __mul24(a3, Array[4][3]) +
			__mul24(a4, Array[4][4]) + __mul24(a5, Array[4][5]) + __mul24(a6, Array[4][6]) + __mul24(a7, Array[4][7]) + coeff);
	Out_array[5] = (__mul24(a0, Array[5][0]) + __mul24(a1, Array[5][1]) + __mul24(a2, Array[5][2]) + __mul24(a3, Array[5][3]) +
			__mul24(a4, Array[5][4]) + __mul24(a5, Array[5][5]) + __mul24(a6, Array[5][6]) + __mul24(a7, Array[5][7]) + coeff);
	Out_array[6] = (__mul24(a0, Array[6][0]) + __mul24(a1, Array[6][1]) + __mul24(a2, Array[6][2]) + __mul24(a3, Array[6][3]) +
			__mul24(a4, Array[6][4]) + __mul24(a5, Array[6][5]) + __mul24(a6, Array[6][6]) + __mul24(a7, Array[6][7]) + coeff);
	Out_array[7] = (__mul24(a0, Array[7][0]) + __mul24(a1, Array[7][1]) + __mul24(a2, Array[7][2]) + __mul24(a3, Array[7][3]) +
			__mul24(a4, Array[7][4]) + __mul24(a5, Array[7][5]) + __mul24(a6, Array[7][6]) + __mul24(a7, Array[7][7]) + coeff);
}

If I take 256 threads per block, then each thread requires 64 elements for Array plus 8 elements for Out_array (I want to keep both arrays in shared memory).

So the total shared memory required per block to execute this device function is (64 + 8) elements * 4 bytes (the element type is int) * 256 threads = 73728 bytes. Even with 64 threads per block, this function still requires 18432 bytes of shared memory, which exceeds the 16 KB available.

The moral is:

How to handle this problem?


I am just bewildered by this code.

Why can't you just put it in some kind of FOR loop with a neat expression?

Readability is the single most important and desired property of a programmer's work.

Also, if you are short of shared memory because the smem of the device functions adds up, declare one shared array in your kernel and pass a pointer to it as an argument to your device functions.

The compiler should be smart enough to inline it correctly (although there are some compiler quirks; I don't want to confuse you now. You will know when you hit the advisory warning).
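A minimal sketch of that suggestion (hypothetical names; one shared array declared in the kernel and borrowed by the device function through a pointer argument):

```cuda
__device__ void stage_and_scale(float *buf, const float *src, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = src[i] * s;           // works on whatever buffer the caller passes
}

__global__ void kernel_shared_arg(const float *a, const float *b, float *out)
{
    __shared__ float scratch[256];           // the only shared allocation in the kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    stage_and_scale(scratch, a, 2.0f);       // first device-function call uses scratch
    __syncthreads();
    out[i] = scratch[threadIdx.x];

    __syncthreads();                         // everyone finishes reading before reuse
    stage_and_scale(scratch, b, 0.5f);       // second call reuses the same memory
    __syncthreads();
    out[i] += scratch[threadIdx.x];
}
```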

I am trying that; I will report the results of that implementation in a few moments…

I have implemented my device function in two ways.

FIRST way:

extern __shared__ char sh_array[];

__device__ void device_function(unsigned int Array[8],
		signed short int a0, signed short int a1, signed short int a2, signed short int a3,
		signed short int a4, signed short int a5, signed short int a6, signed short int a7,
		int *Out_array)
{
	int tid = threadIdx.x;
	unsigned char *RA_array = (unsigned char *)sh_array;

	for (int i = 0; i < 8; ++i)
	{
		RA_array[8*tid+i] = Array[i];
	}

	(*Out_array) = (__mul24(a0, RA_array[8*tid+0]) + __mul24(a1, RA_array[8*tid+1]) + __mul24(a2, RA_array[8*tid+2]) + __mul24(a3, RA_array[8*tid+3]) +
			__mul24(a4, RA_array[8*tid+4]) + __mul24(a5, RA_array[8*tid+5]) + __mul24(a6, RA_array[8*tid+6]) + __mul24(a7, RA_array[8*tid+7]) + coeff);
}

But this does not give any improvement in execution time. Also, the output is corrupted.

SECOND way, the same device function:

__device__ void device_function(unsigned int Array[8],
		signed short int a0, signed short int a1, signed short int a2, signed short int a3,
		signed short int a4, signed short int a5, signed short int a6, signed short int a7,
		int *Out_array, int tid)	// tid is threadIdx.x
{
	__shared__ unsigned char RA_array[256*8];	// 256 is the number of threads per block
	__shared__ signed short int RA_Out_array[256];

	for (int i = 0; i < 8; ++i)
	{
		RA_array[8*tid+i] = Array[i];
	}

	RA_Out_array[tid] = (__mul24(a0, RA_array[8*tid+0]) + __mul24(a1, RA_array[8*tid+1]) + __mul24(a2, RA_array[8*tid+2]) + __mul24(a3, RA_array[8*tid+3]) +
			__mul24(a4, RA_array[8*tid+4]) + __mul24(a5, RA_array[8*tid+5]) + __mul24(a6, RA_array[8*tid+6]) + __mul24(a7, RA_array[8*tid+7]) + coeff);

	(*Out_array) = RA_Out_array[tid];
}

This gives correct output, but the execution time is the same as before.

I cannot see any improvement in execution time from using shared memory here. WHY?

Please help with the above problem. :mellow: