How to reduce Local Memory Usage.

Hi All,

I am trying to port a Molecular Dynamics Application called Nucleic Acid Builder (NAB) onto the GPU.

After profiling, I found that my code uses a lot of local memory, even though I am not using it explicitly anywhere.
Compiling with --ptxas-options=-v showed that the number of registers used is only 7, but the local memory used is 736 bytes.

Is there any limit on the number of registers per thread?

My device functions have many local variables; can this be a reason for such high local memory usage?

Please give me some leads; my code is presently slower than the CPU version.

Any arrays?
If you declare an array like float array[N] inside the kernel, it most likely gets put into local memory, because registers cannot be indexed with values that are only known at run time.
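Something like this (an illustrative snippet, not from the NAB code) will usually land in local memory:

__global__ void spill_example(float *out)
{
    // A per-thread array indexed with a runtime value cannot stay in
    // registers, so the compiler typically places it in local memory.
    float scratch[32];
    for (int i = 0; i < 32; i++)
        scratch[i] = i * 0.5f;

    int j = threadIdx.x % 32;   // index not known at compile time
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[j];
}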

Some trigonometric functions also require local memory if you don't use the intrinsic versions, but 736 bytes sounds like a lot.
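For example (again a made-up snippet, not your kernel), the single-precision intrinsics trade some accuracy for speed and avoid the heavier library code paths:

__global__ void trig_example(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // sinf()/cosf() use the full-precision library implementation;
    // __sinf()/__cosf() map to fast hardware approximations.
    out[i] = __sinf(in[i]) * __cosf(in[i]);
}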

Yes, in general you should try to avoid arrays, since they will end up in local memory (which physically lives in the slow global DRAM). Check instead whether you can place those values in shared memory.

Often you start your kernel by staging as much data as you need into shared memory. You then do all the work there and write the results back to global memory.
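Schematically it looks like this (placeholder names, not your kernel):

__global__ void stage_in_shared(const float *g_in, float *g_out, int n)
{
    extern __shared__ float s_data[];               // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Stage the data this block needs into shared memory.
    s_data[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // 2. Work out of shared memory (here: a trivial neighbour sum).
    int next = (threadIdx.x + 1) % blockDim.x;
    float result = s_data[threadIdx.x] + s_data[next];

    // 3. Write the result back to global memory.
    if (i < n) g_out[i] = result;
}

Launched with the third <<<>>> argument set to blockDim.x * sizeof(float) so the dynamic shared array has one float per thread.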

There are, however, 16K 32-bit registers per SM that often go partly unused. Have a look at this thread that I posted a few days ago: http://forums.nvidia.com/index.php?showtopic=150766

In general, though, you don't need such ugly register tricks; you can get along fine just using shared memory.

//jimmy

No, I am not declaring any arrays in the kernel. I declare around 20 local variables in each of 2 device functions, and I have been using the pow() function. Can these account for the local memory usage?

How resources add up when using device functions is pretty much undocumented... at least the shared memory part. Not sure if something changed in the latest documentation, though...

I have heard that putting the “volatile” keyword in front of local variables helps. You may want to try it.

I think the maximum number of registers per thread is 128 (someone correct me here if needed).

You can use --maxrregcount=128 to set it to this value.
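For example (file names made up, pick whatever register cap you want to test):

nvcc --ptxas-options=-v --maxrregcount=32 -o nab nab_kernels.cu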

Anyway, I'm not sure what card you are running, but it likely has 16384 32-bit registers per SM. Maybe (registers per thread) * (threads per block) exceeds this number in your kernel?
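To put purely illustrative numbers on that: 40 registers/thread * 512 threads/block = 20480 registers, which is more than 16384, whereas 7 registers/thread * 64 threads/block = 448 registers, nowhere near the limit.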

Try decreasing the number of threads in each block and see if the local memory issue disappears. If it's gone, you will see "normal" register-per-thread usage in the ptxas output...

I am using a GTX 280.

I have two kernels in my code. For the 1st one, 7 registers are used and 140 bytes of local memory. It has 11 local variables and no arrays. Can anyone suggest why the register count is so low?

For the 2nd kernel, 13 registers are used and 208 bytes of local memory. It has 24 variables and no arrays.

What does register usage depend on?

Can I explicitly put some variables into registers by using a keyword like “register”?

How can I remove local memory usage?

Please help.

I think at this point the code would help, if you can supply it.

I think at this point you are using an excessively large number of threads... or you are using some built-in function that is eating up local memory...

Post your code, or comment out the built-in functions and other portions of your code to see what is causing the local memory spills...

Did you “try decreasing the number of threads in each block and see if the local memory issue disappears”?

What is happening is that you are exceeding the number of registers available per SM; the excess values are then placed in local memory.

Code would be helpful, both the kernel and how your grid and blocks are dimensioned...

__global__ void nblist_hcp_dev(int *Ipcomplex, int *Ipstrand, int *Ipres, REAL_T *x, REAL_T *x_hcp1, REAL_T *x_hcp2, REAL_T *x_hcp3, int *temp_hcp0, int *temp_hcp1, int *temp_hcp2, int *temp_hcp3, int *temp_hcp0_n, int *temp_hcp1_n, int *temp_hcp2_n, int *temp_hcp3_n, int a)
{
    int c1, s1, r1, a1, s1_from, s1_to, r1_from, r1_to, a1_from, a1_to;
    REAL_T dist2;
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    extern __shared__ float4 x_cord[];

    a = a + __mul24(blockIdx.x, __mul24(blockDim.x, blockDim.y)) + __mul24(blockDim.x, threadIdx.y) + threadIdx.x;

    if (a >= Natom)
        return;

    x_cord[threadIdx.x].x = x[3*a+0];
    x_cord[threadIdx.x].y = x[3*a+1];
    x_cord[threadIdx.x].z = x[3*a+2];
    x_cord[threadIdx.x].w = 0.0;
    __syncthreads();

    for (c1 = 0; c1 < Ncomplex; c1++) /* for each complex */
    {
        dist2 = 0.0;
        dist2 = calc_dist2_1(x_cord[threadIdx.x].x, x_cord[threadIdx.x].y, x_cord[threadIdx.x].z, x_hcp3[3*c1+0], x_hcp3[3*c1+1], x_hcp3[3*c1+2]);

        if (dist2 > dist2_hcp3)
        {
            temp_hcp3[index*Ncomplex+temp_hcp3_n[index]] = c1;
            (temp_hcp3_n[index])++;
        }
        else
        {
            s1_from = Ipcomplex[c1] - 1;
            if (c1 < Ncomplex - 1) { s1_to = Ipcomplex[c1 + 1] - 1; }
            else { s1_to = Nstrand; }
            for (s1 = s1_from; s1 < s1_to; s1++) /* for each other strand */
            {
                dist2 = 0;
                dist2 = calc_dist2_1(x_cord[threadIdx.x].x, x_cord[threadIdx.x].y, x_cord[threadIdx.x].z, x_hcp2[3*s1+0], x_hcp2[3*s1+1], x_hcp2[3*s1+2]);

                if (dist2 > dist2_hcp2)
                {
                    temp_hcp2[index*Nstrand+temp_hcp2_n[index]] = s1;
                    (temp_hcp2_n[index])++;
                }
                else
                {
                    r1_from = Ipstrand[s1] - 1;
                    if (s1 < Nstrand - 1) { r1_to = Ipstrand[s1 + 1] - 1; }
                    else { r1_to = Nres; }
                    for (r1 = r1_from; r1 < r1_to; r1++) /* for each other residue */
                    {
                        dist2 = 0;
                        dist2 = calc_dist2_1(x_cord[threadIdx.x].x, x_cord[threadIdx.x].y, x_cord[threadIdx.x].z, x_hcp1[3*r1+0], x_hcp1[3*r1+1], x_hcp1[3*r1+2]);

                        if (dist2 > dist2_hcp1)
                        {
                            temp_hcp1[index*Nres+temp_hcp1_n[index]] = r1;
                            (temp_hcp1_n[index])++;
                        }
                        else
                        {
                            a1_from = Ipres[r1] - 1;
                            if (r1 < Nres - 1) { a1_to = Ipres[r1 + 1] - 1; }
                            else { a1_to = Natom; }
                            for (a1 = a1_from; a1 < a1_to; a1++) /* for each atom */
                            {
                                if (a != a1)
                                {
                                    temp_hcp0[index*Natom+temp_hcp0_n[index]] = a1;
                                    (temp_hcp0_n[index])++;
                                }  /* not same atom */
                            }  /* end for other atoms */
                        }  /* end else residue inside threshold dist */
                    }  /* end for other residues */
                }  /* end else strand inside threshold dist */
            }  /* end for other strand */
        } /* end else complex inside threshold dist */
    } /* end for other complex */
}

This is the 1st kernel. It's using 7 registers and 140 bytes of lmem. I am using just 64 threads per block (1D) and 120 blocks per grid (1D).
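For reference, the launch looks roughly like this (a sketch; the host code isn't posted, so the pointer names and the last argument are placeholders). Because the kernel declares "extern __shared__ float4 x_cord[]", the dynamic shared memory size has to be passed as the third launch parameter:

dim3 block(64);
dim3 grid(120);
size_t smem = block.x * sizeof(float4);   // one float4 per thread = 1024 bytes

nblist_hcp_dev<<<grid, block, smem>>>(d_Ipcomplex, d_Ipstrand, d_Ipres, d_x,
                                      d_x_hcp1, d_x_hcp2, d_x_hcp3,
                                      d_temp_hcp0, d_temp_hcp1, d_temp_hcp2, d_temp_hcp3,
                                      d_temp_hcp0_n, d_temp_hcp1_n, d_temp_hcp2_n, d_temp_hcp3_n,
                                      atom_offset);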

The 2nd kernel is quite big. I guess if I can somehow manage to reduce lmem usage in this case, then I would be able to do it for the other one also.

Thanks a lot for all your efforts.

I commented out the device function “calc_dist2_1” and defined the following quantities:

typedef float REAL_T;

#define Natom      15
#define Ncomplex   20
#define dist2_hcp1 1.7
#define dist2_hcp3 1.5
#define dist2_hcp2 1.6
#define Nstrand    20
#define Nres       15

Then the resource usage is:

lmem = 0
smem = 144
reg  = 15
bar  = 1
const {
    segname = const
    segnum  = 1
    offset  = 0
    bytes   = 36
}

I cannot define macros for these values like you have done, because they depend on the input structure and a separate part of the program calculates them.

And it does not answer my question of what is actually getting stored in local memory. If I knew that, then possibly I could try to reduce its usage.

Yes, you can pass these values as function parameters, and they should not affect local memory usage.

However, your posted code does not include the device function “calc_dist2_1” or the values above, which is why I commented out “calc_dist2_1” and defined macros for them.

Could you show the content of “calc_dist2_1”? I think it is the key to this problem.

Here is calc_dist2_1()

__device__ REAL_T calc_dist2_1(REAL_T xi, REAL_T yi, REAL_T zi, REAL_T xj, REAL_T yj, REAL_T zj)
{
    REAL_T dist2, xij, yij, zij;

    xij = xi - xj;
    yij = yi - yj;
    zij = zi - zj;
    dist2 = xij * xij + yij * yij + zij * zij;
    return(dist2);
}

I tried removing this function and it actually increased my local memory usage.

Can anyone please tell me what the possible reasons for this local memory usage might be?

Found the problem. I was using some debugging flags with nvcc, and they were the root of all evil.

Thanks a lot guys. Really appreciate it.

Can you post the names of the flags, please, so that we can be more careful? Thanks!

They were the standard flags -g and -G.
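For anyone who hits the same thing: -G enables device-side debugging and disables most device code optimization, so values that would normally sit in registers are kept in local memory instead. Roughly (illustrative command lines, file names made up):

# debug build: device optimization off, expect lmem in the ptxas report
nvcc -g -G --ptxas-options=-v nab_kernels.cu -o nab_dbg

# normal build of the same source: lmem should drop back down
nvcc --ptxas-options=-v nab_kernels.cu -o nab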

This behavior is new to me... I'd be interested if anyone has more information on this.