How to reduce Local Memory Usage.

Hi All,

I am trying to port a Molecular Dynamics Application called Nucleic Acid Builder (NAB) onto the GPU.

After profiling, I found that my code uses a lot of local memory, even though I am not using it explicitly anywhere.
Compiling with --ptxas-options=-v showed that the number of registers used is only 7, but the local memory used is 736 bytes.

Is there any limit on the number of registers per thread?

My device functions have many local variables; can this be a reason for such high local memory usage?

Please give me some leads; my code is presently slower than the CPU version.

Any arrays?
If you declare an array like float array[N] inside the kernel, it most likely gets put into local memory, because registers cannot be indexed with values that are only known at run time.
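Something like this (an illustrative snippet, not from the NAB code) will usually land in local memory:

__global__ void spill_example(float *out)
{
    // A per-thread array indexed with a runtime value cannot stay in
    // registers, so the compiler typically places it in local memory.
    float scratch[32];
    for (int i = 0; i < 32; i++)
        scratch[i] = i * 0.5f;

    int j = threadIdx.x % 32;   // index not known at compile time
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[j];
}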

Some trigonometric functions also require local memory if you don't use the intrinsic versions, but 736 bytes sounds like a lot.
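For example (again a made-up snippet, not your kernel), the single-precision intrinsics trade some accuracy for speed and avoid the heavier library code paths:

__global__ void trig_example(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // sinf()/cosf() use the full-precision library implementation;
    // __sinf()/__cosf() map to fast hardware approximations.
    out[i] = __sinf(in[i]) * __cosf(in[i]);
}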

Yes, in general you should try to avoid arrays, since they will end up in local memory (which physically lives in the slow global DRAM). Check instead whether you can place those values in shared memory.

Often you start your kernel by staging as much data as you need into shared memory. You then do all the work there and write the results back to global memory.
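Schematically it looks like this (placeholder names, not your kernel):

__global__ void stage_in_shared(const float *g_in, float *g_out, int n)
{
    extern __shared__ float s_data[];               // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Stage the data this block needs into shared memory.
    s_data[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // 2. Work out of shared memory (here: a trivial neighbour sum).
    int next = (threadIdx.x + 1) % blockDim.x;
    float result = s_data[threadIdx.x] + s_data[next];

    // 3. Write the result back to global memory.
    if (i < n) g_out[i] = result;
}

Launched with the third <<<>>> argument set to blockDim.x * sizeof(float) so the dynamic shared array has one float per thread.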

There are, however, 16K 32-bit registers per SM that often go partly unused. Have a look at this thread that I posted a few days ago: http://forums.nvidia.com/index.php?showtopic=150766

In general, though, you don't need such ugly register tricks; you can get along fine just using shared memory.

//jimmy

No, I am not declaring any arrays in the kernel. I declare around 20 local variables in each of 2 device functions, and I have been using the pow() function. Can these account for the local memory usage?

How resources add up when using device functions is pretty much undocumented... at least the shared memory part. Not sure if something changed in the latest documentation, though...

I have heard that putting the “volatile” keyword in front of local variables helps. You may want to try it.

I think the maximum number of registers per thread is 128 (someone correct me here if needed).

You can use --maxrregcount=128 to set it to this value.
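For example (file names made up, pick whatever register cap you want to test):

nvcc --ptxas-options=-v --maxrregcount=32 -o nab nab_kernels.cu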

Anyway, I'm not sure what card you are running, but it likely has 16384 32-bit registers per SM. Maybe (registers per thread) * (threads per block) exceeds this number in your kernel?
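To put purely illustrative numbers on that: 40 registers/thread * 512 threads/block = 20480 registers, which is more than 16384, whereas 7 registers/thread * 64 threads/block = 448 registers, nowhere near the limit.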

Try decreasing the number of threads in each block and see if the local memory issue disappears. If it's gone, you will see "normal" register-per-thread usage in the ptxas output...

I am using a GTX 280.

I have two kernels in my code. For the 1st one, 7 registers are used and 140 bytes of local memory. It has 11 local variables and no arrays. Can anyone suggest why the register count is so low?

For the 2nd kernel, 13 registers are used and 208 bytes of local memory. It has 24 variables and no arrays.

What does register usage depend on?

Can I explicitly put some variables into registers by using a keyword like “register”?

How can I remove local memory usage?

Please help.

I think at this point the code would help, if you can supply it.

I think at this point you are using an excessively large number of threads... or you are using some built-in function that is eating up local memory...

Post your code, or comment out the built-in functions and other portions of your code to see what is causing the local memory spills...

Did you “try decreasing the number of threads in each block and see if the local memory issue disappears”?

What is happening is that you are exceeding the number of registers available per SM; the excess values are then placed in local memory.

Code would be helpful, both the kernel and how your grid and blocks are dimensioned...

__global__ void nblist_hcp_dev(int *Ipcomplex, int *Ipstrand, int *Ipres, REAL_T *x, REAL_T *x_hcp1, REAL_T *x_hcp2, REAL_T *x_hcp3, int *temp_hcp0, int *temp_hcp1, int *temp_hcp2, int *temp_hcp3, int *temp_hcp0_n, int *temp_hcp1_n, int *temp_hcp2_n, int *temp_hcp3_n, int a)
{
    int c1, s1, r1, a1, s1_from, s1_to, r1_from, r1_to, a1_from, a1_to;
    REAL_T dist2;
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    extern __shared__ float4 x_cord[];

    a = a + __mul24(blockIdx.x, __mul24(blockDim.x, blockDim.y)) + __mul24(blockDim.x, threadIdx.y) + threadIdx.x;

    if (a >= Natom)
        return;

    x_cord[threadIdx.x].x = x[3*a+0];
    x_cord[threadIdx.x].y = x[3*a+1];
    x_cord[threadIdx.x].z = x[3*a+2];
    x_cord[threadIdx.x].w = 0.0;
    __syncthreads();

    for (c1 = 0; c1 < Ncomplex; c1++) /* for each complex */
    {
        dist2 = 0.0;
        dist2 = calc_dist2_1(x_cord[threadIdx.x].x, x_cord[threadIdx.x].y, x_cord[threadIdx.x].z, x_hcp3[3*c1+0], x_hcp3[3*c1+1], x_hcp3[3*c1+2]);

        if (dist2 > dist2_hcp3)
        {
            temp_hcp3[index*Ncomplex+temp_hcp3_n[index]] = c1;
            (temp_hcp3_n[index])++;
        }
        else
        {
            s1_from = Ipcomplex[c1] - 1;
            if (c1 < Ncomplex - 1) { s1_to = Ipcomplex[c1 + 1] - 1; }
            else { s1_to = Nstrand; }
            for (s1 = s1_from; s1 < s1_to; s1++) /* for each other strand */
            {
                dist2 = 0;
                dist2 = calc_dist2_1(x_cord[threadIdx.x].x, x_cord[threadIdx.x].y, x_cord[threadIdx.x].z, x_hcp2[3*s1+0], x_hcp2[3*s1+1], x_hcp2[3*s1+2]);

                if (dist2 > dist2_hcp2)
                {
                    temp_hcp2[index*Nstrand+temp_hcp2_n[index]] = s1;
                    (temp_hcp2_n[index])++;
                }
                else
                {
                    r1_from = Ipstrand[s1] - 1;
                    if (s1 < Nstrand - 1) { r1_to = Ipstrand[s1 + 1] - 1; }
                    else { r1_to = Nres; }
                    for (r1 = r1_from; r1 < r1_to; r1++) /* for each other residue */
                    {
                        dist2 = 0;
                        dist2 = calc_dist2_1(x_cord[threadIdx.x].x, x_cord[threadIdx.x].y, x_cord[threadIdx.x].z, x_hcp1[3*r1+0], x_hcp1[3*r1+1], x_hcp1[3*r1+2]);

                        if (dist2 > dist2_hcp1)
                        {
                            temp_hcp1[index*Nres+temp_hcp1_n[index]] = r1;
                            (temp_hcp1_n[index])++;
                        }
                        else
                        {
                            a1_from = Ipres[r1] - 1;
                            if (r1 < Nres - 1) { a1_to = Ipres[r1 + 1] - 1; }
                            else { a1_to = Natom; }
                            for (a1 = a1_from; a1 < a1_to; a1++) /* for each atom */
                            {
                                if (a != a1)
                                {
                                    temp_hcp0[index*Natom+temp_hcp0_n[index]] = a1;
                                    (temp_hcp0_n[index])++;
                                }  /* not same atom */
                            }  /* end for other atoms */
                        }  /* end else residue inside threshold dist */
                    }  /* end for other residues */
                }  /* end else strand inside threshold dist */
            }  /* end for other strand */
        } /* end else complex inside threshold dist */
    } /* end for other complex */
}

This is the 1st kernel. It's using 7 registers and 140 bytes of lmem. I am using just 64 threads per block (1D) and 120 blocks per grid (1D).
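For reference, the launch looks roughly like this (a sketch; the host code isn't posted, so the pointer names and the last argument are placeholders). Because the kernel declares "extern __shared__ float4 x_cord[]", the dynamic shared memory size has to be passed as the third launch parameter:

dim3 block(64);
dim3 grid(120);
size_t smem = block.x * sizeof(float4);   // one float4 per thread = 1024 bytes

nblist_hcp_dev<<<grid, block, smem>>>(d_Ipcomplex, d_Ipstrand, d_Ipres, d_x,
                                      d_x_hcp1, d_x_hcp2, d_x_hcp3,
                                      d_temp_hcp0, d_temp_hcp1, d_temp_hcp2, d_temp_hcp3,
                                      d_temp_hcp0_n, d_temp_hcp1_n, d_temp_hcp2_n, d_temp_hcp3_n,
                                      atom_offset);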

The 2nd kernel is quite big. I guess if I can somehow manage to reduce lmem usage in this case, then I would be able to do it for the other one also.

Thanks a lot for all your efforts.

I commented out the device function “calc_dist2_1” and defined the following quantities:

typedef float REAL_T;

#define Natom      15
#define Ncomplex   20
#define dist2_hcp1 1.7
#define dist2_hcp3 1.5
#define dist2_hcp2 1.6
#define Nstrand    20
#define Nres       15

Then the resource usage is:

lmem = 0
smem = 144
reg  = 15
bar  = 1
const {
    segname = const
    segnum  = 1
    offset  = 0
    bytes   = 36
}

I cannot define macros for these values like you have done, because they depend on the input structure and a separate part of the program calculates them.

And it does not answer my question of what is actually getting stored in local memory. If I knew that, then possibly I could try to reduce its usage.

Yes, you can pass these values as function parameters, and they should not affect local memory usage.

However, your posted code does not include the device function “calc_dist2_1” or the values above, which is why I commented out “calc_dist2_1” and defined macros for them.

Could you show the content of “calc_dist2_1”? I think it is the key to this problem.

Here is calc_dist2_1()

__device__ REAL_T calc_dist2_1(REAL_T xi, REAL_T yi, REAL_T zi, REAL_T xj, REAL_T yj, REAL_T zj)
{
    REAL_T dist2, xij, yij, zij;

    xij = xi - xj;
    yij = yi - yj;
    zij = zi - zj;
    dist2 = xij * xij + yij * yij + zij * zij;
    return(dist2);
}

I tried removing this function and it actually increased my local memory usage.

Can anyone please tell me what the possible reasons for this local memory usage might be?

Found the problem. I was using some debugging flags with nvcc, and they were the root of all evil.

Thanks a lot guys. Really appreciate it.

Can you post the names of the flags, please, so that we can be more careful? Thanks!

They were the standard flags -g and -G.
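For anyone who hits the same thing: -G enables device-side debugging and disables most device code optimization, so values that would normally sit in registers are kept in local memory instead. Roughly (illustrative command lines, file names made up):

# debug build: device optimization off, expect lmem in the ptxas report
nvcc -g -G --ptxas-options=-v nab_kernels.cu -o nab_dbg

# normal build of the same source: lmem should drop back down
nvcc --ptxas-options=-v nab_kernels.cu -o nab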

This behavior is new to me... I'd be interested if anyone has more information on this.