Local variables and registers

  1. How do I determine whether local variables are stored in registers?
  2. Is register memory the same thing as shared memory?
  3. Are local variables of type int2, float3, float4, etc. stored in registers/shared memory?
  4. If a kernel function calls another device function, is it inline, and where are the parameters and local variables of that device function stored?
  5. Does the compiler/thread scheduler favor storing local variables in global memory to increase occupancy?

Thanks in advance!

Look at the PTX code the compiler emits.

No. The register file and shared memory are separate areas of silicon, and their function is different. All explained in the programming guide.

Only in shared memory if you used a __shared__ declaration for the variable. Also explained in the programming guide.
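For instance, in a sketch like this (kernel and variable names made up), only the explicitly declared array lives in shared memory; the float4 local is treated like any other automatic variable:

__global__ void shared_vs_local(const float4 *in, float4 *out)
{
	__shared__ float4 tile[64];      // in shared memory, purely because of __shared__
	unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float4 v = in[tid];              // ordinary local: registers (or local memory if spilled)
	tile[threadIdx.x] = v;           // assumes blockDim.x == 64
	__syncthreads();
	out[tid] = tile[blockDim.x - 1 - threadIdx.x];
}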

By definition, there are no explicit arguments in an inline-expanded function - it works much more like macro substitution. Local variables go to local memory or registers, just like in any other piece of kernel code.
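As a rough sketch (names made up): the compiler expands the __device__ call below into the kernel body, so its parameter and return value are just register values, exactly as if the arithmetic had been written inline:

__device__ float scale_and_shift(float a)
{
	return 2.0f*a + 1.0f;
}

__global__ void apply(float *data)
{
	unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
	data[tid] = scale_and_shift(data[tid]);   // expanded inline: no stack frame, no argument passing
}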

The compiler never stores kernel local variables in global memory. Everything goes either in local memory or registers.
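One contrived sketch of the distinction: a per-thread array indexed with a runtime value is a typical case the compiler cannot keep in registers, so it usually lands in local memory, while scalar locals stay in registers:

__global__ void lmem_example(const float *in, float *out, int n)
{
	unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float scratch[32];               // dynamically indexed below, so usually placed in local memory
	float sum = 0.0f;                // scalar local: a register
	for (int i = 0; i < 32; ++i)
		scratch[i] = in[tid*32 + i];
	for (int i = 0; i < n && i < 32; ++i)   // runtime-dependent indexing defeats register promotion
		sum += scratch[i];
	out[tid] = sum;
}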

  1. How do I determine whether local variables are stored in registers?
    When you are using ‘nvcc’ to compile your code, just pass this option: ‘-Xptxas -v’. This will print how many registers each kernel uses, along with its local memory usage (if any) and its shared and constant memory usage.
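For example (the file name, kernel name and numbers below are purely illustrative of the output format):

nvcc -arch=sm_13 -Xptxas -v -c mykernel.cu

ptxas info	: Compiling entry function '_Z8mykernelPf'
ptxas info	: Used 12 registers, 0+0 bytes lmem, 28+16 bytes smem, 4 bytes cmem[1]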

Ok, that answers the first part of that question: shared memory is allocated only via the __shared__ keyword.

What about register memory and built-in vector types? This is actually what I was hoping to understand.

The programming guide basically states that the compiler decides what to do with local variables: either register or local memory.

My question (which the documentation apparently does not answer) is:

Does the compiler always assign built-in vector types to local memory instead of register memory (due to their large size)?

If I’m reading the ptxas output and the .ptx file correctly, the answer is no. Could you confirm?

I’m pretty sure they stay in registers (until you run out of course). You might write a simple kernel with float4s and inspect the ptx or the cubin file.

It will try to, unless you use too many registers (there is a limit of 127 per thread), or unless some other action forces the kernel to spill to local memory. For example, this:

__global__ void demo(const float4 *in, float4 *out)
{
	unsigned int tidx = threadIdx.x + blockIdx.x*blockDim.x;
	const float4 inc = {1.,2.,5.,7.};
	float4 val;

	val = in[tidx];
	val.x += inc.x;
	val.y += inc.y;
	val.z += inc.z;
	val.w += inc.w;
	out[tidx] = val;
}

compiles to this:

.entry _Z4demoPK6float4PS_ (
		.param .u64 __cudaparm__Z4demoPK6float4PS__in,
		.param .u64 __cudaparm__Z4demoPK6float4PS__out)
	{
	.reg .u16 %rh<4>;
	.reg .u32 %r<5>;
	.reg .u64 %rd<8>;
	.reg .f32 %f<14>;
	.loc	28	2	0
$LBB1__Z4demoPK6float4PS_:
	.loc	28	11	0
	mov.u16 	%rh1, %ctaid.x;
	mov.u16 	%rh2, %ntid.x;
	mul.wide.u16 	%r1, %rh1, %rh2;
	cvt.u32.u16 	%r2, %tid.x;
	add.u32 	%r3, %r2, %r1;
	cvt.u64.u32 	%rd1, %r3;
	mul.lo.u64 	%rd2, %rd1, 16;
	ld.param.u64 	%rd3, [__cudaparm__Z4demoPK6float4PS__in];
	add.u64 	%rd4, %rd3, %rd2;
	ld.global.v4.f32 	{%f1,%f2,%f3,%f4}, [%rd4+0];
	mov.f32 	%f5, 0f40000000;	// 2
	add.f32 	%f6, %f2, %f5;
	.loc	28	12	0
	mov.f32 	%f7, 0f40a00000;	// 5
	add.f32 	%f8, %f3, %f7;
	.loc	28	13	0
	mov.f32 	%f9, 0f40e00000;	// 7
	add.f32 	%f10, %f4, %f9;
	.loc	28	15	0
	ld.param.u64 	%rd5, [__cudaparm__Z4demoPK6float4PS__out];
	add.u64 	%rd6, %rd5, %rd2;
	mov.f32 	%f11, 0f3f800000;	// 1
	add.f32 	%f12, %f1, %f11;
	st.global.v4.f32 	[%rd6+0], {%f12,%f6,%f8,%f10};
	.loc	28	16	0
	exit;
$LDWend__Z4demoPK6float4PS_:
	} // _Z4demoPK6float4PS_

which is clearly using registers to store val.

I’m struggling with this at the moment, actually.

I have a kernel function which has ~100 local variables (of type float) and so compiles to use 58 registers. The occupancy is low, so in an attempt to remedy this I replaced a whole lot of those variables with #defines, leaving me with ~15. However, the register usage is still 52!?

I can’t understand the PTX output.

Can anyone point me in the right direction? Is there a standard approach?

What do you mean you replaced variables with #defines? Keep in mind that the number of variables you have is only loosely correlated with the number of registers your kernel uses. Compilers are smart enough to figure out when a variable is no longer needed, and reuse the register assigned to it. (Conversely, a complex expression may require additional registers for intermediate calculations.)
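A contrived sketch of what that reuse means (the actual assignment is up to ptxas, so read the comments as intent rather than a guarantee):

__global__ void reuse(float *data)
{
	unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float a = data[tid]*2.0f;
	data[tid] = a;                   // 'a' is dead after this store...
	float b = data[tid] + 1.0f;      // ...so its register can be recycled for 'b'
	data[tid] = b;
}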

Also, you should realize that register assignment is done after the PTX generation stage. If you look at the PTX code, you’ll see a huge number of registers used, because the compiler emits PTX in static single assignment form. When ptxas converts the PTX to the cubin GPU machine code format, it maps the registers used in the PTX code to actual hardware registers.

Reducing register usage is tricky, and I don’t have any good heuristics. You can try forcing the compiler to spill registers to local memory with the --maxrregcount option to nvcc. Because local memory (basically global memory assigned to each thread) is much slower, this can make things worse, but it is worth a try.
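The option goes on the nvcc command line, for example (the cap of 32 here is an arbitrary value for illustration):

nvcc -arch=sm_13 --maxrregcount=32 -Xptxas -v -c mykernel.cu

Whatever no longer fits in the capped register budget gets spilled, and shows up as lmem bytes in the ptxas report.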

Variables were only being set once and then referred to multiple times. So, as you suggest, the compiler could probably be smart about this. [But I know nothing of how compilers work and as such don’t like to trust them!]

This increases compilation time drastically! I set maxrregcount to 30, to test, and my program still had not compiled an hour later… Less aggressive limits (e.g. maxrregcount 50) produced code which ran slower.

Good to know it’s not just me! But, yes, I was really hoping for some heuristics.

Thank you!

I find that if I decrease “maxrregcount” then, after a certain point, I get the infamous error:

ERROR: too many resources requested for launch

Global memory can’t be full, can it!? I have a Tesla C1060 (with ~4GB gmem…)

Recall that my kernel natively requires 58 registers. With ‘-maxrregcount=50’, the ptxas info is as follows:

ptxas info	: Compiling entry function '_Z5tT_1DPfS_S_j'
ptxas info	: Used 47 registers, 60+0 bytes lmem, 28+16 bytes smem, 352 bytes cmem[1]

I don’t know what the issue is.

What block and grid sizes are you trying to launch the kernel with?

Both blockDim and gridDim are <=512, if that’s what you’re asking?

One particular instance is: <<<20,416>>>

There is a limit of 8192 registers per MP for compute 1.0/1.1 and 16384 for compute 1.2/1.3. If you are trying to launch 416 threads per block with 47 registers per thread, that is 19552 total registers, which exceeds the register limits of any currently available version of the hardware. That is why you are getting the too many resources error.
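A quick host-side sanity check along those lines might look like this (a sketch: the per-thread register count is read off the ptxas -v report, and the real allocation granularity rounds up, so treat the arithmetic as approximate):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	const int regsPerThread   = 47;    // taken from "ptxas info : Used 47 registers"
	const int threadsPerBlock = 416;   // the launch configuration being attempted

	int regsNeeded = regsPerThread*threadsPerBlock;      // 19552 in this case
	printf("block needs %d registers, device allows %d per block\n",
	       regsNeeded, prop.regsPerBlock);
	if (regsNeeded > prop.regsPerBlock)
		printf("launch will fail: too many resources requested\n");
	return 0;
}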

Oh, yes, of course! I don’t know how I missed this; I thought I had it taken into account.
Thank you!