Global memory vs register storage How to force the compiler to use registers?

TL1 · June 22, 2009, 1:13pm

In a device function, I am allocating some local storage.

struct mystruct

{

   ushort2 xy;

   float2 st;

};

__device__ void foo()

{

   mystruct local[10];

   ...

}

When I compile using nvcc 2.2, the storage is clearly allocated as global device memory. Using compiler option -Xptxas=-v I can see the lmem storage vary as I change the size of the array.

But I don’t want to use global device memory because it’s too slow for frequent access. Instead I want to use registers. My target device is Tesla S1070 and I believe I have registers to spare.

How can I encourage the compiler to use registers rather than slow device memory for this array?

Changing data type and alignment does not seem to help.

The general problem I am having is that each thread needs one or two hundred bytes of read-write memory. Shared memory is too small and global memory is too slow. I guess this must be a frequent problem – what are standard solutions please?

Note: I have been blocked on this issue for some time and would really appreciate some help.

Thanks!

Quoc_Vinh · June 22, 2009, 1:34pm

In a device function, I am allocating some local storage.
struct mystruct

{

   ushort2 xy;

   float2 st;

};

__device__ void foo()

{

   mystruct local[10];

   ...

}
When I compile using nvcc 2.2, the storage is clearly allocated as global device memory. Using compiler option -Xptxas=-v I can see the lmem storage vary as I change the size of the array.

But I don’t want to use global device memory because it’s too slow for frequent access. Instead I want to use registers. My target device is Tesla S1070 and I believe I have registers to spare.

How can I encourage the compiler to use registers rather than slow device memory for this array?

Changing data type and alignment does not seem to help.

The general problem I am having is that each thread needs one or two hundred bytes of read-write memory. Shared memory is too small and global memory is too slow. I guess this must be a frequent problem – what are standard solutions please?

Note: I have been blocked on this issue for some time and would really appreciate some help.

Thanks!

I don’t know how to control register and how to force the compiler use register neither local memory.

The compiler by itself will decide with memory will be use, in your case your structure and array take alot of memory, so i think compiler decided use local memory to allocate your array.

The number of register is limited, 8192 for all thread on each block.

for instance, if you have 256 threads per block. so 1 thread only has 32 registers to use.

consider with your case, assuming that you have 256 threads per block. sizeof(mystruct) * 10 = 30 registers, so how can the kernel execute another purpose with only 2 registers.

TL1 · June 22, 2009, 1:53pm

Even if I reduce the size of the array, as in…

__device__ void foo()

{

   mystruct local[1];

}

… the array is still placed in local memory. Also, reducing the number of threads per block has no effect.

I need a way to tell the compiler to use registers. The ‘register’ keyword is ignored.

Quoc_Vinh · June 22, 2009, 2:17pm

__device__ void foo()

{

   mystruct local[1];

}

I am not sure the compiler will use register for allocating your “local”

but if you chance like this, I think compiler will use register to allocate your xy and tp variables;

__device__ void foo()

{

	ushort2 xy;

	float2 tp;

}

of course two above methods is not realy same.

sigismondo · June 22, 2009, 2:26pm

I have recently read this explanation:

"The compiler cannot determine if an array is indexed with constant quantities. Please note that registers are not addressable so an array has to go into local memory – even if it is a two-element array – when the addressing of the array is not known at compile time. "

From:

http://www.ddj.com/hpc-high-performance-computing/215900921

Actually it sounds reasonable.

So you need to declare

mystruct local1;

mystruct local2;

mystruct local3;

In this way they shoud go to registers. Either I think you need to perform all the addressing of the array at compile time, i.e., using templates. In such a case the compiler MAY be able to optimize to registers.

flagman5 · July 3, 2009, 7:12am

I am also trying force my kernel to use more registers than shared memory (if it isnt already). I tried looking for code/functions to specifically declare to use registers instead of shared memory, but it seems that this is something that cannot be specified by the programmer but rather automatically by nvcc.

When I run deviceQuery, It shows:

Total amount of constant memory:			   65536 bytes

Total amount of shared memory per block:	   16384 bytes

Total number of registers available per block: 16384

seems that I have double the registers as you…if we are running the same kernel, would nvcc make use of all 16384 or still just use 8192?

When I compile the code with -cubin option and examine the .cubin file, the output:

architecture {sm_10}

abiversion {0}

modname {cubin}

consts  {

	name = d_k_fir_fff_taps

	segname = const

	segnum = 0

	offset = 0

	bytes = 65536

}

code  {

	name = __globfunc__Z40cuda_fir_fff_filter_7_kernel_short_decimPKfPf

jjj

	lmem = 0

	smem = 48

	reg = 11

	bar = 1

	bincode  {

shows only 11 registers?? why??

If I decrease the shared memory allocation, would more register be used? Can anyone please shed some light on this?

Thanks

nitin.life · July 3, 2009, 12:33pm

I am also trying force my kernel to use more registers than shared memory (if it isnt already). I tried looking for code/functions to specifically declare to use registers instead of shared memory, but it seems that this is something that cannot be specified by the programmer but rather automatically by nvcc.

When I run deviceQuery, It shows:
Total amount of constant memory:			   65536 bytes

 Total amount of shared memory per block:	   16384 bytes

 Total number of registers available per block: 16384
seems that I have double the registers as you…if we are running the same kernel, would nvcc make use of all 16384 or still just use 8192?

When I compile the code with -cubin option and examine the .cubin file, the output:
architecture {sm_10}

 abiversion {0}

 modname {cubin}

 consts  {

	 name = d_k_fir_fff_taps

	 segname = const

	 segnum = 0

	 offset = 0

	 bytes = 65536

 }

 code  {

	 name = __globfunc__Z40cuda_fir_fff_filter_7_kernel_short_decimPKfPf

 jjj

	 lmem = 0

	 smem = 48

	 reg = 11

	 bar = 1

	 bincode  {
shows only 11 registers?? why??

If I decrease the shared memory allocation, would more register be used? Can anyone please shed some light on this?

Thanks

Cards with compute capability >= 1.2 have 16384 registers, the older ones (1.0 , 1.1) have 8192. (APPENDIX A CUDA programming guide)

Also decreasing shared memory should not effect your register usage, but it may increase the occupancy of the multiprocessor “if” you number of thread blocks per sm… were limited by shared memory in first place.

The “11” is the numbers of register per thread nvcc is using.

(check cuda occupancy calculator spreadsheet)

NA

Topic		Replies	Views
Register vs local memory Forcing NVCC to use registers CUDA Programming and Performance	4	3180	June 29, 2007
CUDA __local__ force CUDA Programming and Performance	6	6085	June 26, 2008
Device Register Array and Device Malloc CUDA Programming and Performance	1	6710	May 13, 2011
Force a variable to be stored in a Register Is there any way to ensure a variable CUDA Programming and Performance	13	9183	May 21, 2010
Local variables and registers CUDA Programming and Performance	13	6343	March 23, 2010
Registers and Locally declared Variables Variables declared in _global_ functions CUDA Programming and Performance	4	5837	September 5, 2007
How to force declaring a variable as register? CUDA Programming and Performance	6	8279	April 21, 2008
Help me to understand Global vs Local Memory performance. CUDA Programming and Performance	19	25131	December 21, 2009
Help with variable allocate in register CUDA Programming and Performance	12	1839	July 24, 2016
How to reduce Local Memory Usage. CUDA Programming and Performance	19	12779	November 30, 2009

Global memory vs register storage How to force the compiler to use registers?

Related topics