too many registers issue with memory writes and registers

hello all,

i have been programming molecular dynamics in CUDA, and i mainly used Tesla C1060 architecture (compiling for arch 1.3), toolkit 3.1. Few days ago i compiled the same code on a Fermi (-arch 2.0) and the compiler said the kernels use a way larger amount of registers. using fast-math helps keeping the amount of registers down but they are still more than the 1.3 compiled kernel.
I investigated a bit and i wrote a dummy kernel: all it does is writing some values in the global memory. something like:

int idx = threadIdx.x + blockIdx.x*BLOCKSIZE;

//global memory writes
array1[idx] = threadIdx.x + gridDim.x;
array2[idx] = some other value…

here i count only one register (int idx), ok maybe the GPU needs a temporary register to store partial results (such as threadIdx.x + gridDim.x), but the more global memory writes are performed, the more registers are consumed. Could anyone explain this? i think i am missing something. Is this related to cache (one of the big differences between 1.3 and 2.0)… is it a compiler issue?

thanks for your help.

cheers

flp

It makes sense a little bit.

I think the GPU probably needs a register to store the results of your calculations.

Then that register can be moved to main memory.

what does not make sense is that every temporary operation needs a new register. performance of a kernel that computes lots of stuff is then destroyed. and i dont really get why arch 1.3 does not show this behaviour as much: if i count manually the registers i get almost the same as what the --ptxas-options=v tells (with a mismatch of ~1-2). While with arch 2.0 if i count 14 registers, ptxas tells me i am using ~28.
With such humongus number of registers used, the occupancy is reduced.
i will test more and post some examples.

ok, here is the real example i tested right now. the kernel is

global void testkernel (float *in, float out)
{
int idx = threadIdx.x + blockIdx.x
BLOCK;
float val = in[idx];
val += (float)idx;

out[idx] = val;
}

if i compile for 1.3: -arch sm_13 --ptxas-options=-v … i get
ptxas info : Compiling entry function ‘Z10testkernelPfS’ for ‘sm_13’
ptxas info : Used 4 registers, 16+16 bytes smem, 272 bytes cmem[0], 4 bytes cmem[1]

with compiler options: -arch sm_20 --ptxas-options=-v … i get
ptxas info : Compiling entry function ‘Z10testkernelPfS’ for ‘sm_20’
ptxas info : Used 8 registers, 48 bytes cmem[0], 272 bytes cmem[2]

I count 2 registers but i can understand it needs 4 to store the temporary results (arch 1.3). but what is going on in arch 2.0? 8 registers?

Using 8 registers is actually better than 4 in the case you presented. It allows greater room for instruction-level parallelism.

Things only become problematic when register count goes above 22 (if you want to launch concurrent blocks) or 32 (if you just need 1024 threads in a block).

thanks for your reply.

unfortunately that kernel is just a dummy one i used to investigate the issue. the real kernels indeed use lots of threads and lots of blocks. and i sweat a lot to keep the register count low when i was optimizing, then Fermi arrived and the register count exploded!

could you please explain me how overflowing the register improves the parallelism? do you know of any compiler flag ( beside arch 1.3 ) that makes the gpu do what i want it to do?
on a side note: if i compile for arch 1.3 and run on arch 2.0, do i get the performance of 2.0? should i look at the 2.0 arch in the cuda occupancy calculator?

sorry if this looks trivial, i am not a compiler expert : (

thanks

flp

I am guessing you are on a 64-bit host platform, because that is typically where the biggest changes in register use are seen when transitioning from sm_1x to sm_2x.

CUDA makes all device-side data types the same size as the corresponding host-side data type. This makes compound data types portable across the host-device boundry, for example. In particular, on a 64-bit host platform pointers and size_t are 64-bit types on both host and device, and for those 64-bit platforms where a long occupies 64 bits the same applies to code on the device.

Note that there is no hardware support for most 64-bit integer operations on current GPUs, meaning a 64-bit operand takes up two 32-bit registers and most 64-bit operations require two or more 32-bit operations.

No sm_1x platform accomodated more than 4GB of memory, meaning for pointers on the device only the lower 32 bits were meaningful. This allowed the compiler to optimize out a lot of the operations that dealt with the most significant 32 bits of device pointers, which in turn freed up the registers holding the upper 32 bits.

There are however sm_2x platforms that provide more than 4 GB of memory, so the majority of these optimizations no longer apply. This means that for builds on a 64-bit host, code compiled for sm_2x carries around the full 64 bits of a pointer at all times, which increases the register use compared to an sm_1x target. Note that these pointers are not only pointers explicitly occuring in the source code, but could also be pointers created by the compiler as part of common optimizations, for example by strength reduction / induction variable creation during array traversal in a loop.

For code that works a lot with shared memory, there is an additional source of increased register usage. In sm_1x devices, shared memory could be accessed via special address registers that existed in addition to the general purpiose register, but on sm_2x device general purpose registers are used for this purpose as separate address registers no longer exist.

There are additional second order effects that can increase register pressure on sm_2x devices. The sm_2x instruction set architecture is more of a strict load/store architecture, and therefore frequently requires a couple of additional registers for temporary storage. The introduction of an ABI (needed to support many C++ features, device-side printf and malloc, etc) means that one extra register is required for a stack pointer.

The compiler has been improved consistently since sm_2x support was first added in CUDA 3.0. At this point you would definitely want to use the CUDA 4.0 toolchain. Please keep in mind that based on the changes described above, some increase in register usage is unavoidable for most code when transitioning from sm_1x to sm_2x even as the compiler works hard to keep register usage low. While this is not typically necessary, you may want to look into using either the launch_bounds attribute or the -maxrregcount compiler switch to limit register use per thread below the bounds picked by the compiler. Note that this may cause dynamic instruction count to increase and/or register spilling to occur, either of which may in turn decrease performance. The caches on sm_2x devices can absorb minor spilling, but significant spilling will overwhelm the caches. The compiler flag -Xptxas -v causes the compiler to emit some relevant statistics for each kernel.

[later:]

If your code uses single-precision reciprocal, divisions, and square root, there is an additional source of register use increase when transitioning from sm_1x to sm_2x. For sm_1x platforms the division operator (with reciprocal as a special sub-case) and the sqrtf() function map to approximate versions, while on sm_2x platforms they map to IEEE-rounded versions by default. The increase in accuracy requires more elaborate implementations that use more instruction and some additional registers. You can approximate the sm_1x behavior by passing the following compiler flags to nvcc: -ftz=true -prec-sqrt=false -prec-div=false

thank you for your clear and complete reply.
I will try to play around with those flags and see when i get the most performance, and try not to obsess about getting 100% occupancy.
this was very helpful, thanks to everybody.

cheers

flp