Registers in Fermi (cc2.0) for CUDA Fortran

I'm using a Tesla M2050 (Fermi, cc2.0), which should have 32K 32-bit registers per SM. I set up 64 threads per SM, which is a good value for my case, so the maximum number of registers per thread is 512.
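The arithmetic behind that 512-register budget can be checked in one line (numbers taken from the post; the 63-register note is the documented per-thread ceiling for sm_2x):

```shell
# Register budget sanity check:
regs_per_sm=32768    # Fermi cc2.0: 32K 32-bit registers per SM
threads=64           # threads resident per SM in this setup
echo $((regs_per_sm / threads))   # occupancy-limited budget per thread
# Note: sm_2x also has a hard architectural ceiling of 63 registers
# per thread, which matches the 63 that ptxas reports below.
```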

My kernel is fairly heavy (about 1000 lines). I tried adding another 24 registers to buffer 12 double-precision values, but the code runs much slower than the version without this buffer. So I'm wondering whether there might be a bug (the register count being limited to 63) when compiling for cc2.0.
Whatever I do to add the buffer to the code, ptxas always reports 63 registers for cc2.0, while the count changes for cc1.3.

Without buffer version:

ptxas info : Compiling entry function ‘raycast’ for ‘sm_13’
ptxas info : Used 100 registers, 232+0 bytes lmem, 2144+16 bytes smem, 2768 bytes cmem[0], 140 bytes cmem[1], 4 bytes cmem[14]
executing /opt/applications/pgi/linux86-64/11.4/bin/pgnvd raycast_GPUkernel.001.gpu -computecap=20 -ptx /tmp/pbs.17272.service0/pgcudafor6sDbUr_kohB8.ptx -o /tmp/pbs.17272.service0/pgcudaforksDbEFWhDSWo.bin -3.2 -info

ptxas info : Compiling entry function ‘raycast’ for ‘sm_20’
ptxas info : Used 63 registers, 8+0 bytes lmem, 2128+0 bytes smem, 48 bytes cmem[0], 2768 bytes cmem[2], 4 bytes cmem[14], 40 bytes cmem[16]
0 inform, 0 warnings, 0 severes, 0 fatal for …cuda_fortran_constructor_1
PGF90/x86-64 Linux 11.4-0: compilation successful

With buffer version:

ptxas info : Compiling entry function ‘raycast’ for ‘sm_13’
ptxas info : Used 123 registers, 232+0 bytes lmem, 2144+16 bytes smem, 2768 bytes cmem[0], 140 bytes cmem[1], 4 bytes cmem[14]
executing /opt/applications/pgi/linux86-64/11.4/bin/pgnvd raycast_GPUkernel.001.gpu -computecap=20 -ptx /tmp/pbs.17366.service0/pgcudaforIeBdM2J82rkF.ptx -o /tmp/pbs.17366.service0/pgcudaforYeBdwRcAfWPd.bin -3.2 -info
ptxas info : Compiling entry function ‘raycast’ for ‘sm_20’
ptxas info : Used 63 registers, 8+0 bytes lmem, 2128+0 bytes smem, 48 bytes cmem[0], 2768 bytes cmem[2], 4 bytes cmem[14], 48 bytes cmem[16]
0 inform, 0 warnings, 0 severes, 0 fatal for …cuda_fortran_constructor_1
PGF90/x86-64 Linux 11.4-0: compilation successful

Another issue: -ta=nvidia:cc20 or -ta=nvidia:cc13 seems to automatically turn on a high compiler optimization level (-O3 or -O2).
I have now updated to pgfortran 11.4 (in 11.3, the -O3 and -O2 modes had a bug when translating my code into .gpu files, which led to out-of-bounds array accesses). But for my code, -O3/-O2 is not as good as -O1/-O0 (it leads to 15% higher computation cost).

So without a -ta=nvidia:ccxx parameter, which version finally gets executed on my Fermi GPUs? The compiler builds both sm_13 and sm_20 images.
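For what it's worth, here is a sketch of how one might pin the target and cap registers explicitly with PGI 11.x. The flag spellings are assumptions from memory of the PGI documentation (the source file name is inferred from the .gpu file in the logs above), so verify them against `pgfortran -help` on your install:

```shell
# Sketch, assuming PGI 11.x flag spellings -- check `pgfortran -help`:
# build only the sm_20 image, keeping the optimizer at -O1:
pgfortran -O1 -Mcuda=cc20 -c raycast_GPUkernel.cuf
# -Mcuda=maxregcount:n is PGI's analogue of nvcc's -maxrregcount,
# asking ptxas for an explicit per-thread register cap:
pgfortran -O1 -Mcuda=cc20,maxregcount:63 -c raycast_GPUkernel.cuf
```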