maxrregcount > 32 breaks zerocopy memory

I am fighting to reduce register pressure on a kernel, and I begn looking at the
settting for maxrregcount=32.

[
nvcc
-gencode=arch=compute_20,code="sm_20,compute_20"
–machine 64
-ccbin ‘c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN’
-Xcompiler ‘/EHsc /W3 /nologo /O2 /Zi /MT -I … ’
-I’C:/Users/LHickey/cuda/common/inc’
-I’C:/Users/LHickey/cuda/shared/inc’
-I’C:/Users/LHickey/cuda/cbe/common’
-maxrregcount=32 \ <-------------------------------------------
-Xptxas=“-v”
–compile -o $@
$<
]

My kernel output goes into some zero copy memory in devO, and things seem to be working fine if maxrregcount=32.
I need to call cudaThreadSynchronize to make the zero copy memory in O show up on the cpu host side.
{
gpuKernel<<<nBlocks,BlockSz>>>(nScn, devP,dev_S,dev_X,devO);
cudaThreadSynchronize(); /* sync up mapped mem with host */ <------------------
}
Then I can look at the output in O.
devO is the address in gpu address space you get with
HANDLE_ERROR( cudaHostGetDevicePointer(&devO,O,0));


with the maxrregcount=32 the Xptas reports usage:
make all
1 tmpxft_0000074c_00000000-3_aop.cudafe1.gpu
2 tmpxft_0000074c_00000000-8_aop.cudafe2.gpu
3 aop.cu
4 ptxas info : Compiling entry function ‘_Z9gpuKerneliP4PdefP4SdefP4XdefP4Odef’ for ‘sm_20’
5 ptxas info : Used 32 registers, 72 bytes cmem[0], 36 bytes cmem[16] <--------------------------
6 ptxas info : Compiling entry function ‘_Z14float_to_colorP6uchar4PKf’ for ‘sm_20’
7 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
8 ptxas info : Compiling entry function ‘_Z14float_to_colorPhPKf’ for ‘sm_20’
9 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
10 tmpxft_0000074c_00000000-3_aop.cudafe1.cpp
11 tmpxft_0000074c_00000000-14_aop.ii

and line 5 says 32 reg used. ok that makes sense


with the maxrregcount=40 the Xptas reports usage:
make all
1 ptxas info : Compiling entry function ‘_Z9gpuKerneliP4PdefP4SdefP4XdefP4Odef’ for ‘sm_20’
2 ptxas info : Used 36 registers, 72 bytes cmem[0], 36 bytes cmem[16] <-----------------------------
3 ptxas info : Compiling entry function ‘_Z14float_to_colorP6uchar4PKf’ for ‘sm_20’
4 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
5 ptxas info : Compiling entry function ‘_Z14float_to_colorPhPKf’ for ‘sm_20’
6 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
7 tmpxft_00000ca4_00000000-3_aop.cudafe1.cpp
8 tmpxft_00000ca4_00000000-14_aop.ii

line 2 says that 36 regs used out of the 40. I would like to be able to use 36 then, not 32 I guess.

OK but the bad part: Setting maxrregcount to anything but 32 makes the data
that should appear in zero copy memory O not transferred. It stays 0.
Maybe cudaThreadSynchronize(); is failing. There is no copy needed from gpu men to host
cause O is the special zero copy memory

struct Odef O;
HANDLE_ERROR (cudaHostAlloc( (void
*) &O, nScn * sizeof (struct Odef),cudaHostAllocMapped));
{
struct Odef *devO;
HANDLE_ERROR( cudaHostGetDevicePointer(&devO,O,0)); <— that last parameter is zero,

}
Is a copy from L1 cache held by the kernel into O, mapped pinned memory , not happening because
maxrregcount not = 32?

I want to use zero copy memory rather than ususal device memory so I could save
the copy time of O into host mem every time my kernel cycles with a new data load.

I use no sharred memory at all, All high speed memory can be used for registers used the individual threads.
I tried wating for the output in O to finish from the kernel, but it never gets updated no matter how long I wait.

I am using a C2050 on a 64 bit windows 7 machine.
How can I increase maxrregcount to 36 and decrease my register presseure, (so more threads can run concurrently)
AND not break the ability of the kernel to write into the zero copy memory?

gpuKernel<<<nBlocks,BlockSz>>>(nScn, devP,dev_S,dev_X,devO);

cudaThreadSynchronize(); /* sync up mapped mem with host */

You don’t seem to check for kernel launch errors according to this code.

Use cudaGetLastError() after the cudaThreadSynchronize().

With 36 registers or more, trying to launch a block of 512 threads will

fail because 512 * 36 > 16384 (which is the size of the register file for

Compute 1.2 devices and better).

Christian

Thanks.