I am fighting to reduce register pressure on a kernel, and I begn looking at the
settting for maxrregcount=32.
[
nvcc
-gencode=arch=compute_20,code="sm_20,compute_20"
–machine 64
-ccbin ‘c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN’
-Xcompiler ‘/EHsc /W3 /nologo /O2 /Zi /MT -I … ’
-I’C:/Users/LHickey/cuda/common/inc’
-I’C:/Users/LHickey/cuda/shared/inc’
-I’C:/Users/LHickey/cuda/cbe/common’
-maxrregcount=32 \ <-------------------------------------------
-Xptxas=“-v”
–compile -o $@
$<
]
My kernel output goes into some zero copy memory in devO, and things seem to be working fine if maxrregcount=32.
I need to call cudaThreadSynchronize to make the zero copy memory in O show up on the cpu host side.
{
gpuKernel<<<nBlocks,BlockSz>>>(nScn, devP,dev_S,dev_X,devO);
cudaThreadSynchronize(); /* sync up mapped mem with host */ <------------------
}
Then I can look at the output in O.
devO is the address in gpu address space you get with
HANDLE_ERROR( cudaHostGetDevicePointer(&devO,O,0));
with the maxrregcount=32 the Xptas reports usage:
make all
1 tmpxft_0000074c_00000000-3_aop.cudafe1.gpu
2 tmpxft_0000074c_00000000-8_aop.cudafe2.gpu
3 aop.cu
4 ptxas info : Compiling entry function ‘_Z9gpuKerneliP4PdefP4SdefP4XdefP4Odef’ for ‘sm_20’
5 ptxas info : Used 32 registers, 72 bytes cmem[0], 36 bytes cmem[16] <--------------------------
6 ptxas info : Compiling entry function ‘_Z14float_to_colorP6uchar4PKf’ for ‘sm_20’
7 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
8 ptxas info : Compiling entry function ‘_Z14float_to_colorPhPKf’ for ‘sm_20’
9 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
10 tmpxft_0000074c_00000000-3_aop.cudafe1.cpp
11 tmpxft_0000074c_00000000-14_aop.ii
and line 5 says 32 reg used. ok that makes sense
with the maxrregcount=40 the Xptas reports usage:
make all
1 ptxas info : Compiling entry function ‘_Z9gpuKerneliP4PdefP4SdefP4XdefP4Odef’ for ‘sm_20’
2 ptxas info : Used 36 registers, 72 bytes cmem[0], 36 bytes cmem[16] <-----------------------------
3 ptxas info : Compiling entry function ‘_Z14float_to_colorP6uchar4PKf’ for ‘sm_20’
4 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
5 ptxas info : Compiling entry function ‘_Z14float_to_colorPhPKf’ for ‘sm_20’
6 ptxas info : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
7 tmpxft_00000ca4_00000000-3_aop.cudafe1.cpp
8 tmpxft_00000ca4_00000000-14_aop.ii
line 2 says that 36 regs used out of the 40. I would like to be able to use 36 then, not 32 I guess.
OK but the bad part: Setting maxrregcount to anything but 32 makes the data
that should appear in zero copy memory O not transferred. It stays 0.
Maybe cudaThreadSynchronize(); is failing. There is no copy needed from gpu men to host
cause O is the special zero copy memory
struct Odef O;
HANDLE_ERROR (cudaHostAlloc( (void*) &O, nScn * sizeof (struct Odef),cudaHostAllocMapped));
{
struct Odef *devO;
HANDLE_ERROR( cudaHostGetDevicePointer(&devO,O,0)); <— that last parameter is zero,
…
}
Is a copy from L1 cache held by the kernel into O, mapped pinned memory , not happening because
maxrregcount not = 32?
I want to use zero copy memory rather than ususal device memory so I could save
the copy time of O into host mem every time my kernel cycles with a new data load.
I use no sharred memory at all, All high speed memory can be used for registers used the individual threads.
I tried wating for the output in O to finish from the kernel, but it never gets updated no matter how long I wait.
I am using a C2050 on a 64 bit windows 7 machine.
How can I increase maxrregcount to 36 and decrease my register presseure, (so more threads can run concurrently)
AND not break the ability of the kernel to write into the zero copy memory?