When Dynamic Parallelism is used, is the actual number of registers used equal to (# of threads per block in kernel 1) x (# of threads per block in kernel 2) x (# of registers used in kernel 1) x (# of registers used in kernel 2)?
nvcc --ptxas-options=-v
shows the following:
ptxas info : Function properties for calc(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int, int, int, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, int)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 476 bytes cmem[0], 8 bytes cmem[2]
--
ptxas info : Function properties for calcd(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int, int, int, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, int, int, int)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 88 registers, 1440 bytes smem, 484 bytes cmem[0], 120 bytes cmem[2]
In our case, the limit on registers per block is 65536 (we are using a Tesla P100). If the formula above is correct, the register usage exceeds 65536.