Number of registers used with Dynamic Parallelism

daichi.fukuoka · April 17, 2020, 6:10am

When Dynamic Parallelism is used, is the actual number of registers used equal to (# of threads per block of kernel 1) x (# of threads per block of kernel 2) x (# of registers used in kernel 1) x (# of registers used in kernel 2)?
Compiling with nvcc --ptxas-options=-v gives the following output:

ptxas info    : Function properties for calc(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int, int, int, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, int)
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 476 bytes cmem[0], 8 bytes cmem[2]
--
ptxas info    : Function properties for calcd(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int, int, int, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, unsigned int*, int, int, int)
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 88 registers, 1440 bytes smem, 484 bytes cmem[0], 120 bytes cmem[2]

In our case (a Tesla P100), the limit is 65536 registers per block. If that formula holds, the numbers above would exceed this limit.