I want to compile a shared library from dgemm_kernel_default.cu using the following command:
nvcc -gencode arch=compute_60,code=$ARCH -O3 --shared --compiler-options -fPIC,-O3 -Xcicc -O3 -Xptxas -v,-O3 -keep dgemm_kernel_default.cu -o libdgemm_kernel.so
I tried three values for ARCH: sm_60, sm_70 and sm_80. The resource usage reported by ptxas (-Xptxas -v) was:
| ARCH | stack_frame (bytes) | spill_loads (bytes) | spill_stores (bytes) |
|---|---|---|---|
| sm_60 | 0 | 0 | 0 |
| sm_70 | 56 | 92 | 52 |
| sm_80 | 48 | 76 | 44 |
Theoretically, the kernel function dgemm_kernel uses 64 registers for accumulators, 22 registers for global_memory loads (1 ptr, 2 counters, 8 fp64 vars) and an additional 2 registers for shared_memory stores. Of the remaining 40 registers (the total is 128 when launching 16 warps per block), only 24 are required for loading A/B elements from shared memory (and some of them can be reused for global_memory loads). So there is no reason for register spilling.
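To make the tally above easy to check, here is a minimal sketch that just adds up the numbers quoted in the post. The constant names are made up for illustration and do not come from dgemm_kernel_default.cu; the counts are taken as stated, without trying to verify them against the actual kernel.

```cpp
// Register tally from the post, in 32-bit registers per thread.
// All names here are hypothetical; the values are the ones quoted above.
constexpr int kBudget       = 128;  // per-thread limit stated in the post
constexpr int kAccumulators = 64;   // fp64 accumulators
constexpr int kGlobalLoads  = 22;   // 1 ptr, 2 counters, 8 fp64 vars
constexpr int kSharedStores = 2;    // shared_memory store addressing
constexpr int kSharedLoads  = 24;   // A/B elements read from shared memory

// Registers left after accumulators, global loads and shared stores.
constexpr int remaining_after_fixed_costs() {
    return kBudget - (kAccumulators + kGlobalLoads + kSharedStores);
}

// Headroom left once the shared_memory loads are also accounted for.
constexpr int headroom() {
    return remaining_after_fixed_costs() - kSharedLoads;
}

static_assert(remaining_after_fixed_costs() == 40, "matches the 40 quoted in the post");
static_assert(headroom() >= 0, "the hand count fits into the budget");
```

By this hand count there are still 16 registers of slack, which is why the spills look surprising. The count ignores temporaries for address arithmetic, predicates and anything else the compiler needs, so it is only a lower bound on real usage.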
The NVCC version was V12.3.107.
Do the spills vanish if you allow more registers?
More registers can’t be used due to the presence of the launch bounds directive in the code. (Unless you make code changes.)
You can see how the registers are used with the register life-range view of the CUDA binary tools (nvdisasm -plr; compile to a cubin first).
Sure, lowering the number of threads per block (in the launch bounds) can eliminate register spilling. But it will reduce SM occupancy, which will hurt performance.
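The trade-off between block size and the per-thread register cap can be sketched as follows. This is a rough model, not what nvcc literally computes: it assumes 65536 32-bit registers per SM and a 255-register per-thread limit (which, as far as I know, holds for sm_60, sm_70 and sm_80), and it ignores register allocation granularity. The helper name reg_cap is hypothetical.

```cpp
// Assumed hardware limits; check the compute-capability tables for your GPU.
constexpr int kRegsPerSM        = 65536;  // 32-bit registers per SM
constexpr int kMaxRegsPerThread = 255;    // architectural per-thread limit

// Hypothetical helper: largest per-thread register count that still lets
// `blocks_per_sm` blocks of `threads_per_block` threads reside on one SM.
constexpr int reg_cap(int threads_per_block, int blocks_per_sm = 1) {
    const int by_capacity = kRegsPerSM / (threads_per_block * blocks_per_sm);
    return by_capacity < kMaxRegsPerThread ? by_capacity : kMaxRegsPerThread;
}

static_assert(reg_cap(512) == 128, "16 warps per block gives the 128 from the post");
static_assert(reg_cap(256) == 255, "8 warps per block hits the per-thread limit");
```

Under these assumptions, halving the block to 256 threads lifts the cap from 128 to 255 registers, but only 8 warps stay resident per block instead of 16.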
I observed that PTXAS often allocates different registers for the addend and the destination of the DFMA instruction, which is unnecessary for GEMM and increases register usage.
Another strange thing: although 12 registers are spilled to the stack for sm_80 (a 48-byte stack frame at 4 bytes per register), the “number of occupied registers” in the output of “nvdisasm -g -c -plr dgemm_kernel_default.cubin” never exceeds 106, while the register limit is 128. The lines that use 106 registers don’t use R100-R101, R103-R107, R112-R115, R120-R123 and R125-R127.
Just asking to make sure that it is the number of registers and not some other reason for using local memory: non-inlined function calls, dynamic array indexing, or pointers/references that cannot be resolved at compile time.
Possibly related: wasn’t there an option to enable or disable certain device-code ABI conventions? Perhaps the mentioned registers have a special meaning within the ABI (but then, why would they be above register 63?).
Finally, I found that the problem originated from several “assert” calls in the code. The register spilling vanished after simply defining NDEBUG. @Robert_Crovella @Curefab Thanks for your help.
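For anyone who finds this thread later, here is a small sketch of the mechanism. It is plain C++ so it can be tried on the host, but the same assert macro is usable in CUDA device code and is disabled the same way. The function load_checked is a hypothetical stand-in, not code from dgemm_kernel_default.cu. Presumably each live assert keeps a failure path (with its arguments) alive in the kernel, which can cost registers; that part is my reading of the result above, not something verified here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bounds-checked load, for illustration only.
// Without NDEBUG, assert() keeps the comparison and its failure path in the
// generated code. With NDEBUG defined before <cassert> is included,
// assert(x) expands to ((void)0) and the check disappears entirely.
inline double load_checked(const double* data, std::size_t n, std::size_t i) {
    assert(i < n);
    (void)n;  // silences the unused-parameter warning when NDEBUG is defined
    return data[i];
}
```

The macro can be defined on the command line instead of in the source, e.g. by adding -DNDEBUG to the nvcc invocation from the first post.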