Nvdisasm vs ncu: discrepancy in SASS for register spilling

This question applies to every kernel I have tested (vector add, matrix multiply, etc.).
I tried to introduce register spilling by compiling with nvcc and the flag --maxrregcount=16. I then disassembled the executable with nvdisasm -g -c <app>. The SASS contains LDL and STL instructions. However, these instructions do not appear on the ‘Source and SASS’ page of Nsight Compute after I profile the executable with ncu. Why is this the case? How can the SASS differ for the same executable?

A side question: SASS generated by nvdisasm looks like /*0008*/ MOV R1, c[0x0][0x20] ;. In some textbooks and articles the instruction offsets are multiples of 0x10 (/*0010*/, /*0020*/, ...), but the ones from cuobjdump or nvdisasm are /*0008*/, /*0010*/, /*0018*/, /*0020*/, ... Why is there a difference? How can I get the instruction offsets (pcOffset) to be multiples of 0x10: /*0010*/, /*0020*/, /*0030*/, ...? Are there any flags for this?

It’s most likely due to different GPU generations using different methods of adding control information.

You can read the details in Chapter 2 of the “Dissecting Turing” paper.

I see the same /*0008*/, /*0010*/, /*0018*/, /*0020*/, ... offsets whichever sm_xx version I choose.

Here are two excerpts from the output of “cuobjdump --dump-sass” for the same source file, one compiled with “-gencode arch=compute_61,code=sm_61” and one with “-gencode arch=compute_75,code=sm_75”:

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
identifier = ../arrtest.cu

        code for sm_61
                Function : _Z5blockPj
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                                          /* 0x001c7c00fe0007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                                 /* 0x4c98078000870001 */
        /*0010*/         {         IADD32I R1, R1, -0x8 ;                                 /* 0x1c0fffffff870101 */
        /*0018*/                   S2R R22, SR_TID.X         }
                                                                                          /* 0xf0c8000002170016 */
                                                                                          /* 0x001fc400e1a00ff0 */
        /*0028*/         {         ISETP.GT.U32.AND P0, PT, R22, 0xff, PT ;               /* 0x366803800ff71607 */
        /*0030*/                   S2R R20, SR_CTAID.X         }
                                                                                          /* 0xf0c8000002570014 */
        /*0038*/              @!P0 MOV32I R3, 0x0 ;                                       /* 0x010000000008f003 */
                                                                                          /* 0x001fd800fe8207f1 */
        /*0048*/              @!P0 SHR.U32 R5, R22.reuse, 0x1f ;                          /* 0x3828000001f81605 */


Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
identifier = ../arrtest.cu

        code for sm_75
                Function : _Z5blockPj
        .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                /* 0x00000a00ff017624 */
                                                                                          /* 0x000fd000078e00ff */
        /*0010*/                   S2R R3, SR_TID.X ;                                     /* 0x0000000000037919 */
                                                                                          /* 0x000e220000002100 */
        /*0020*/                   BMOV.32.CLEAR RZ, B0 ;                                 /* 0x0000000000ff7355 */
                                                                                          /* 0x000fe20000100000 */
        /*0030*/                   BSSY B0, 0x390 ;                                       /* 0x0000035000007945 */
                                                                                          /* 0x000fe20003800000 */
        /*0040*/                   IADD3 R1, R1, -0x8, RZ ;                               /* 0xfffffff801017810 */
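
The offset pattern in the two listings above can be reproduced with a short Python sketch. It assumes the layout those listings suggest: on sm_5x/sm_6x, 8-byte slots where the slot at every multiple of 0x20 holds a control word rather than an instruction, and on sm_70 and later, 16-byte instructions with the control bits embedded. The function names are made up for illustration, and older architectures are deliberately not modelled.

```python
def instruction_offsets(count, arch):
    """Return byte offsets of the first `count` instructions.

    `arch` is the numeric SM version (e.g. 61 or 75).
    Assumed layout, inferred from the listings above:
      - sm_5x / sm_6x: 8-byte slots; the first slot of every group of
        four (offsets 0x00, 0x20, 0x40, ...) is a control word.
      - sm_70 and later: every instruction occupies 16 bytes.
    """
    if arch < 50:
        raise ValueError("layout for architectures before sm_50 is not modelled")
    if arch >= 70:
        return [i * 0x10 for i in range(count)]

    offsets = []
    slot = 0
    while len(offsets) < count:
        if slot % 4 != 0:  # skip the control-word slot
            offsets.append(slot * 8)
        slot += 1
    return offsets


def fmt(offsets):
    """Format offsets the way cuobjdump/nvdisasm print them."""
    return ["/*%04x*/" % o for o in offsets]


print("sm_61:", " ".join(fmt(instruction_offsets(7, 61))))
print("sm_75:", " ".join(fmt(instruction_offsets(5, 75))))
```

Under these assumptions the sm_61 sequence skips 0x20 and 0x40, matching the lines in the first listing that carry only an encoding, while the sm_75 sequence advances by 0x10 per instruction.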

As mentioned above, the difference between /*0008*/, /*0010*/, etc. comes from different SM versions: instruction sizes, and therefore offsets, can differ. In the listings above, sm_61 places 8-byte instructions around separate control words, while sm_75 uses 16-byte instructions, so its offsets advance by 0x10.

For the spilling, nvdisasm will show SASS for all the SM versions in the binary. Nsight Compute only shows SASS for the GPU where the profile was run. Different SM versions can support different register counts without spilling, so you may or may not see it in the Nsight Compute profile.
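
To check which SM versions in a binary actually contain local-memory instructions, one option is to scan the text produced by “cuobjdump --dump-sass” and count LDL/STL per architecture. The following is a minimal Python sketch, assuming the dump format shown earlier in this thread (an "arch = sm_XX" header followed by /*offset*/ instruction lines). The function name is hypothetical, and the STL/LDL lines in the sample are invented for illustration, not real tool output.

```python
import re
from collections import defaultdict

# Header line that starts each per-architecture section of the dump.
ARCH_RE = re.compile(r"^\s*arch\s*=\s*(sm_\d+)")
# Instruction line: /*offset*/ [{] [@predicate] OPCODE ...
INSN_RE = re.compile(
    r"/\*[0-9a-fA-F]{4,}\*/\s+\{?\s*(?:@!?U?P\w+\s+)?([A-Z][A-Z0-9_]*)"
)


def count_local_memory_ops(dump_text):
    """Count LDL and STL instructions per 'arch = sm_XX' section."""
    counts = defaultdict(lambda: {"LDL": 0, "STL": 0})
    arch = None
    for line in dump_text.splitlines():
        header = ARCH_RE.match(line)
        if header:
            arch = header.group(1)
            counts[arch]  # register the section even if it has no LDL/STL
            continue
        insn = INSN_RE.search(line)
        if insn and arch is not None and insn.group(1) in ("LDL", "STL"):
            counts[arch][insn.group(1)] += 1
    return {a: dict(c) for a, c in counts.items()}


# Illustrative input only: the STL and LDL lines are made up for this sketch.
sample = """\
arch = sm_61
        /*0008*/                   MOV R1, c[0x0][0x20] ;
        /*0010*/                   STL [R1], R0 ;
        /*0018*/                   LDL R0, [R1] ;
arch = sm_75
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
"""

print(count_local_memory_ops(sample))
```

LDL and STL indicate local-memory traffic in general, so a kernel with a local array can show them without any register spilling. For spill counts specifically, compiling with nvcc -Xptxas -v makes ptxas report spill stores and loads for each target architecture.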