Why does a store cause a load into a uniform register on SM89?

Consider the following code.

#include <cstdint>
__global__ void copy(uint32_t * z, uint32_t x) {
    *z = x;
}

The generated PTX is fairly straightforward.

.visible .entry copy(unsigned int*, unsigned int)(
        .param .u64 copy(unsigned int*, unsigned int)_param_0,
        .param .u32 copy(unsigned int*, unsigned int)_param_1
)
{

        ld.param.u64    %rd1, [copy(unsigned int*, unsigned int)_param_0];
        ld.param.u32    %r1, [copy(unsigned int*, unsigned int)_param_1];
        cvta.to.global.u64      %rd2, %rd1;
        st.global.u32   [%rd2], %r1;
        ret;

}

But on NVCC 12.0.0 and above (when specifying via -gencode arch=compute_XX,code=sm_XX), the generated SASS seems to include a useless load into UR4 when targeting SM80 and SM89, but not for SM75.

// SM80 and SM89
copy(unsigned int*, unsigned int):
 IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] 
 IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x168] 
 MOV R2, c[0x0][0x160] 
 ULDC.64 UR4, c[0x0][0x118]  // This line here
 MOV R3, c[0x0][0x164] 
 STG.E [R2.64], R5 
 EXIT 

For SM90 we get the same load, but it’s actually used in the store. I expect this, because of TMA support.

// SM90
copy(unsigned int*, unsigned int):
 LDC R1, c[0x0][0x28] 
 LDC R5, c[0x0][0x218] 
 ULDC.64 UR4, c[0x0][0x208] 
 LDC.64 R2, c[0x0][0x210] 
 STG.E desc[UR4][R2.64], R5 
 EXIT 

But for SM75 the load into the uniform register is actually used.

// SM75
copy(unsigned int*, unsigned int):
 MOV R1, c[0x0][0x28] 
 MOV R0, c[0x0][0x168] 
 ULDC.64 UR4, c[0x0][0x160] 
 STG.E.SYS [UR4], R0 
 EXIT 

The uniform register load also disappears for all compilations below SM75, which makes sense (since uniform registers don’t exist on those platforms).

My question is: why is there an extra load into the uniform register for code compiled for SM80 and SM89? There must be a reason why the compiler emits the load, because it’s not exactly a natural transformation (and I’d expect it to have been optimised away, since it’s clear that the register isn’t used).

1 Like