Consider the following code.
#include <cstdint>

__global__ void copy(uint32_t *z, uint32_t x) {
    *z = x;
}
The generated PTX (shown here with demangled names) is fairly straightforward.
.visible .entry copy(unsigned int*, unsigned int)(
    .param .u64 copy(unsigned int*, unsigned int)_param_0,
    .param .u32 copy(unsigned int*, unsigned int)_param_1
)
{
    ld.param.u64 %rd1, [copy(unsigned int*, unsigned int)_param_0];
    ld.param.u32 %r1, [copy(unsigned int*, unsigned int)_param_1];
    cvta.to.global.u64 %rd2, %rd1;
    st.global.u32 [%rd2], %r1;
    ret;
}
But on NVCC 12.0.0 and above (compiling with -gencode arch=compute_XX,code=sm_XX), the generated SASS includes an apparently unused load into the uniform register UR4 when targeting SM80 and SM89, but not SM75.
// SM80 and SM89
copy(unsigned int*, unsigned int):
    IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28]
    IMAD.MOV.U32 R5, RZ, RZ, c[0x0][0x168]
    MOV R2, c[0x0][0x160]
    ULDC.64 UR4, c[0x0][0x118]  // This line here
    MOV R3, c[0x0][0x164]
    STG.E [R2.64], R5
    EXIT
For SM90 we get the same load, but there it is actually consumed by the store, as the memory descriptor in desc[UR4]. I'd expect that, given Hopper's TMA support.
// SM90
copy(unsigned int*, unsigned int):
    LDC R1, c[0x0][0x28]
    LDC R5, c[0x0][0x218]
    ULDC.64 UR4, c[0x0][0x208]
    LDC.64 R2, c[0x0][0x210]
    STG.E desc[UR4][R2.64], R5
    EXIT
For SM75 the uniform-register load is also present, but there it is clearly used: it holds the destination address that the store reads.
// SM75
copy(unsigned int*, unsigned int):
    MOV R1, c[0x0][0x28]
    MOV R0, c[0x0][0x168]
    ULDC.64 UR4, c[0x0][0x160]
    STG.E.SYS [UR4], R0
    EXIT
The uniform-register load also disappears entirely when compiling for anything below SM75, which makes sense, since the uniform datapath was only introduced with Turing.
My question is: why is there an extra load into the uniform register in code compiled for SM80 and SM89? There must be a reason the compiler emits it, because it isn't a natural transformation, and I'd expect a genuinely dead load to have been optimised away, since the register is never read afterwards.