While experimenting with layouts in a basic (1-stage) GEMM program written in CuTeDSL, I noticed that the JIT compiler generates dead local-memory stores: values are written to local memory but never read back.
Source Code:
for i_k in range(cute.size(tCrA, [2])):
    # local stores are generated from the two lines below
    cute.autovec_copy(tCsA[None, None, i_k], tCrA[None, None, i_k])
    cute.autovec_copy(tCsB[None, None, i_k], tCrB[None, None, i_k])
    for i_m in range(cute.size(tCrC, [1])):
        for i_n in range(cute.size(tCrC, [2])):
            cute.gemm(
                ....
            )
My guess is that the local-memory stores are generated because the compiler cannot resolve the tCrA and tCrB indices to constants at compile time, since they depend on the loop variable i_k. I tested this guess by unrolling the loop, and the local stores did go away.
However, what I don't understand is that the subsequent FFMA instructions read their operands directly from registers, so the values stored to local memory are never loaded back.
My questions are:
- Is my understanding correct that the local-memory stores are generated because of the dynamic index?
- Why were they not removed by the compiler (-O3), even though they are never read later?
SASS snippet:
LDS.128 R68, [R0+-0x100] ; Load from Shared Memory to Register R68
…
STL.128 [R81+-0x10], R68 ; <-- STORE REGISTER R68 TO LOCAL MEMORY (STACK)
FFMA R4, R68, R72, R4 ; Compute using Register R68
…
STL.128 [R82+-0x10], R72 ; <-- ANOTHER STORE TO LOCAL MEMORY
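The claim that the stores are never read back can be checked mechanically by counting local-memory stores (STL) and loads (LDL) in the SASS text. Below is a minimal sketch in plain Python; the helper name `count_local_memory_ops` and the `SAMPLE` string are illustrative assumptions, not part of CuTeDSL or any NVIDIA tool, and the opcode matching is a simple heuristic on the listing format shown above.

```python
import re
from collections import Counter

# The opcode is the first token on a line, after an optional predicate
# such as "@P1" or "@!PT". Modifiers like ".128" are not captured.
_OPCODE = re.compile(r"^\s*(?:@!?\w+\s+)?([A-Z0-9_]+)")


def count_local_memory_ops(sass: str) -> Counter:
    """Count local-memory stores (STL) and loads (LDL) in a SASS listing.

    Hypothetical helper for illustration only.
    """
    counts = Counter({"STL": 0, "LDL": 0})
    for line in sass.splitlines():
        match = _OPCODE.match(line)
        if match is None:
            continue  # blank lines, kernel names, "…" placeholders
        base = match.group(1)  # e.g. "STL" from "STL.128"
        if base in counts:
            counts[base] += 1
    return counts


# Instructions copied from the snippet above (annotations dropped).
SAMPLE = """\
LDS.128 R68, [R0+-0x100]
STL.128 [R81+-0x10], R68
FFMA R4, R68, R72, R4
STL.128 [R82+-0x10], R72
"""

counts = count_local_memory_ops(SAMPLE)
print(counts["STL"], counts["LDL"])
```

If a listing reports a nonzero STL count and a zero LDL count, the stores have no matching loads in that kernel, which is the situation described here. Passing the full SASS text instead of `SAMPLE` applies the same check to the whole kernel.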
Full SASS
kernel_cutlass_kernel___main__SgemmAmpere_object_at__0
LDC R1, c[0x0][0x28]
S2R R92, SR_CgaCtaId
MOV R5, 0x400
VIADD R1, R1, 0xfffffe00
CS2R R6, SRZ
S2R R85, SR_TID.X
IMAD.MOV.U32 R87, RZ, RZ, RZ
CS2R R36, SRZ
VIADD R3, R1, 0x10
CS2R R38, SRZ
VIADD R80, R1, 0x110
CS2R R8, SRZ
CS2R R10, SRZ
CS2R R40, SRZ
CS2R R42, SRZ
CS2R R12, SRZ
CS2R R14, SRZ
CS2R R44, SRZ
CS2R R46, SRZ
CS2R R16, SRZ
CS2R R18, SRZ
CS2R R48, SRZ
CS2R R50, SRZ
CS2R R20, SRZ
CS2R R22, SRZ
CS2R R52, SRZ
CS2R R54, SRZ
CS2R R24, SRZ
CS2R R26, SRZ
CS2R R56, SRZ
CS2R R58, SRZ
CS2R R28, SRZ
CS2R R30, SRZ
LEA R92, R92, R5, 0x18
CS2R R4, SRZ
CS2R R60, SRZ
CS2R R62, SRZ
LOP3.LUT R67, R85, 0xf, RZ, 0xc0, !PT
IMAD.SHL.U32 R88, R85, 0x4, RZ
LOP3.LUT R89, R85, 0xf0, RZ, 0xc0, !PT
CS2R R32, SRZ
CS2R R34, SRZ
IMAD R90, R67, 0x10, R92
CS2R R64, SRZ
CS2R R66, SRZ
IADD3 R89, R89, 0x1100, R92
VIADD R90, R90, 0x100
LOP3.LUT R88, R88, 0x7c, RZ, 0xc0, !PT
ULDC.64 UR6, c[0x0][0x208]
S2R R84, SR_CTAID.X
S2R R83, SR_CTAID.Y
LDC.64 R70, c[0x0][0x218]
SHF.R.U32.HI R73, RZ, 0x5, R85
IMAD.SHL.U32 R2, R87, 0x8000, RZ
ULDC.64 UR4, c[0x0][0x210]
IMAD.SHL.U32 R0, R84, 0x80, RZ
LEA R69, R73, R88, 0xc
IMAD R73, R73, 0x10, R83
HFMA2.MMA R86, -RZ, RZ, 0, 0.00048828125
IMAD.MOV.U32 R81, RZ, RZ, R80
IADD3 R69, P1, P0, R2, R69, R0
IMAD.U32 R2, R73, 0x80, R88
SHF.R.S32.HI R0, RZ, 0x1f, R0
IMAD R73, R85, 0x10, R92
IMAD.MOV.U32 R82, RZ, RZ, R3
IADD3.X R0, RZ, RZ, R0, P1, P0
LEA R68, P0, R69, UR4, 0x2
LEA R2, R87, R2, 0xe
VIADD R87, R87, 0x1
LEA.HI.X R69, R69, UR5, R0, 0x2, P0
MOV R0, R90
IMAD.WIDE.U32 R70, R2, 0x4, R70
ISETP.NE.AND P0, PT, R87, 0x100, PT
@!PT LDS RZ, [RZ]
@!PT LDS RZ, [RZ]
@!PT LDS RZ, [RZ]
LDGSTS.E.LTC128B.128 desc[UR6][R68.64], [R73]
MOV R2, R89
LDGSTS.E.LTC128B.128 desc[UR6][R70.64], [R73+0x1000]
LDGDEPBAR
DEPBAR.LE SB0, 0x0
BAR.SYNC.DEFER_BLOCKING 0x1, 0x100
LDS.128 R68, [R0+-0x100]
IADD3 R86, R86, 0x200, RZ
LDS.128 R72, [R2+-0x100]
ISETP.NE.AND P1, PT, R86, 0x2000, PT
LDS.128 R76, [R2]
IADD3 R2, R2, 0x200, RZ
STL.128 [R81+-0x10], R68
FFMA R4, R68, R72, R4
FFMA R8, R68, R73, R8
FFMA R12, R68, R74, R12
FFMA R16, R68, R75, R16
FFMA R20, R68, R76, R20
FFMA R24, R68, R77, R24
FFMA R28, R68, R78, R28
FFMA R32, R68, R79, R32
FFMA R5, R69, R72, R5
FFMA R9, R69, R73, R9
FFMA R13, R69, R74, R13
FFMA R17, R69, R75, R17
FFMA R21, R69, R76, R21
FFMA R25, R69, R77, R25
FFMA R29, R69, R78, R29
FFMA R33, R69, R79, R33
FFMA R6, R70, R72, R6
FFMA R10, R70, R73, R10
FFMA R14, R70, R74, R14
FFMA R18, R70, R75, R18
FFMA R22, R70, R76, R22
FFMA R26, R70, R77, R26
FFMA R30, R70, R78, R30
FFMA R34, R70, R79, R34
FFMA R7, R71, R72, R7
FFMA R11, R71, R73, R11
FFMA R15, R71, R74, R15
FFMA R19, R71, R75, R19
FFMA R23, R71, R76, R23
FFMA R27, R71, R77, R27
FFMA R31, R71, R78, R31
FFMA R35, R71, R79, R35
LDS.128 R68, [R0]
IADD3 R0, R0, 0x200, RZ
STL.128 [R81], R68
FFMA R36, R68, R72, R36
FFMA R40, R68, R73, R40
FFMA R44, R68, R74, R44
STL.128 [R82+-0x10], R72
FFMA R48, R68, R75, R48
FFMA R52, R68, R76, R52
FFMA R56, R68, R77, R56
STL.128 [R82], R76
FFMA R60, R68, R78, R60
FFMA R64, R68, R79, R64
FFMA R37, R69, R72, R37
FFMA R41, R69, R73, R41
FFMA R45, R69, R74, R45
FFMA R49, R69, R75, R49
FFMA R53, R69, R76, R53
FFMA R57, R69, R77, R57
FFMA R61, R69, R78, R61
FFMA R65, R69, R79, R65
FFMA R38, R70, R72, R38
FFMA R42, R70, R73, R42
FFMA R46, R70, R74, R46
FFMA R50, R70, R75, R50
FFMA R54, R70, R76, R54
FFMA R58, R70, R77, R58
FFMA R62, R70, R78, R62
FFMA R66, R70, R79, R66
FFMA R39, R71, R72, R39
FFMA R43, R71, R73, R43
FFMA R47, R71, R74, R47
FFMA R51, R71, R75, R51
FFMA R55, R71, R76, R55
IADD3 R81, R81, 0x20, RZ
FFMA R59, R71, R77, R59
FFMA R63, R71, R78, R63
FFMA R67, R71, R79, R67
IADD3 R82, R82, 0x20, RZ
@P1 BRA 0x7e4355f95c20
BAR.SYNC.DEFER_BLOCKING 0x1, 0x100
@P0 BRA 0x7e4355f95a30
S2R R0, SR_TID.X
IMAD R83, R83, 0x1000, R84
ULDC.64 UR4, c[0x0][0x220]
IMAD.SHL.U32 R83, R83, 0x80, RZ
IMAD R0, R0, 0x100, R0
IMAD.SHL.U32 R0, R0, 0x4, RZ
LOP3.LUT R0, R0, 0x3c03c, RZ, 0xc0, !PT
IADD3 R3, P0, R0, R83, RZ
LEA.HI.X.SX32 R0, R83, RZ, 0x1, P0
LEA R2, P0, R3, UR4, 0x2
LEA.HI.X R3, R3, UR5, R0, 0x2, P0
STG.E.128 desc[UR6][R2.64], R4
STG.E.128 desc[UR6][R2.64+0x100], R36
STG.E.128 desc[UR6][R2.64+0x4000], R8
STG.E.128 desc[UR6][R2.64+0x4100], R40
STG.E.128 desc[UR6][R2.64+0x8000], R12
STG.E.128 desc[UR6][R2.64+0x8100], R44
STG.E.128 desc[UR6][R2.64+0xc000], R16
STG.E.128 desc[UR6][R2.64+0xc100], R48
STG.E.128 desc[UR6][R2.64+0x100000], R20
STG.E.128 desc[UR6][R2.64+0x100100], R52
STG.E.128 desc[UR6][R2.64+0x104000], R24
STG.E.128 desc[UR6][R2.64+0x104100], R56
STG.E.128 desc[UR6][R2.64+0x108000], R28
STG.E.128 desc[UR6][R2.64+0x108100], R60
STG.E.128 desc[UR6][R2.64+0x10c000], R32
STG.E.128 desc[UR6][R2.64+0x10c100], R64
EXIT
BRA 0x7e4355f962f0
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP