Dead code for local memory stores

While experimenting with layouts in a basic (1-stage) GEMM program written in CuTeDSL, I noticed that the JIT compiler generates what looks like dead code: local memory stores whose values are never read back.

Source Code:

        for i_k in range(cute.size(tCrA, [2])):
            # local stores are generated from the two lines below
            cute.autovec_copy(tCsA[None, None, i_k], tCrA[None, None, i_k]) 
            cute.autovec_copy(tCsB[None, None, i_k], tCrB[None, None, i_k])
            for i_m in range(cute.size(tCrC, [1])):
                for i_n in range(cute.size(tCrC, [2])):
                    cute.gemm(
                        ....
                    )

My guess is that the local memory stores are generated because the compiler cannot resolve the tCrA and tCrB indices to constants at compile time, since they depend on the loop variable i_k. I confirmed this guess by unrolling the loop, after which the local stores went away.

What I fail to understand, however, is that the subsequent FFMA instructions read their operands directly from registers, and the values stored to local memory are never loaded back.
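One quick way to check the "never loaded back" part is to count local stores (STL) against local loads (LDL) in the SASS text. The snippet below is only a minimal sketch: `count_local_ops` is a hypothetical helper name of mine, and it assumes the dump is plain text with one instruction per line, optionally prefixed by a predicate such as `@P0`.

```python
import re
from collections import Counter

# Hypothetical helper, not part of CuTeDSL or any NVIDIA tool.
# Matches an optional predicate (e.g. "@P0", "@!PT") followed by STL or LDL,
# with an optional suffix such as ".128".
LOCAL_OP = re.compile(r"^\s*(?:@!?\w+\s+)?(STL|LDL)(?:\.\S+)?\s")


def count_local_ops(sass_text: str) -> Counter:
    """Count local-memory stores (STL) and loads (LDL) in a SASS listing."""
    counts = Counter({"STL": 0, "LDL": 0})
    for line in sass_text.splitlines():
        match = LOCAL_OP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Four instructions excerpted from the dump in this post.
sample = """\
      LDS.128 R68, [R0+-0x100]
      STL.128 [R81+-0x10], R68
      FFMA R4, R68, R72, R4
      STL.128 [R82+-0x10], R72
"""

counts = count_local_ops(sample)
print(counts["STL"], counts["LDL"])
```

Any STL count above zero combined with an LDL count of zero is the pattern I am describing: the values are written to the stack but nothing in the kernel reads them back. The full listing in this post contains four STL.128 instructions and no LDL.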

My questions are:

  1. Is my understanding correct that the local memory stores are generated because of the dynamic index?

  2. Why were they not removed by the compiler (-O3) even though they are never used later?

SASS snippet:

      LDS.128 R68, [R0+-0x100]      ; Load from shared memory into register R68
      …
      STL.128 [R81+-0x10], R68      ; <-- store register R68 to local memory (stack)
      FFMA R4, R68, R72, R4         ; Compute using register R68
      …
      STL.128 [R82+-0x10], R72      ; <-- another store to local memory

Full SASS

kernel_cutlass_kernel___main__SgemmAmpere_object_at__0
      LDC R1, c[0x0][0x28]
      S2R R92, SR_CgaCtaId
      MOV R5, 0x400
      VIADD R1, R1, 0xfffffe00
      CS2R R6, SRZ
      S2R R85, SR_TID.X
      IMAD.MOV.U32 R87, RZ, RZ, RZ
      CS2R R36, SRZ
      VIADD R3, R1, 0x10
      CS2R R38, SRZ
      VIADD R80, R1, 0x110
      CS2R R8, SRZ
      CS2R R10, SRZ
      CS2R R40, SRZ
      CS2R R42, SRZ
      CS2R R12, SRZ
      CS2R R14, SRZ
      CS2R R44, SRZ
      CS2R R46, SRZ
      CS2R R16, SRZ
      CS2R R18, SRZ
      CS2R R48, SRZ
      CS2R R50, SRZ
      CS2R R20, SRZ
      CS2R R22, SRZ
      CS2R R52, SRZ
      CS2R R54, SRZ
      CS2R R24, SRZ
      CS2R R26, SRZ
      CS2R R56, SRZ
      CS2R R58, SRZ
      CS2R R28, SRZ
      CS2R R30, SRZ
      LEA R92, R92, R5, 0x18
      CS2R R4, SRZ
      CS2R R60, SRZ
      CS2R R62, SRZ
      LOP3.LUT R67, R85, 0xf, RZ, 0xc0, !PT
      IMAD.SHL.U32 R88, R85, 0x4, RZ
      LOP3.LUT R89, R85, 0xf0, RZ, 0xc0, !PT
      CS2R R32, SRZ
      CS2R R34, SRZ
      IMAD R90, R67, 0x10, R92
      CS2R R64, SRZ
      CS2R R66, SRZ
      IADD3 R89, R89, 0x1100, R92
      VIADD R90, R90, 0x100
      LOP3.LUT R88, R88, 0x7c, RZ, 0xc0, !PT
      ULDC.64 UR6, c[0x0][0x208]
      S2R R84, SR_CTAID.X
      S2R R83, SR_CTAID.Y
      LDC.64 R70, c[0x0][0x218]
      SHF.R.U32.HI R73, RZ, 0x5, R85
      IMAD.SHL.U32 R2, R87, 0x8000, RZ
      ULDC.64 UR4, c[0x0][0x210]
      IMAD.SHL.U32 R0, R84, 0x80, RZ
      LEA R69, R73, R88, 0xc
      IMAD R73, R73, 0x10, R83
      HFMA2.MMA R86, -RZ, RZ, 0, 0.00048828125
      IMAD.MOV.U32 R81, RZ, RZ, R80
      IADD3 R69, P1, P0, R2, R69, R0
      IMAD.U32 R2, R73, 0x80, R88
      SHF.R.S32.HI R0, RZ, 0x1f, R0
      IMAD R73, R85, 0x10, R92
      IMAD.MOV.U32 R82, RZ, RZ, R3
      IADD3.X R0, RZ, RZ, R0, P1, P0
      LEA R68, P0, R69, UR4, 0x2
      LEA R2, R87, R2, 0xe
      VIADD R87, R87, 0x1
      LEA.HI.X R69, R69, UR5, R0, 0x2, P0
      MOV R0, R90
      IMAD.WIDE.U32 R70, R2, 0x4, R70
      ISETP.NE.AND P0, PT, R87, 0x100, PT
@!PT  LDS RZ, [RZ]
@!PT  LDS RZ, [RZ]
@!PT  LDS RZ, [RZ]
      LDGSTS.E.LTC128B.128 desc[UR6][R68.64], [R73]
      MOV R2, R89
      LDGSTS.E.LTC128B.128 desc[UR6][R70.64], [R73+0x1000]
      LDGDEPBAR
      DEPBAR.LE SB0, 0x0
      BAR.SYNC.DEFER_BLOCKING 0x1, 0x100
      LDS.128 R68, [R0+-0x100]
      IADD3 R86, R86, 0x200, RZ
      LDS.128 R72, [R2+-0x100]
      ISETP.NE.AND P1, PT, R86, 0x2000, PT
      LDS.128 R76, [R2]
      IADD3 R2, R2, 0x200, RZ
      STL.128 [R81+-0x10], R68
      FFMA R4, R68, R72, R4
      FFMA R8, R68, R73, R8
      FFMA R12, R68, R74, R12
      FFMA R16, R68, R75, R16
      FFMA R20, R68, R76, R20
      FFMA R24, R68, R77, R24
      FFMA R28, R68, R78, R28
      FFMA R32, R68, R79, R32
      FFMA R5, R69, R72, R5
      FFMA R9, R69, R73, R9
      FFMA R13, R69, R74, R13
      FFMA R17, R69, R75, R17
      FFMA R21, R69, R76, R21
      FFMA R25, R69, R77, R25
      FFMA R29, R69, R78, R29
      FFMA R33, R69, R79, R33
      FFMA R6, R70, R72, R6
      FFMA R10, R70, R73, R10
      FFMA R14, R70, R74, R14
      FFMA R18, R70, R75, R18
      FFMA R22, R70, R76, R22
      FFMA R26, R70, R77, R26
      FFMA R30, R70, R78, R30
      FFMA R34, R70, R79, R34
      FFMA R7, R71, R72, R7
      FFMA R11, R71, R73, R11
      FFMA R15, R71, R74, R15
      FFMA R19, R71, R75, R19
      FFMA R23, R71, R76, R23
      FFMA R27, R71, R77, R27
      FFMA R31, R71, R78, R31
      FFMA R35, R71, R79, R35
      LDS.128 R68, [R0]
      IADD3 R0, R0, 0x200, RZ
      STL.128 [R81], R68
      FFMA R36, R68, R72, R36
      FFMA R40, R68, R73, R40
      FFMA R44, R68, R74, R44
      STL.128 [R82+-0x10], R72
      FFMA R48, R68, R75, R48
      FFMA R52, R68, R76, R52
      FFMA R56, R68, R77, R56
      STL.128 [R82], R76
      FFMA R60, R68, R78, R60
      FFMA R64, R68, R79, R64
      FFMA R37, R69, R72, R37
      FFMA R41, R69, R73, R41
      FFMA R45, R69, R74, R45
      FFMA R49, R69, R75, R49
      FFMA R53, R69, R76, R53
      FFMA R57, R69, R77, R57
      FFMA R61, R69, R78, R61
      FFMA R65, R69, R79, R65
      FFMA R38, R70, R72, R38
      FFMA R42, R70, R73, R42
      FFMA R46, R70, R74, R46
      FFMA R50, R70, R75, R50
      FFMA R54, R70, R76, R54
      FFMA R58, R70, R77, R58
      FFMA R62, R70, R78, R62
      FFMA R66, R70, R79, R66
      FFMA R39, R71, R72, R39
      FFMA R43, R71, R73, R43
      FFMA R47, R71, R74, R47
      FFMA R51, R71, R75, R51
      FFMA R55, R71, R76, R55
      IADD3 R81, R81, 0x20, RZ
      FFMA R59, R71, R77, R59
      FFMA R63, R71, R78, R63
      FFMA R67, R71, R79, R67
      IADD3 R82, R82, 0x20, RZ
@P1   BRA 0x7e4355f95c20
      BAR.SYNC.DEFER_BLOCKING 0x1, 0x100
@P0   BRA 0x7e4355f95a30
      S2R R0, SR_TID.X
      IMAD R83, R83, 0x1000, R84
      ULDC.64 UR4, c[0x0][0x220]
      IMAD.SHL.U32 R83, R83, 0x80, RZ
      IMAD R0, R0, 0x100, R0
      IMAD.SHL.U32 R0, R0, 0x4, RZ
      LOP3.LUT R0, R0, 0x3c03c, RZ, 0xc0, !PT
      IADD3 R3, P0, R0, R83, RZ
      LEA.HI.X.SX32 R0, R83, RZ, 0x1, P0
      LEA R2, P0, R3, UR4, 0x2
      LEA.HI.X R3, R3, UR5, R0, 0x2, P0
      STG.E.128 desc[UR6][R2.64], R4
      STG.E.128 desc[UR6][R2.64+0x100], R36
      STG.E.128 desc[UR6][R2.64+0x4000], R8
      STG.E.128 desc[UR6][R2.64+0x4100], R40
      STG.E.128 desc[UR6][R2.64+0x8000], R12
      STG.E.128 desc[UR6][R2.64+0x8100], R44
      STG.E.128 desc[UR6][R2.64+0xc000], R16
      STG.E.128 desc[UR6][R2.64+0xc100], R48
      STG.E.128 desc[UR6][R2.64+0x100000], R20
      STG.E.128 desc[UR6][R2.64+0x100100], R52
      STG.E.128 desc[UR6][R2.64+0x104000], R24
      STG.E.128 desc[UR6][R2.64+0x104100], R56
      STG.E.128 desc[UR6][R2.64+0x108000], R28
      STG.E.128 desc[UR6][R2.64+0x108100], R60
      STG.E.128 desc[UR6][R2.64+0x10c000], R32
      STG.E.128 desc[UR6][R2.64+0x10c100], R64
      EXIT
      BRA 0x7e4355f962f0
      NOP
      NOP
      NOP
      NOP
      NOP
      NOP
      NOP
      NOP

Could the compiler have aliasing concerns, i.e. that the source and target arrays might overlap in local memory?

For function parameters there is the __restrict__ keyword.

But it could also be something else.
