PTX instructions are reordered

I fed this PTX code to the NVCC compiler:

ld.global.L2::128B xx
wmma.mma.sync xx

ld.global.L2::128B xx
wmma.mma.sync xx

ld.global.L2::128B xx
wmma.mma.sync xx

Here is the final SASS code I got:

 LDG.E.LTC128B.128.SYS R20, [R20] 
 LDG.E.LTC128B.128.SYS R16, [R16] 
 LDG.E.LTC128B.128.SYS R24, [R24] 

 HMMA.1688.F16 R40, R38, R48, R40 
 LDSM.16.M88.2 R30, [R65+0x30] 
 HMMA.1688.F16 R44, R38, R46, R44 
 LDSM.16.M88.2 R38, [R65+0x40] 
 HMMA.1688.F16 R48, R42, R49, R40 
 LDSM.16.M88.2 R40, [R66+0x40] 
 HMMA.1688.F16 R44, R42, R47, R44 

But I would like the final SASS code to look like this:

 LDG.E.LTC128B.128.SYS R20, [R20] 
 HMMA.1688.F16 R40, R38, R48, R40 
 LDSM.16.M88.2 R30, [R65+0x30] 
 HMMA.1688.F16 R44, R38, R46, R44 
 LDSM.16.M88.2 R38, [R65+0x40] 
 HMMA.1688.F16 R48, R42, R49, R40 

 LDG.E.LTC128B.128.SYS R20, [R20] 
 HMMA.1688.F16 R40, R38, R48, R40 
 LDSM.16.M88.2 R30, [R65+0x30] 
 HMMA.1688.F16 R44, R38, R46, R44 
 LDSM.16.M88.2 R38, [R65+0x40] 
 HMMA.1688.F16 R48, R42, R49, R40

ptxas, the part of the compiler that transforms PTX into machine code (SASS), is an optimizing compiler. It schedules instructions in the order that it believes will maximize performance, based on a multitude of heuristics. These heuristics are correct in the vast majority of cases.

One of the optimizations ptxas applies is scheduling loads early. Loads are instructions with long and variable latency. While SIMT can absorb most of that latency, scheduling loads early will absorb some more, which often helps and thus makes the code more robust from a performance perspective. On some architectures, batching the loads also leads to more efficient use of internal load/store hardware structures. The drawback of early loads is an increase in register pressure, but as long as ptxas determines that this is unlikely to cause problems, it will go ahead.
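As a rough illustration (a hypothetical kernel, not the original poster's actual code), the PTX pattern above corresponds to a WMMA inner loop like the following at the CUDA C++ level. Because the three global loads are independent of the preceding `wmma` calls, ptxas is free to hoist all of them to the top of the block:

```cuda
#include <mma.h>
using namespace nvcuda;

// Hypothetical inner-loop fragment: three independent load+MMA pairs,
// mirroring the three ld.global / wmma.mma.sync pairs in the PTX above.
// Since each load has no dependency on the previous wmma.mma.sync,
// ptxas may batch all three LDGs at the start of the block.
__global__ void gemm_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < 3; ++k) {                         // three load+MMA pairs
        wmma::load_matrix_sync(fa, a + k * 16 * 16, 16);  // compiles to LDG/LDSM
        wmma::load_matrix_sync(fb, b + k * 16 * 16, 16);
        wmma::mma_sync(acc, fa, fb, acc);                 // compiles to HMMA
    }
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}
```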

NVIDIA does not provide tools to the public for writing code at SASS level.

Thanks, @njuffa. I understand your point.

According to my profiling results, if all loads are grouped together at the beginning of the GEMM main loop, the Tensor Core pipe utilization drops, since all issued instructions are dispatched to the LD/ST units at that moment. So I want to interleave LD instructions with MMA instructions to keep the Tensor Cores busy all the time.

As you said, SASS-level coding is not officially supported by NVIDIA, and TuringAs is not ready yet. I think it's time to stop my optimization work at this point.

Wouldn’t inserting memory barriers at strategic locations prevent such reordering?

@cbuchner1, great point! The SASS order changed with barriers. But this is only a debug version; I need more work to prove that it actually improves performance. Also, I am planning to use mma PTX instructions instead of wmma.

 LDG.E.LTC128B.128.SYS R16, [R16]  #1st LDG
 BAR.ARV 0x0, 0x100 
 LDSM.16.M88.2 R36, [R63+0x20] 
 LDSM.16.M88.2 R38, [R64+0x20] 
 LDSM.16.M88.2 R40, [R64+0x4a0] 
 LDSM.16.M88.2 R42, [R63+0x30] 
 LDG.E.LTC128B.128.SYS R20, [R20]  #2nd LDG

The LDSMs are now inserted between the two LDGs.
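For reference, a minimal sketch (my own, hedged) of how the barrier trick can be expressed in CUDA C++: a `bar.arrive` issued via inline PTX compiles to the BAR.ARV seen in the listing above, and ptxas will not move loads across it. The barrier number and thread count here are placeholders; a real kernel must pick operands matching its block configuration.

```cuda
// Hypothetical scheduling barrier: bar.arrive compiles to BAR.ARV, which
// ptxas treats as a reordering fence for surrounding memory instructions.
// Barrier 0 with 256 threads matches the "BAR.ARV 0x0, 0x100" in the SASS
// above, assuming a 256-thread block.
#define SCHED_BARRIER() asm volatile("bar.arrive 0, 256;" ::: "memory")

// Intended placement inside the main loop (pseudocode in comments):
//   load tile k+1 from global memory    -> LDG
//   SCHED_BARRIER();                    -> BAR.ARV 0x0, 0x100
//   load fragments from shared memory   -> LDSM
//   mma on tile k                       -> HMMA
```

Note that the barrier has a real cost and semantic effect (the arriving threads count toward barrier 0), so this is a debugging aid for studying the schedule, not a free scheduling hint.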