PTX instructions are reordered

I wrote the following PTX pattern and compiled it with NVCC:

ld.global.L2::128B xx
wmma.mma.sync xx

ld.global.L2::128B xx
wmma.mma.sync xx

ld.global.L2::128B xx
wmma.mma.sync xx

Here is the final SASS code I got:

 LDG.E.LTC128B.128.SYS R20, [R20] 
 LDG.E.LTC128B.128.SYS R16, [R16] 
 LDG.E.LTC128B.128.SYS R24, [R24] 

 HMMA.1688.F16 R40, R38, R48, R40 
 LDSM.16.M88.2 R30, [R65+0x30] 
 HMMA.1688.F16 R44, R38, R46, R44 
 LDSM.16.M88.2 R38, [R65+0x40] 
 HMMA.1688.F16 R48, R42, R49, R40 
 LDSM.16.M88.2 R40, [R66+0x40] 
 HMMA.1688.F16 R44, R42, R47, R44 

But I would like the final SASS code to look like this:

 LDG.E.LTC128B.128.SYS R20, [R20] 
 HMMA.1688.F16 R40, R38, R48, R40 
 LDSM.16.M88.2 R30, [R65+0x30] 
 HMMA.1688.F16 R44, R38, R46, R44 
 LDSM.16.M88.2 R38, [R65+0x40] 
 HMMA.1688.F16 R48, R42, R49, R40 

 LDG.E.LTC128B.128.SYS R20, [R20] 
 HMMA.1688.F16 R40, R38, R48, R40 
 LDSM.16.M88.2 R30, [R65+0x30] 
 HMMA.1688.F16 R44, R38, R46, R44 
 LDSM.16.M88.2 R38, [R65+0x40] 
 HMMA.1688.F16 R48, R42, R49, R40

ptxas, the part of the compiler that transforms PTX into machine code (SASS), is an optimizing compiler. It schedules instructions in the order that it believes will maximize performance, based on a multitude of heuristics. These heuristics are correct in the vast majority of cases.

One of the optimizations ptxas applies is scheduling loads early. Loads are instructions with long and variable latency. While SIMT can absorb most of that latency, scheduling loads early will absorb some more, which often helps and thus makes the code more robust from a performance perspective. On some architectures, batching the loads also leads to more efficient use of internal load/store hardware structures. The drawback of early loads is an increase in register pressure, but as long as ptxas determines that this is unlikely to cause problems, it will go ahead.
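As a concrete illustration of this behavior (a hedged sketch, not the poster's actual kernel), even source code that interleaves one load with one compute step per line may have all its loads hoisted to the top of the loop body by ptxas:

```cuda
// Illustrative only: a typical inner loop. Even though the source
// interleaves one load with one FMA per step, ptxas is free to hoist
// all three loads to the top of the iteration to hide their latency,
// at the cost of three extra live registers.
__global__ void interleaved(const float *g, float *out, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i += 3) {
        float a = g[i];     acc += a * a;   // ptxas may issue g[i],
        float b = g[i + 1]; acc += b * b;   // g[i+1], and g[i+2] back to
        float c = g[i + 2]; acc += c * c;   // back before any FFMA
    }
    out[0] = acc;
}
```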

NVIDIA does not provide tools to the public for writing code at SASS level.

Thanks, @njuffa. I understand your point.

According to my profiling results, if all loads are grouped together at the beginning of the GEMM, Tensor Core pipe utilization drops, because at that moment every issued instruction is dispatched to the LD/ST units. So I want to interleave LD and MMA instructions to keep the Tensor Cores busy all the time.

As you said, SASS-level coding is not officially supported by NVIDIA, and TuringAs is not ready yet. I think it is time to stop my optimization work at this point.

Wouldn’t inserting memory barriers at strategic locations prevent such reordering?

@cbuchner1, great point! The SASS order changed with barriers. But this is only a debug version; I need more work to prove it actually helps performance. Also, I am planning to use mma PTX instructions instead of wmma.

 LDG.E.LTC128B.128.SYS R16, [R16]  #1st LDG
 BAR.ARV 0x0, 0x100 
 LDSM.16.M88.2 R36, [R63+0x20] 
 LDSM.16.M88.2 R38, [R64+0x20] 
 LDSM.16.M88.2 R40, [R64+0x4a0] 
 LDSM.16.M88.2 R42, [R63+0x30] 
 LDG.E.LTC128B.128.SYS R20, [R20]  #2nd LDG

The LDSMs are now inserted between the two LDGs.
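For reference, a hedged sketch of how such a barrier might appear in CUDA source (names, fragment shapes, and the 256-thread block size are illustrative assumptions, not the poster's actual kernel). The arrive-only PTX barrier bar.arrive compiles to the BAR.ARV seen above; it does not wait, but ptxas will not move memory instructions across it:

```cuda
#include <mma.h>
using namespace nvcuda;

// Illustrative fragment only: the arrive-only barrier between the global
// load and the wmma call acts as a scheduling fence.
__device__ void load_then_mma(
    const float4 *gsrc, float4 &staged,
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> &acc,
    const wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> &a,
    const wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> &b)
{
    staged = *gsrc;                        // LDG for the next tile
    asm volatile("bar.arrive 0, 256;");    // compiles to BAR.ARV 0x0, 0x100
    wmma::mma_sync(acc, a, b, acc);        // MMA on the current tile
}
```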

Hi, I want to know whether there is any way to forbid reordering of computation instructions.
For example, I have consecutive instructions

FADD R0, R1, R2
FADD R3, R4, R5
FADD R6, R7, R8

I don’t want the compiler to reorder them. Is there any workaround?

What is your reason?

  • You want to create micro-benchmarks for academic purposes and need a specific SASS instruction order? → Manually patch the SASS. Official support is not needed.

  • You want to improve the performance of a production system beyond what ptxas optimizes? This is much more difficult, especially in the general case. See the suggestion above about inserting memory barriers. Unrolling, or preventing the unrolling of loops, also has an effect.

If you go into more details about your intent, you could get better feedback.

The three FADDs look quite harmless; why are they not optimal in the shown order? They have no dependencies.

Thank you. I am working on a machine-learning compilation system where I want to preserve the order of instructions in the CUDA code, in order to control register bank conflicts, register reuse, data dependencies, etc. The interference of the reordering optimization means that the system will not behave as I intend.

These optimizations could work against what I mean to do. For example, the compiler may unroll loops and batch computation instructions, which uses more registers and reduces occupancy.

In short, I want to control the SASS code compiled from CUDA code. Is it possible to turn off some optimizations such as reordering/unrolling?

If you run ptxas --help you get a list of the available options, which you can forward from nvcc: NVIDIA CUDA Compiler Driver

Loop unrolling can be activated with #pragma unroll and deactivated with #pragma unroll 1; see the CUDA C++ Programming Guide
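A minimal sketch of the two pragmas side by side (kernel name and loop bodies are illustrative):

```cuda
// Illustrative only: the first pragma forbids unrolling, keeping a single
// copy of the loop body; the second requests full unrolling of a loop
// with a compile-time-constant trip count.
__global__ void unroll_demo(float *out, const float *in, int n) {
    float acc = 0.0f;
    #pragma unroll 1                 // deactivate unrolling
    for (int i = 0; i < n; ++i)
        acc += in[i];
    #pragma unroll                   // request full unrolling
    for (int j = 0; j < 8; ++j)
        out[j] = acc * j;
}
```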

Perhaps you can confuse ptxas by defining a local array of registers that you access only with constant indices, inside a loop that in fact runs only once, which ptxas cannot know. It then tries to keep the same registers across each iteration of the loop.

float myregs[40];
for (int i = 0; i < param; i++) { // param is always 1, but ptxas cannot know
    myregs[5] = myregs[8] - myregs[22];
}

Or trick it to not reorder by introducing jump targets, which are never used.

switch (param) { // param always 0
    case 0:
        a = b + c;
        // intentional fall-through
    case 1:
        e = b + d;
    case 2:
    // ...
}

Thanks! I tried that, and it has some effect in preventing reordering. However, it introduces many ISETP and BRA instructions, which hurt performance.

It is also hard to control the register allocation. For example, we write in CUDA:

// g is a global memory pointer
float r[8];

r[0] = g[0];
r[1] = g[1];
r[2] = g[2];
r[3] = g[3];

r[4] += r[0] * r[2];
r[5] += r[0] * r[3];
r[6] += r[1] * r[2];
r[7] += r[1] * r[3];

This is compiled to

                                                                                   /* 0x000fca0000000f00 */
        /*00d0*/                   LDG.E.STRONG.SYS R7, [R2.64] ;                /* 0x0000000402077981 */
                                                                                 /* 0x000ea8000c1f5900 */
        /*00e0*/                   LDG.E.STRONG.SYS R9, [R2.64+0x4] ;            /* 0x0000040402097981 */
                                                                                 /* 0x000ee8000c1f5900 */
        /*00f0*/                   LDG.E.STRONG.SYS R8, [R2.64+0x8] ;            /* 0x0000080402087981 */
                                                                                 /* 0x000ea8000c1f5900 */
        /*0100*/                   LDG.E.STRONG.SYS R10, [R2.64+0xc] ;           /* 0x00000c04020a7981 */
                                                                                 /* 0x000f22000c1f5900 */
        /*0110*/                   IADD3 R0, R0, 0x1, RZ ;                       /* 0x0000000100007810 */
                                                                                 /* 0x000fc80007ffe0ff */
        /*0120*/                   ISETP.GE.AND P0, PT, R0, c[0x0][0x168], PT ;  /* 0x00005a0000007a0c */
                                                                                 /* 0x000fe20003f06270 */
        /*0130*/                   FFMA R6, R7, R8.reuse, R6 ;                   /* 0x0000000807067223 */
                                                                                 /* 0x084fe20000000006 */
        /*0140*/                   FFMA R4, R9.reuse, R8, R4 ;                   /* 0x0000000809047223 */
                                                                                 /* 0x048fe20000000004 */
        /*0150*/                   FFMA R11, R9, R10.reuse, R11 ;                /* 0x0000000a090b7223 */
                                                                                 /* 0x090fe2000000000b */
        /*0160*/                   FFMA R5, R7, R10, R5 ;                        /* 0x0000000a07057223 */

The allocated register indices are not consistent with what we defined: r[0-3] should be consecutive according to our CUDA code, but they are compiled to R7, R9, R8, R10 in this piece of code, which is not consecutive. Not to mention that the compiler may use more registers to represent the same variable, or reuse the same register for different variables… All of these things are out of our control.

It seems the only solution is to patch the SASS code? Are there any manuals for learning it?

SASS is not intended to be directly end-user patchable, or modifiable using NVIDIA provided tools. There is no NVIDIA provided SASS assembler, for example. There are assemblers that have been created by others, these will probably require some practice and expertise to use, and may not give you the results that you want, all things considered.

You cannot simultaneously prevent the ptxas tool from doing its work to prevent reordering, while at the same time get all the compiler benefits towards optimization and performance.


For that, the loop trick often helps; with it, the registers can be reused.
Alternatively, you could try whether you get better results with PTX inline assembly.
The PTX abstract machine has an unlimited number of registers, so normal C++ code is typically compiled to write each register only once (except in e.g. the loop case mentioned above).
In PTX code you can directly repeat a register number, giving the assembler a stronger hint.
Also try limiting maxrregcount to a good fit for the number of registers you want to use.
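A hedged sketch of the register-reuse hint via inline PTX (the function and values are illustrative; ptxas still performs the final register allocation, so this is a hint, not a guarantee):

```cuda
// Illustrative only: both FMAs write the same virtual register %0,
// instead of letting the compiler introduce a fresh value per write.
__device__ float fma_chain(float a, float b, float c) {
    float r = c;
    asm("fma.rn.f32 %0, %1, %2, %0;\n\t"   // r = a * b + r, in place
        "fma.rn.f32 %0, %1, %2, %0;"       // same destination register again
        : "+f"(r)
        : "f"(a), "f"(b));
    return r;
}
```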

The situation is really not ideal for low-level control.

You can access a pair of variables as a 64-bit value (e.g. with reinterpret_cast) in code that is never reached at run time (but the assembler cannot know that), e.g. as long long int, double, or a pointer. This forces the assembler to allocate a consecutive pair of registers, with the first register at an even index, which lets you control the two register banks. In your own code you still access them as single 32-bit registers.
(But there is no final guarantee that ptxas does not add MOV instructions to copy the registers in the end.)
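A hedged sketch of this pairing trick (function name and the never-true guard are illustrative assumptions, not a tested recipe; the reinterpret_cast technically violates strict aliasing, and there is no guarantee ptxas keeps the array in registers rather than spilling it to local memory):

```cuda
// Illustrative only: the 64-bit access in the never-taken branch is meant
// to force pair[0]/pair[1] into an even/odd register pair; the live code
// still treats them as two independent 32-bit floats.
__device__ float paired(float a, float b, int param, long long *sink) {
    float pair[2] = {a, b};        // constant indices only, so ptxas can
                                   // keep the array in registers
    if (param == 12345) {          // never true, but ptxas cannot know
        *sink = *reinterpret_cast<long long *>(pair);  // 64-bit view
    }
    return pair[0] * pair[1];      // normal 32-bit accesses
}
```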

I would directly patch SASS code only for simple instructions (e.g. your FADD/FFMA chains), if you really need it for your application. The instruction words contain fixed latencies and other hints and controls from the compiler. (Although I think incorrect controls would mostly just affect performance.)


Not possible. Likewise, using a compiler like clang or gcc one cannot control x86-64 or ARMv8 assembly code from C++ source code. What works at times is to “massage” HLL code into something that results in machine code closer to what a human envisioned. I have 20+ years of experience with that (x86-64, ARMv7, and GPUs), and found that the approach is super brittle. Usually the next major revision of the compiler destroys the carefully exploited code generation artifact, and one gets to “massage” the HLL code again.

Also, in my experience, about half the time one finds that assembly code carefully crafted by a human still gets beaten in performance by compiler-generated code, and I would consider myself no dummy when it comes to programming performance-optimized code at the assembly language level, having done that since the time of the Z80 processor some 40 years ago.

The reason is usually that modern compilers incorporate more knowledge about the processor hardware and applicable code transformations than a single person can accumulate. This used to be different in the past, e.g. gcc on ARMv7 around 2005.
