On the register allocation optimization of the CUDA compiler

As far as I know, when optimization is enabled, the compiler tries to use as few registers as possible for better occupancy.
However, on every architecture a certain number of registers per thread is available even at full occupancy.
For example, on Kepler/Maxwell/Pascal there are 65536 32-bit registers per SM, which can hold at most 2048 threads.
Thus there are at least 32 32-bit registers per thread. On TU102, every SM can hold only up to 1024 threads, which means at least 64 32-bit registers are available per thread.

AFAIK, CUDA hardware is not capable of register renaming, so those unused registers sit completely idle and are wasted. In my case I have a bunch of long-live-range registers, and the compiler then ignores that lower bound on the register count and makes heavy use of just a few short-live-range registers as temporaries in almost every instruction. This creates severe dependencies and the pipeline stalls far too often. That hurts performance badly, since I don’t have enough threads to hide the latency; I need more ILP to help. But I cannot control the register allocation in PTXAS, and there is no native assembler available for Turing (I know about maxas for Maxwell, but it’s in Perl).

Is there any plan for PTXAS to make better use of those idle registers rather than driving the register count as low as possible? It may not be just a matter of renaming registers; some rescheduling may also be required to gain more ILP.
If NVIDIA could provide a native Turing assembler, I could do this manually, so that would also help~

How do you know this? Are you basing this simply on a visual inspection of the output of cuobjdump --dump-sass? What does the CUDA profiler say about bottlenecks in your application?

For the most part, GPUs issue one instruction per cycle per thread. Depending on architecture, sometimes two. I am pretty sure never more than two. ILP is not something worthy of much attention on GPUs. Based on your description, you seem to be saying you want better instruction scheduling, so that latency tolerance is increased by scheduling dependent instructions further apart. In my experience, the compiler typically makes good choices here. If you can demonstrate that the choice is poor, consider filing an enhancement request with NVIDIA via the bug-reporting page.

How many threads exactly? Your use case may not be suitable for the GPU; GPUs are not universal accelerators. If this is just one small short-running kernel sandwiched between heftier ones, it might still make sense to keep the work on the GPU if the data is already resident there.

CUDA has been around for a dozen years, and NVIDIA has never made a SASS assembler available for any GPU architecture during that time.

Thank you for your reply~ njuffa

Most of the stalls are caused by “Execution Dependency” (>50%), according to the profiler. I usually have a few hundred threads, which is probably not sufficient to cover the memory latency, so the pipeline stalls too often.

Let me give a simple example demonstrating the effect of using more registers.
Suppose we need to compute the cube of three variables, %fR<3> = %fV<3>**3, but we only have one temporary register %fT0 available:

.reg .f32 %fT0;
    .reg .f32 %fV<3>;
    .reg .f32 %fR<3>;

    mul.rn.f32 %fT0, %fV0, %fV0;   // Set barrier 0
    mul.rn.f32 %fR0, %fT0, %fV0;   // Wait barrier 0

    mul.rn.f32 %fT0, %fV1, %fV1;   // Set barrier 0
    mul.rn.f32 %fR1, %fT0, %fV1;   // Wait barrier 0

    mul.rn.f32 %fT0, %fV2, %fV2;   // Set barrier 0
    mul.rn.f32 %fR2, %fT0, %fV2;   // Wait barrier 0

Thus all the evaluations are serialized. But if we use three temporary registers %fT<3>, they can overlap:

.reg .f32 %fT<3>;
    .reg .f32 %fV<3>;
    .reg .f32 %fR<3>;

    mul.rn.f32 %fT0, %fV0, %fV0;  // Set barrier 0
    mul.rn.f32 %fT1, %fV1, %fV1;  // Set barrier 1
    mul.rn.f32 %fT2, %fV2, %fV2;  // Set barrier 2

    mul.rn.f32 %fR0, %fT0, %fV0;  // Wait barrier 0
    mul.rn.f32 %fR1, %fT1, %fV1;  // Wait barrier 1
    mul.rn.f32 %fR2, %fT2, %fV2;  // Wait barrier 2

This requires only one third as many warps to cover the latency, compared with the previous case.
It seems to me that the compiler prefers the first form since it uses fewer registers.

In my real case, I have quite a lot of long-live-range registers (25+), so the compiler tries to use just a few registers to evaluate most instructions (I find <32 seems to be preferred, maybe due to legacy architectures?). But since I don’t have that many threads, there are plenty of registers available. Setting the CTA limits (.maxntid, .reqntid, etc.) in PTX seems to be no help at all. So I think that’s a drawback of the compiler.

It seems that currently I can only hack the cubin and modify the SASS manually~

I am not sure why your example is showing PTX, which is a virtual architecture and a compiler intermediate format. As such, the compiler will use an SSA (static single assignment) style of virtual register usage, i.e. each register is only written to once. All analysis of register usage and instruction scheduling must take place at the SASS (machine code) level, because PTX is compiled to SASS by PTXAS, an optimizing compiler.

The kind of instruction-level parallelism between single-precision multiplies you envision in your example pretty much does not occur in GPUs. To first order, GPUs execute one instruction per thread per cycle. There are various exceptions to this that depend on architecture and allow limited dual issue, but they usually involve instructions that issue to different execution units, which is not the case in your example. PTXAS is already aware of specific dual-issue capabilities, as it includes the architecture-specific optimizer components of the compiler.

By observation, register allocation heuristics in PTXAS are architecture specific, although the heuristics for some architectures appear to be shared. Best I can tell, PTXAS appears to use the same strategy for Pascal and Maxwell, for example. It is clear from looking at generated code that the compiler is aware of long-latency instructions and will schedule such instructions early if data and control dependencies allow and increased register pressure is tolerable. Like any heuristic-driven process, register allocation and instruction scheduling cannot be expected to be perfect for any given piece of code, but will usually be within 90% of optimal for the vast majority of codes.

Your quest for increased instruction-level parallelism is not well suited to GPUs, which, on a per-thread view, are not wide superscalar out-of-order processors, but rather almost-scalar ones. I would suggest focusing on thread-level parallelism, which is what GPUs are designed for. Using only a few hundred threads will starve the machine, and you are seeing the consequences in the profiler statistics. Your goal would be to utilize on the order of ten thousand threads.

It seems we have different understandings of ILP. To me, instructions without unresolved dependencies can be issued before the previous instruction has completed. Dual issue is one form of ILP, but not all of it. I don’t think this requires OOO, since instructions are still issued in order. Only when the next instruction has an unresolved execution dependency does the warp become ineligible to the warp scheduler. When all warps are ineligible, the warp schedulers sit idle and performance drops.

You can refer to “Better Performance at Lower Occupancy” by Volkov from GTC 2010 for more details:
https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf
It’s mainly about Fermi and Kepler, from long ago, but it is cited in the official documentation:
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#thread-and-block-heuristics
So I think the conclusion still holds for newer architectures.

BTW, I meant I have a few hundred threads per block; I actually launch thousands of blocks, so the GPU is definitely the right choice. I can’t increase the threads per block due to a lot of implementation issues, so I try to improve performance through better ILP with fewer threads, which is just what Volkov did. But the register allocation strategy introduces too many dependencies by reusing just a few registers so frequently, hence the pipeline stalls too often and performance is poor. If the compiler could utilize those idle registers, many of those dependencies could be removed and better ILP could be achieved.
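To make concrete what I mean by ILP with fewer threads, here is a minimal sketch (a hypothetical kernel made up for this discussion, not my real code): each thread carries several independent accumulators, so the FMA of one chain can issue while the others are still in flight.

__global__ void ilpSum(const float* __restrict__ x, float* __restrict__ out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulator chains: no instruction of one chain
    // depends on the result of another chain.
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int i = tid; i + 3 * stride < n; i += 4 * stride) {
        acc0 = fmaf(x[i             ], x[i             ], acc0);
        acc1 = fmaf(x[i +     stride], x[i +     stride], acc1);
        acc2 = fmaf(x[i + 2 * stride], x[i + 2 * stride], acc2);
        acc3 = fmaf(x[i + 3 * stride], x[i + 3 * stride], acc3);
    }
    out[tid] = acc0 + acc1 + acc2 + acc3;   // tail elements omitted for brevity
}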

Well, it seems I can’t get any help from NVIDIA on this, so I’m going to try hacking the SASS myself~

An assembler or a tool to hack SASS directly has never been part of the CUDA toolchain provided by NVIDIA.

I concur with njuffa that providing an analysis of PTX is unconvincing. PTX is an intermediate format that is compiled. Once compiled, the actual (SASS) register usage may have little resemblance to the PTX virtual register usage.

The NVIDIA tools don’t seek to minimize register usage, contrary to your claims. The NVIDIA tools seek to maximize code performance, subject to an upper bound of 255 registers per thread (although that could change).

If you observe that a code is using (for example) 18 registers per thread (and you’ve not provided any register caps such as __launch_bounds__ or -maxrregcount), then it means the compiler does not know how to use additional registers to make the code go faster. AFAIK the compiler does not consider occupancy, or approach your code compilation from the standpoint of considering what multiple threads may do. In the general case, the compiler does not know how many threads per block you intend to launch, and attempting to adjust register usage for occupancy without that information would be simply incorrect.
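For reference, here is a minimal sketch (my own illustration, not taken from your code) of the register controls that do exist; both of them only impose an upper bound, and neither forces ptxas to use more registers:

// Per-kernel cap: at most 256 threads per block, aim for at least 4 resident blocks per SM.
__global__ void __launch_bounds__(256, 4) capped(float* data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;
}

// Per-file cap at compile time, e.g.:
//   nvcc -arch=sm_75 -maxrregcount=64 kernel.cu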

If you have a well organized proposal for how performance could be improved based on SASS analysis, you could provide that as a suggestion/RFE via the bug reporting tool, which is identified in a sticky post at the top of this forum.

I am well familiar with Volkov’s paper, and I have spoken to Mr. Volkov on a number of occasions.

We are in full agreement there. The fundamental mechanism for covering latency in GPUs is thread-level parallelism with zero-overhead switching. That the threads are grouped into warps is really secondary to this fundamental principle. Therefore, the primary goal for CUDA programmers is to get lots of threads going. Volkov’s main point is that there is no strong correlation between occupancy and performance, and this is true, but there is nonetheless a decent amount of correlation.

On GPUs with their almost-scalar in-order pipeline, ILP can help with latency covering to a relatively minor degree, and the CUDA compiler already tries to exploit what is possible based on data and control dependencies in the source code, mostly via advantageous instruction scheduling. As Robert Crovella states, unless constrained by the programmer, the compiler will use as many registers as it determines is necessary to achieve the best performance.

I have been looking at a lot of generated machine code over the entire existence of CUDA and for about five years worked closely with the CUDA compiler team on performance issues caused by code generation. Based on that I am quite confident that the compiler is generally doing an excellent job at this time. That does not mean there couldn’t be issues in isolated cases; maybe you are hitting such a case.

If you have a concrete example, i.e. buildable source code that you can post, that could lead to a fruitful discussion. On the other hand, technical discussions based on vague problem descriptions are not productive in my experience.

As Robert Crovella points out, the CUDA compiler team welcomes ideas for enhancements, in particular those that demonstrate performance improvements of 5% or more, on either a reasonably wide selection of codes or on a particular important but narrower class of applications.

Wow! I really learned quite a lot discussing this with you, njuffa~ Thanks very much! And thanks Robert for your clarification~
Sorry for the confusion about the PTX code; I just copied some PTX snippets to demonstrate that using more temporary registers could gain more ILP, not to judge the register allocation from PTX code.

Now I need to check my starting point again. When I examined the SASS code of a kernel written in PTX, I found the core portion contains a long sequence of instructions reusing the same register frequently, for both reads and writes. It seems to me some of them could run in parallel. Since the profiler told me the pipeline is mostly stalled due to execution dependency (>50%), I thought that was the hotspot.

The register usage for the kernel is roughly 34 on TU102, where at least 64 would be available per thread. I tried to provide some performance hints on the entry function (such as .maxntid, .reqntid, etc., and also --maxrregcount for ptxas), but nothing changed. So I thought it might be caused by the compiler trying to reduce register usage too aggressively. Or maybe the compiler thought the code was already optimal.

I can’t provide the original code, but I’ll try to reproduce the scenario with a simpler demo. Hopefully I can make a working example that exposes this behavior of the compiler. That would make this discussion more concrete and targeted.

After carefully examining the generated SASS code, I’m not sure whether this is a register allocation problem, or whether the compiler is reluctant or unable to break long dependency chains. Let’s look at a simple concocted example:

__global__ void doLerp(const float* __restrict__ x, const float* __restrict__ y, float* __restrict__ v)
{
    unsigned int tid = threadIdx.x;
    float s = x[tid];
    s = max(min(s, 0.0f), 1022.99f);

    // Path 1: integer part
    int is = int(s) / 4 + 1;
    float v0 = y[is];
    float v1 = y[is + 1];

    // Path 2: float part
    float qs = s - floorf(s) - 0.5f;
    qs = __saturatef(qs * 2.0f);
    qs = qs*qs;

    // merge again
    v[tid] = fma(qs, v1, fma(qs, -v0, v0));
}

The program is based on a simple lerp. Please don’t get confused by the math here; I just added some extra operations to lengthen the dependency chain. They don’t make much sense mathematically, but they cannot be ignored by the compiler, so forget about the math. We just need to check how the compiler handled these instructions (built with CUDA 9.2 for arch sm_52):

Function : _Z6doLerpPKfS0_Pf
    .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                   /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;                               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R9, R0.reuse, 0x2;                          /* 0x3848000000270009 */
                                                                                   /* 0x001fc800fec007f5 */
        /*0028*/                   SHR.U32 R8, R0, 0x1e;                           /* 0x3828000001e70008 */
        /*0030*/                   IADD R2.CC, R9, c[0x0][0x140];                  /* 0x4c10800005070902 */
        /*0038*/                   IADD.X R3, R8, c[0x0][0x144];                   /* 0x4c10080005170803 */
                                                                                   /* 0x001fc820fcc007b5 */
        /*0048*/                   LDG.E.CI R2, [R2];                              /* 0xeed4a00000070202 */
        /*0050*/                   FMNMX R0, RZ, R2, PT;                           /* 0x5c6003800027ff00 */
        /*0058*/                   FMNMX R0, R0, c[0x2][0x0], !PT;                 /* 0x4c60078800070000 */
                                                                                   /* 0x001fd801fec0071d */
        /*0068*/                   F2I.S32.F32.TRUNC R4, R0;                       /* 0x5cb0018000071a04 */  // path 1
        /*0070*/                   SHR R5, R4, 0x1f;                               /* 0x3829000001f70405 */  // R4 everywhere
        /*0078*/                   LEA.HI R4, R5, R4, RZ, 0x2;                     /* 0x5bdf7f8020470504 */
                                                                                   /* 0x001f8400fec007f6 */
        /*0088*/                   SHR R4, R4, 0x2;                                /* 0x3829000000270404 */
        /*0090*/                   IADD32I R4, R4, 0x1;                            /* 0x1c00000000170404 */
        /*0098*/                   SHR R5, R4, 0x1e;                               /* 0x3829000001e70405 */
                                                                                   /* 0x001c4800fe0007f6 */
        /*00a8*/                   ISCADD R2.CC, R4, c[0x0][0x148], 0x2;           /* 0x4c18810005270402 */
        /*00b0*/         {         IADD.X R3, R5, c[0x0][0x14c];                   /* 0x4c10080005370503 */
        /*00b8*/                   F2F.F32.F32.FLOOR R5, R0;        }              /* 0x5ca8048000070a05 */  // path 2
                                                                                   /* 0x001ec400f62007f0 */
        /*00c8*/         {         IADD R9.CC, R9, c[0x0][0x150];                  /* 0x4c10800005470909 */
        /*00d0*/                   LDG.E.CI R4, [R2];        }                     /* 0xeed4a00000070204 */
        /*00d8*/                   LDG.E.CI R6, [R2+0x4];                          /* 0xeed4a00000470206 */
                                                                                   /* 0x001f9800ffa00ff6 */
        /*00e8*/                   FADD R5, R0, -R5;                               /* 0x5c58200000570005 */  // R5 everywhere
        /*00f0*/                   FADD R5, R5, -0.5;                              /* 0x3958003f00070505 */
        /*00f8*/                   FADD.SAT R5, R5, R5;                            /* 0x5c5c000000570505 */
                                                                                   /* 0x141fc400fea007f1 */
        /*0108*/                   FMUL R7, R5, R5;                                /* 0x5c68000000570507 */
        /*0110*/                   IADD.X R5, R8, c[0x0][0x154];                   /* 0x4c10080005570805 */
        /*0118*/                   FFMA R8, R4, -R7.reuse, R4;                     /* 0x5981020000770408 */
                                                                                   /* 0x001fc400fe4007f5 */
        /*0128*/                   MOV R4, R9;                                     /* 0x5c98078000970004 */
        /*0130*/                   FFMA R6, R6, R7, R8;                            /* 0x5980040000770606 */
        /*0138*/                   STG.E [R4], R6;                                 /* 0xeedc200000070406 */
                                                                                   /* 0x001f8000ffe007ff */
        /*0148*/                   EXIT;                                           /* 0xe30000000007000f */
        /*0150*/                   BRA 0x150;                                      /* 0xe2400fffff87000f */
        /*0158*/                   NOP;                                            /* 0x50b0000000070f00 */
                                                                                   /* 0x001f8000fc0007e0 */
        /*0168*/                   NOP;                                            /* 0x50b0000000070f00 */
        /*0170*/                   NOP;                                            /* 0x50b0000000070f00 */
        /*0178*/                   NOP;                                            /* 0x50b0000000070f00 */
        ..................................

The program contains two independent paths, which diverge at the fractional index and merge at the final lerp.
Each path consists of a long chain of back-to-back dependent instructions. The two paths are completely independent and could be mixed or interleaved to alleviate the dependency problem. But the compiler reused one register heavily (R4 for path 1, R5 for path 2) to build almost all the instructions in each path.

I tried to rearrange the source code to help the compiler interleave those instructions:

__global__ void doLerp_Mix(const float* __restrict__ x, const float* __restrict__ y, float* __restrict__ v)
{
    unsigned int tid = threadIdx.x;
    float s = x[tid];
    s = max(min(s, 0.0f), 1022.99f);

    int is = int(s) / 4 + 1;
    float qs = s - floorf(s) - 0.5f;

    float v0 = y[is];
    qs = __saturatef(qs * 2.0f);

    float v1 = y[is + 1];
    qs = qs*qs;

    // merge again
    v[tid] = fma(qs, v1, fma(qs, -v0, v0));
}

But the generated code does not seem to change at all:

Function : _Z10doLerp_MixPKfS0_Pf
    .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                   /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                           /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;                               /* 0xf0c8000002170000 */
        /*0018*/                   SHL R9, R0.reuse, 0x2;                          /* 0x3848000000270009 */
                                                                                   /* 0x001fc800fec007f5 */
        /*0028*/                   SHR.U32 R8, R0, 0x1e;                           /* 0x3828000001e70008 */
        /*0030*/                   IADD R2.CC, R9, c[0x0][0x140];                  /* 0x4c10800005070902 */
        /*0038*/                   IADD.X R3, R8, c[0x0][0x144];                   /* 0x4c10080005170803 */
                                                                                   /* 0x001fc820fcc007b5 */
        /*0048*/                   LDG.E.CI R2, [R2];                              /* 0xeed4a00000070202 */
        /*0050*/                   FMNMX R0, RZ, R2, PT;                           /* 0x5c6003800027ff00 */
        /*0058*/                   FMNMX R0, R0, c[0x2][0x0], !PT;                 /* 0x4c60078800070000 */
                                                                                   /* 0x001fd801fec0071d */
        /*0068*/                   F2I.S32.F32.TRUNC R4, R0;                       /* 0x5cb0018000071a04 */
        /*0070*/                   SHR R5, R4, 0x1f;                               /* 0x3829000001f70405 */
        /*0078*/                   LEA.HI R4, R5, R4, RZ, 0x2;                     /* 0x5bdf7f8020470504 */
                                                                                   /* 0x001f8400fec007f6 */
        /*0088*/                   SHR R4, R4, 0x2;                                /* 0x3829000000270404 */
        /*0090*/                   IADD32I R4, R4, 0x1;                            /* 0x1c00000000170404 */
        /*0098*/                   SHR R5, R4, 0x1e;                               /* 0x3829000001e70405 */
                                                                                   /* 0x001c4800fe0007f6 */
        /*00a8*/                   ISCADD R2.CC, R4, c[0x0][0x148], 0x2;           /* 0x4c18810005270402 */
        /*00b0*/         {         IADD.X R3, R5, c[0x0][0x14c];                   /* 0x4c10080005370503 */
        /*00b8*/                   F2F.F32.F32.FLOOR R5, R0;        }              /* 0x5ca8048000070a05 */
                                                                                   /* 0x001ec400f62007f0 */
        /*00c8*/         {         IADD R9.CC, R9, c[0x0][0x150];                  /* 0x4c10800005470909 */
        /*00d0*/                   LDG.E.CI R4, [R2];        }                     /* 0xeed4a00000070204 */
        /*00d8*/                   LDG.E.CI R6, [R2+0x4];                          /* 0xeed4a00000470206 */
                                                                                   /* 0x001f9800ffa00ff6 */
        /*00e8*/                   FADD R5, R0, -R5;                               /* 0x5c58200000570005 */
        /*00f0*/                   FADD R5, R5, -0.5;                              /* 0x3958003f00070505 */
        /*00f8*/                   FADD.SAT R5, R5, R5;                            /* 0x5c5c000000570505 */
                                                                                   /* 0x141fc400fea007f1 */
        /*0108*/                   FMUL R7, R5, R5;                                /* 0x5c68000000570507 */
        /*0110*/                   IADD.X R5, R8, c[0x0][0x154];                   /* 0x4c10080005570805 */
        /*0118*/                   FFMA R8, R4, -R7.reuse, R4;                     /* 0x5981020000770408 */
                                                                                   /* 0x001fc400fe4007f5 */
        /*0128*/                   MOV R4, R9;                                     /* 0x5c98078000970004 */
        /*0130*/                   FFMA R6, R6, R7, R8;                            /* 0x5980040000770606 */
        /*0138*/                   STG.E [R4], R6;                                 /* 0xeedc200000070406 */
                                                                                   /* 0x001f8000ffe007ff */
        /*0148*/                   EXIT;                                           /* 0xe30000000007000f */
        /*0150*/                   BRA 0x150;                                      /* 0xe2400fffff87000f */
        /*0158*/                   NOP;                                            /* 0x50b0000000070f00 */
                                                                                   /* 0x001f8000fc0007e0 */
        /*0168*/                   NOP;                                            /* 0x50b0000000070f00 */
        /*0170*/                   NOP;                                            /* 0x50b0000000070f00 */
        /*0178*/                   NOP;                                            /* 0x50b0000000070f00 */
        .......................................

Register pressure isn’t a concern here, since the kernel takes only 10 registers, far below the guaranteed minimum of 32 for sm_52. It seems to me the compiler chose this form of code deliberately rather than being limited by the order of the input source.

I don’t quite understand the strategy the compiler used here, but I thought there could be some benefit in interleaving those two paths, provided a few extra registers are available. I have not checked the latest sm_75, where there could be even more benefit since the float and integer units can be utilized concurrently on Turing. I’ve confirmed that the compiler is capable of mixing independent float and integer operations on some occasions, but I don’t know what prevents it from doing so in this case. Perhaps some other performance consideration? If this is the way it is designed to be, would you please explain why?

doLerp() appears to be very much bound by data movement. There are three global memory loads and one global memory store in about 30 instructions total. What does the profiler say?

By the way, shouldn’t the clamping logic be:

s = max (min (s, 1022.99f), 0.0f);

I assume the purpose is to limit ‘s’ to the interval [0,1023). Assuming that’s correct, you can then change the type of ‘is’ to ‘unsigned int’, saving some instructions. Not that it should matter, because I expect this kernel to be entirely memory bound.

When I compile your (first i.e. doLerp) code on CUDA 10 for sm_52, I get this output from cuobjdump -sass:

$ cuobjdump -sass  t1.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_52
                Function : _Z6doLerpPKfS0_Pf
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                           /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                  /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X ;                      /* 0xf0c8000002170000 */
        /*0018*/                   SHL R8, R0.reuse, 0x2 ;                 /* 0x3848000000270008 */
                                                                           /* 0x001fc800fec007f5 */
        /*0028*/                   SHR.U32 R9, R0, 0x1e ;                  /* 0x3828000001e70009 */
        /*0030*/                   IADD R4.CC, R8, c[0x0][0x140] ;         /* 0x4c10800005070804 */
        /*0038*/                   IADD.X R5, R9, c[0x0][0x144] ;          /* 0x4c10080005170905 */
                                                                           /* 0x001fc820fcc007b5 */
        /*0048*/                   LDG.E.CI R4, [R4] ;                     /* 0xeed4a00000070404 */
        /*0050*/                   FMNMX R0, RZ, R4, PT ;                  /* 0x5c6003800047ff00 */
        /*0058*/                   FMNMX R0, R0, c[0x2][0x0], !PT ;        /* 0x4c60078800070000 */
                                                                           /* 0x001c5801fe00071d */  
        /*0068*/                   F2I.S32.F32.TRUNC R2, R0 ;              /* 0x5cb0018000071a02 */  FROM PATH 1
        /*0070*/         {         SHR R3, R2, 0x1f ;                      /* 0x3829000001f70203 */
        /*0078*/                   F2F.F32.F32.FLOOR R5, R0         }                                FROM PATH 2
                                                                           /* 0x5ca8048000070a05 */ 
                                                                           /* 0x001fd800fec007f6 */
        /*0088*/                   LEA.HI R2, R3, R2, RZ, 0x2 ;            /* 0x5bdf7f8020270302 */
        /*0090*/                   SHR R2, R2, 0x2 ;                       /* 0x3829000000270202 */
        /*0098*/                   IADD32I R2, R2, 0x1 ;                   /* 0x1c00000000170202 */
                                                                           /* 0x001fc800fec007e1 */
        /*00a8*/                   SHR R3, R2, 0x1e ;                      /* 0x3829000001e70203 */
        /*00b0*/                   ISCADD R2.CC, R2, c[0x0][0x148], 0x2 ;  /* 0x4c18810005270202 */
        /*00b8*/                   IADD.X R3, R3, c[0x0][0x14c] ;          /* 0x4c10080005370303 */
                                                                           /* 0x003fc400f62007b1 */
        /*00c8*/                   LDG.E.CI R6, [R2] ;                     /* 0xeed4a00000070206 */
        /*00d0*/                   LDG.E.CI R7, [R2+0x4] ;                 /* 0xeed4a00000470207 */
        /*00d8*/                   FADD R5, R0, -R5 ;                      /* 0x5c58200000570005 */
                                                                           /* 0x001fd800fec007e5 */
        /*00e8*/                   IADD R4.CC, R8, c[0x0][0x150] ;         /* 0x4c10800005470804 */
        /*00f0*/                   FADD R5, R5, -0.5 ;                     /* 0x3958003f00070505 */
        /*00f8*/                   FADD.SAT R5, R5, R5 ;                   /* 0x5c5c000000570505 */
                                                                           /* 0x141fd800fea007f1 */
        /*0108*/                   FMUL R8, R5, R5 ;                       /* 0x5c68000000570508 */
        /*0110*/                   IADD.X R5, R9, c[0x0][0x154] ;          /* 0x4c10080005570905 */
        /*0118*/                   FFMA R6, R6, -R8.reuse, R6 ;            /* 0x5981030000870606 */
                                                                           /* 0x001ffc00fe2007e2 */
        /*0128*/                   FFMA R6, R7, R8, R6 ;                   /* 0x5980030000870706 */
        /*0130*/                   STG.E [R4], R6 ;                        /* 0xeedc200000070406 */
        /*0138*/                   EXIT ;                                  /* 0xe30000000007000f */
                                                                           /* 0x001f8000fc0007ff */
        /*0148*/                   BRA 0x140 ;                             /* 0xe2400fffff07000f */
        /*0150*/                   NOP;                                    /* 0x50b0000000070f00 */
        /*0158*/                   NOP;                                    /* 0x50b0000000070f00 */
                                                                           /* 0x001f8000fc0007e0 */
        /*0168*/                   NOP;                                    /* 0x50b0000000070f00 */
        /*0170*/                   NOP;                                    /* 0x50b0000000070f00 */
        /*0178*/                   NOP;                                    /* 0x50b0000000070f00 */
                ............................

Fatbin ptx code:
================
arch = sm_52
code version = [6,3]
producer = cuda
host = linux
compile_size = 64bit
compressed
$

I’ve added my own annotations above on the right-hand side: FROM PATH 1/2.

The compiler has mixed your two paths, and ordered an instruction from one path almost adjacent to an instruction from the other path.

Due to the limited dual-issue capabilities of the GPU, I don’t think much more can be gained here from “mixing” instructions from the two paths more vigorously. On wide-issue in-order superscalar processors it might be a different story.

In this case the simple change ‘int’ → ‘unsigned int’ for the type of ‘is’ should already result in bigger savings. The reason is that for signed integers, division (rounding the quotient towards zero) is not the same as a right shift (rounding towards negative infinity): -1/2 = 0, but -1 >> 1 = -1.
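To spell out the difference, here is roughly what the compiler must emit in each case (my own hypothetical helpers, mirroring the SHR / LEA.HI / SHR sequence in the SASS above):

__device__ int div4_signed(int x)
{
    // Truncating division toward zero: negative values need a bias of 3
    // before the arithmetic right shift (-1/4 == 0, but -1 >> 2 == -1).
    return (x + ((x >> 31) & 3)) >> 2;
}

__device__ unsigned int div4_unsigned(unsigned int x)
{
    return x >> 2;   // a single shift, no correction needed
}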

Note that the general rule of integers is: all integers should be ‘int’ unless there is a darn good reason for them to be something else. In particular, using ‘unsigned int’ for indexing in a loop may interfere with optimizations due to the guaranteed wrap-around nature of ‘unsigned int’ operations.

[Later:] Sure enough, after I changed the code to this:

__global__ void doLerp(const float* __restrict__ x, const float* __restrict__ y, float* __restrict__ v)
 {
     unsigned int tid = threadIdx.x;
     float s = x[tid];
     s = max(min(s, 1022.99f), 0.0f);
     
     // Path 1: integer part
     int is = (unsigned int)s / 4 + 1;
     float v0 = y[is];
     float v1 = y[is + 1];
     
     // Path 2: float part
     float qs = s - floorf(s) - 0.5f;
     qs = __saturatef(qs * 2.0f);
     qs = qs*qs;
     
     // merge again
     v[tid] = fma(qs, v1, fma(qs, -v0, v0));
 }

I got tighter machine code, as shown below. The instruction scheduling by the compiler seems fine to me. It focuses on getting the loads issued as early as possible, and pulls up the floorf() computation from the “float part”.

code for sm_52
               Function : _Z6doLerpPKfS0_Pf
       .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                     /* 0x083fc400e3e007f6 */
       /*0008*/                   MOV R1, c[0x0][0x20];                              /* 0x4c98078000870001 */
       /*0010*/                   S2R R0, SR_TID.X;                                  /* 0xf0c8000002170000 */
       /*0018*/                   SHL R8, R0.reuse, 0x2;                             /* 0x3848000000270008 */
                                                                                     /* 0x001fc800fec007f5 */
       /*0028*/                   SHR.U32 R10, R0, 0x1e;                             /* 0x3828000001e7000a */
       /*0030*/                   IADD R4.CC, R8, c[0x0][0x140];                     /* 0x4c10800005070804 */
       /*0038*/                   IADD.X R5, R10, c[0x0][0x144];                     /* 0x4c10080005170a05 */
                                                                                     /* 0x001fc808fcc00072 */
       /*0048*/                   LDG.E.CI R0, [R4];                                 /* 0xeed4a00000070400 */
       /*0050*/                   FMNMX R0, R0, c[0x2][0x0], PT;                     /* 0x4c60038800070000 */
       /*0058*/                   FMNMX R0, RZ, R0, !PT;                             /* 0x5c6007800007ff00 */
                                                                                     /* 0x0000d802fe00003d */
       /*0068*/                   F2I.U32.F32.TRUNC R2, R0;                          /* 0x5cb0018000070a02 */
       /*0070*/         {         LEA.HI R2, R2, c[0x2][0x4], RZ, 0x1e;              /* 0x18f77f8800170202 */
       /*0078*/                   F2F.F32.F32.FLOOR R9, R0;        }                 /* 0x5ca8048000070a09 */
                                                                                     /* 0x001fc800fe2207f6 */
       /*0088*/                   LEA R6.CC, R2.reuse, c[0x0][0x148], 0x2;           /* 0x4bd7810005270206 */
       /*0090*/                   LEA.HI.X R3, R2, c[0x0][0x14c], RZ, 0x2;           /* 0x1a177f8005370203 */
       /*0098*/                   MOV R2, R6;                                        /* 0x5c98078000670002 */
                                                                                     /* 0x005f980052200271 */
       /*00a8*/                   LDG.E.CI R6, [R2];                                 /* 0xeed4a00000070206 */
       /*00b0*/                   LDG.E.CI R7, [R2+0x4];                             /* 0xeed4a00000470207 */
       /*00b8*/                   FADD R0, R0, -R9;                                  /* 0x5c58200000970000 */
                                                                                     /* 0x001fc400fec007f6 */
       /*00c8*/                   FADD R0, R0, -0.5;                                 /* 0x3958003f00070000 */
       /*00d0*/                   FADD.SAT R0, R0, R0;                               /* 0x5c5c000000070000 */
       /*00d8*/                   FMUL R0, R0, R0;                                   /* 0x5c68000000070000 */
                                                                                     /* 0x011f9800fe2007f6 */
       /*00e8*/                   IADD R4.CC, R8, c[0x0][0x150];                     /* 0x4c10800005470804 */
       /*00f0*/                   IADD.X R5, R10, c[0x0][0x154];                     /* 0x4c10080005570a05 */
       /*00f8*/                   FFMA R6, R6, -R0, R6;                              /* 0x5981030000070606 */
                                                                                     /* 0x001ffc001e2087f2 */
       /*0108*/                   FFMA R6, R7, R0, R6;                               /* 0x5980030000070706 */
       /*0110*/                   STG.E [R4], R6;                                    /* 0xeedc200000070406 */
       /*0118*/                   EXIT;                                              /* 0xe30000000007000f */
                                                                                     /* 0x001f8000fc0007ff */
       /*0128*/                   BRA 0x120;                                         /* 0xe2400fffff07000f */
       /*0130*/                   NOP;                                               /* 0x50b0000000070f00 */
       /*0138*/                   NOP;                                               /* 0x50b0000000070f00 */
               ..................................

Hmm… It seems the code generation for this kind of case has been improved in the newer version of CUDA. Since the compiler is more or less a black box to end users, I need more time to get familiar with it and figure out which kinds of input yield better code generation, and why others do not.

Haha! Absolutely! Sorry for that mistake~ And thanks for the useful suggestion to make the code more compact. The code was only meant to be compiled so I could inspect the generated SASS, so I never ran it.

Actually, the issue I’m trying to raise here is not how to cover the latency of the global loads, nor how to save a few instructions. The key question is whether the compiler is capable of interleaving instructions from two independent paths, thereby mitigating frequent pipeline stalls due to execution dependencies. When there are enough warps to cover these stalls, or when the global access latency is not fully covered anyway, this may not affect performance. But when you have fewer warps, those stalls become too frequent to keep the functional units busy, eventually costing more than what is needed to cover the global latency, and performance finally drops.

From the compiler’s point of view, if there are enough registers to make the interleaving happen, there is no harm in doing so. That would probably make the program run faster across more launch configurations, especially small blocks without enough warps to cover the arithmetic latency.

You can simply imagine much longer dependency chains than in the code above; the global latency then need not be a concern, but the conclusion still holds.
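Here is the shape of kernel I have in mind (a made-up example, not my actual code): two long, completely independent dependency chains per thread. If each chain is kept in a single register and scheduled back to back, every instruction waits on the previous one; with separate registers and interleaving, one chain can issue while the other’s result is still in flight.

__global__ void twoChains(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a = in[tid];        // chain A
    float b = in[tid + n];    // chain B, independent of chain A (in holds 2*n floats)

    #pragma unroll
    for (int i = 0; i < 32; ++i) {
        a = fmaf(a, 1.000001f, 0.5f);    // each step depends only on the previous a
        b = fmaf(b, 0.999999f, 0.25f);   // each step depends only on the previous b
    }

    out[tid] = a + b;         // the two chains merge only at the very end
}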