PTX JIT spills registers in trivial programs

Dear NVIDIA team,

I’ve been chasing down bad performance in one of my CUDA kernels and tracked it to a really odd behavior of the PTX JIT compared to ptxas. I know much has been said about how ptxas and the PTX JIT are usually similar but not guaranteed to be identical – that is all fine. However, in this case the JIT is behaving in a truly bizarre way, and something seems deeply wrong:

I was able to boil the example down to a setup where a one-line kernel calls a single function once.

.version 7.8
.target sm_89
.address_size 64

.func my_func() {
    .reg .b32 %active, %x;
    mov.b32 %x, 0;
    activemask.b32 %active;
    match.any.sync.b32 %x, %x, %active;
    ret;
}

.entry my_kernel(.param .align 4 .b8 params[4]) {
    call.uni my_func, ();
}

This should be an easy case – the function will be inlined. And indeed, ptxas on the command line does that basically regardless of optimization level, even at -O0.

$ /usr/local/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:13:51_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

$ /usr/local/cuda/bin/ptxas  --gpu-name sm_89 test.ptx -O0 -v
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info    : Function properties for my_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 7 registers, 356 bytes cmem[0]
ptxas info    : Function properties for my_func
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

Here is a simple C++ driver that instead tries to do the same thing with the JIT:

#include <stdio.h>
#include <cuda.h>
#include <string.h>

const char *source = R"(
.version 7.8
.target sm_89
.address_size 64

.func my_func() {
    .reg .b32 %active, %x;
    mov.b32 %x, 0;
    activemask.b32 %active;
    match.any.sync.b32 %x, %x, %active;
    ret;
}

.entry my_kernel(.param .align 4 .b8 params[4]) {
    call my_func, ();
}
)";

void checkCuda(CUresult rv) {
    if (rv != CUDA_SUCCESS) {
        printf("CUDA failure!\n");
        abort();
    }
}

int main(int argc, char **argv) {
    CUjit_option options[5];
    void* optionVals[5];
    char error_log[8192],
         info_log[8192];
    unsigned int logSize = 8192;
    void *cuOut;
    size_t outSize;

    options[0] = CU_JIT_LOG_VERBOSE;
    optionVals[0] = (void*) 1;
    options[1] = CU_JIT_INFO_LOG_BUFFER;
    optionVals[1] = (void*) info_log;
    options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    optionVals[2] = (void*) (uintptr_t) logSize;
    options[3] = CU_JIT_ERROR_LOG_BUFFER;
    optionVals[3] = (void*) error_log;
    options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
    optionVals[4] = (void*) (uintptr_t) logSize;

    CUcontext ctx = nullptr;
    checkCuda(cuInit(0));
    checkCuda(cuDevicePrimaryCtxRetain(&ctx, 0));
    checkCuda(cuCtxSetCurrent(ctx));
    CUlinkState lState;

    checkCuda(cuLinkCreate(5,options, optionVals, &lState));

    checkCuda(cuLinkAddData(lState, CU_JIT_INPUT_PTX, (void *) source,
                            strlen(source) + 1, 0, 0, 0, 0));
    checkCuda(cuLinkComplete(lState, &cuOut, &outSize));
    printf("%s\n%s", error_log, info_log);

    return 0;
}

This instead prints

ptxas info    : 0 bytes gmem
ptxas info    : Function properties for my_func
ptxas         .     8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info    : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info    : Function properties for my_kernel
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 356 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'my_kernel':
info    : used 24 registers, 8 stack, 0 bytes smem, 356 bytes cmem[0], 0 bytes lmem%

In particular, note the line “8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads”.

What gives?! Is it a bug, or am I doing something else wrong? I would be grateful for any advice.

Here is the output of nvidia-smi in case this is relevant:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0 Off |                  Off |
|  0%   47C    P8              28W / 450W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

That’s not what I see. Leaving aside the fact that your code has no globally visible side effects and therefore compiles to a do-nothing kernel at -O3: at -O0 I see not one but two CALL instructions during kernel processing. Perhaps we have different definitions of “inlined”. Even in the -O3 case I see a CALL instruction to an “empty” routine.

However, after fiddling with the JIT cache to inspect the code produced by the “C++ driver” you created, I would say it appears that code was generated with -G. There is no optimization that I can see.

I seem to have better luck with cuModuleLoadDataEx:

# cat t124a.cu
#include <stdio.h>
#include <cuda.h>
#include <string.h>

const char *source = R"(
.version 7.8
.target sm_89
.address_size 64

.func my_func() {
    .reg .b32 %active, %x;
    mov.b32 %x, 0;
    activemask.b32 %active;
    match.any.sync.b32 %x, %x, %active;
    ret;
}

.entry my_kernel(.param .align 4 .b8 params[4]) {
    call my_func, ();
}
)";

void checkCuda(CUresult rv) {
    if (rv != CUDA_SUCCESS) {
        printf("CUDA failure!\n");
        abort();
    }
}

int main(int argc, char **argv) {
    const int num_opt = 6;
    CUjit_option options[num_opt];
    void* optionVals[num_opt];
    char error_log[8192],
         info_log[8192];
    unsigned int logSize = 8192;

    options[0] = CU_JIT_LOG_VERBOSE;
    optionVals[0] = (void*) 1;
    options[1] = CU_JIT_INFO_LOG_BUFFER;
    optionVals[1] = (void*) info_log;
    options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    optionVals[2] = (void*) (uintptr_t) logSize;
    options[3] = CU_JIT_ERROR_LOG_BUFFER;
    optionVals[3] = (void*) error_log;
    options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
    optionVals[4] = (void*) (uintptr_t) logSize;
    options[5] = CU_JIT_OPTIMIZATION_LEVEL;
    optionVals[5] = (void*)4;

    CUcontext ctx = nullptr;
    checkCuda(cuInit(0));
    checkCuda(cuDevicePrimaryCtxRetain(&ctx, 0));
    checkCuda(cuCtxSetCurrent(ctx));
#if 0
    void *cuOut;
    size_t outSize;
    CUlinkState lState;
    checkCuda(cuLinkCreate(num_opt,options, optionVals, &lState));

    checkCuda(cuLinkAddData(lState, CU_JIT_INPUT_PTX, (void *) source,
                            strlen(source) + 1, 0, 0, 0, 0));
    checkCuda(cuLinkComplete(lState, &cuOut, &outSize));
#else
    CUmodule m;
    checkCuda(cuModuleLoadDataEx(&m, (void *)source, num_opt, options, optionVals));
#endif
    printf("%s\n%s", error_log, info_log);
    printf("\n");
    return 0;
}
# nvcc -o t124a t124a.cu -lcuda
# rm -Rf ~/.nv/ComputeCache/*
# ./t124a

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info    : Function properties for my_kernel
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 6 registers, 356 bytes cmem[0]
ptxas info    : Function properties for my_func
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
#

I confirmed this is an optimized “empty” kernel, so there might still be differences between the JIT and offline compilation methods (for the cuModuleLoadDataEx method, I am doubtful there are differences in this specific case: if I compile the code with ptxas at -O4, I get the same 6-register, empty kernel). I guess I would drop the notion that the two are comparable. However, I acknowledge there is still an unanswered question as to the optimization of the method you provided. I don’t have an answer to that.

For reference, here is how I inspected the JIT-generated SASS code. Obviously all of this is subject to change, I was using CUDA 12.2/driver 535.86.10:

  1. The jit cache by default is located at ~/.nv/ComputeCache.
  2. I don’t fully understand the directory or index structure, so my approach is to clear the jit cache before doing an experiment. rm -Rf ~/.nv/ComputeCache/* serves that purpose.
  3. generate a JIT-ed object, perhaps via running your “C++ driver” as depicted above.
  4. The JIT cache will now contain a single relevant directory chain, holding a single file with a numerical name. In my case the generated file was in the directory ~/.nv/ComputeCache/2/0 and the file name was 35efeb93e6f6cc. This file is not directly consumable by the cuobjdump tool. As njuffa indicates in the previously linked article, there is a “preamble” followed by an ELF-formatted object. The ELF-formatted object is understood by cuobjdump, so the objective is to remove the preamble. The preamble seems to consist (mainly) of the PTX code in text form, eventually followed by the ELF header, the first several characters of which appear in the vi editor as ^?ELF (the ? character seems to vary; it may be a space, or perhaps some other character).
  5. Using vi, I was able to delete lines and characters prior to that header, then save the file. At that point you can use cuobjdump -sass ... on the file, to see the JIT-ed SASS code.

(There are other items in the JIT object that I am curious about. When I use cuModuleLoadDataEx, immediately prior to the beginning of the ELF formatted section, I see -O4 in plain text. When I use the cuLinkAddData method, I see -c in the same spot.)

After a little more study, I believe that the cuLinkAddData method is effectively specifying -rdc=true (which makes sense to me - it was added to the driver API when relocatable device code with device linking was added to CUDA). The cuModuleLoadDataEx method is not.

In my view, ptxas is not really sufficiently/fully documented to make it an end-user tool, but I don’t wish to argue about that. It’s not necessary to use ptxas to directly compile ptx code, however; nvcc is capable of that.

When I use nvcc to compile your ptx code, I get the “fully optimized” case (what you get with cuModuleLoadDataEx), but when I specify -rdc=true, I get the output you see with the cuLinkAddData method:

# nvcc -cubin t3.ptx -Xptxas=-v -arch=sm_89
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info    : Function properties for my_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 6 registers, 356 bytes cmem[0]
ptxas info    : Function properties for my_func
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
# nvcc -rdc=true -cubin t3.ptx -Xptxas=-v -arch=sm_89
ptxas info    : 0 bytes gmem
ptxas info    : Function properties for my_func
    8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info    : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info    : Function properties for my_kernel
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 356 bytes cmem[0]
#

So I think it may be the case that the cuLinkAddData method is intentionally not “inlining” (using that term loosely) because it wants to preserve all entry points that you have indicated in your PTX.

I suppose it might be possible to use LTO to work around this, but that is an exercise for another time.

Dear @Robert_Crovella,

this was extremely helpful. Thank you very much for your detailed investigation.

For added context: my project is a just-in-time compiler (Dr.Jit) that directly generates PTX. That PTX code has no C/C++ counterpart. The project should also work on machines with just a graphics driver (i.e. no CUDA SDK installed). So nvcc/ptxas aren’t available, and the cuLink* API is all I have to work with.

I can now see how cuLinkAddData specifies -rdc=true as this interface is meant to combine multiple compilation units. Luckily my JIT project doesn’t need this at all. So it seems I’ve been using the wrong API this whole time.

Good point about this being a somewhat nonsensical kernel (no side effects). I was just playing with the original cuLinkAddData compilation, and removing further PTX instructions somehow changed the linker behavior and eliminated the reported spills. But this is a moot point now that the issue is figured out.

Thank you again,
Wenzel

Dear @Robert_Crovella,

I remembered now that there was a reason not to use cuModuleLoadDataEx.

My project calls cuLinkCreate, cuLinkAddData, and cuLinkComplete to generate a CUBIN representation. It then caches this CUBIN data for future use. cuModuleLoadDataEx can also be used to compile and load PTX, but it does not provide access to the compiled output.
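For concreteness, the caching flow is roughly the following sketch (error handling elided; one subtlety is that the buffer returned by cuLinkComplete is owned by the link state and invalidated by cuLinkDestroy, so it must be copied out first):

```cpp
#include <cuda.h>
#include <cstring>
#include <vector>

// Sketch: compile PTX to a CUBIN blob that can be cached on disk,
// assuming a current CUDA context has already been set up.
std::vector<char> ptx_to_cubin(const char *ptx) {
    CUlinkState state;
    cuLinkCreate(0, nullptr, nullptr, &state);
    cuLinkAddData(state, CU_JIT_INPUT_PTX, (void *) ptx,
                  strlen(ptx) + 1, nullptr, 0, nullptr, nullptr);

    void *cubin; size_t size;
    cuLinkComplete(state, &cubin, &size);

    // 'cubin' belongs to the link state -- copy it before destroying.
    std::vector<char> copy((char *) cubin, (char *) cubin + size);
    cuLinkDestroy(state);   // invalidates 'cubin'
    return copy;            // safe to write to the project's cache
}

// On a cache hit, later runs skip the PTX JIT entirely:
//   CUmodule mod;
//   cuModuleLoadData(&mod, cached_cubin.data());
```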

I can already imagine what you will say next :-).

“Why would you create some kind of caching infrastructure around PTX->CUBIN compilation, when the NVIDIA tooling already does this automatically when using cuModuleLoadDataEx?” (via the .nv/ComputeCache directory you had mentioned)

The reason is that the project in question is another full-blown JIT compiler with its own caching infrastructure, which keeps track not only of NVIDIA-specific compilation inputs/outputs but also of its own internal state. I want to get the CUBIN data out so that everything can be co-located, ensuring that a hit in our cache is always guaranteed to be fast, without potentially requiring another recompilation by the NVIDIA PTX JIT.

Which brings me to the next question: is there a way to use the cuLinkCreate, cuLinkAddData, and cuLinkComplete functions but without having them internally specifying the equivalent of -rdc=true to the PTX-JIT?

Or is there some form of API where I can retroactively extract the CUBIN representation from a CUmodule? That way, I could retain the existing functionality with cuModuleLoadDataEx.

Thank you very much!

This is huge! My benchmark kernel goes from 343 ms to 222 ms just by compiling with cuModuleLoadDataEx instead of the cuLink* API. The implicit -rdc=true flag seems to have a severe impact on runtime performance even when only a single compilation unit is added via cuLinkAddData.

You can decouple compilation from loading using the ptx compiler APIs. This means that you can take PTX and compile it, then save the compiled object. You can also load the compiled object using cuModuleLoadDataEx.

One of the stated purposes is custom caching.
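A rough sketch of that flow with the standalone PTX compiler library (nvPTXCompiler, shipped with the CUDA toolkit) might look like the following; error checking is omitted, and the sm_89 target is just carried over from the example above:

```cpp
#include <nvPTXCompiler.h>
#include <cstring>
#include <vector>

// Sketch: compile PTX to a cubin with the nvPTXCompiler library,
// independently of any CUDA context or GPU. The resulting blob can be
// cached and later loaded with cuModuleLoadData/cuModuleLoadDataEx.
// Link against the static library (e.g. -lnvptxcompiler_static).
std::vector<char> compile_ptx(const char *ptx) {
    nvPTXCompilerHandle h;
    nvPTXCompilerCreate(&h, strlen(ptx), ptx);

    const char *opts[] = { "--gpu-name=sm_89", "--verbose" };
    nvPTXCompilerCompile(h, 2, opts);

    size_t size;
    nvPTXCompilerGetCompiledProgramSize(h, &size);
    std::vector<char> cubin(size);
    nvPTXCompilerGetCompiledProgram(h, cubin.data());

    nvPTXCompilerDestroy(&h);
    return cubin;
}
```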

This looks interesting but has some major downsides. Linking to the PTX compiler would require shipping ~50MB of static library code to users.

My project is designed to have no dependence on the CUDA SDK on either the developer or the user side: it resolves libcuda.so dynamically and can therefore only use the driver API. I realize that this is a self-imposed limitation, but it is one that makes sense for this particular project due to its distribution mechanism on PyPI.
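For reference, that dynamic resolution looks roughly like the sketch below (the actual set of entry points my project loads is of course much larger, and the fallback behavior shown here is hypothetical):

```cpp
#include <dlfcn.h>
#include <cstdio>

// Sketch: resolve the CUDA driver API at runtime so that neither the
// build machine nor the user machine needs the CUDA SDK -- only the
// graphics driver, which ships libcuda.so.1.
typedef int (*cuInit_t)(unsigned int);

int main() {
    void *lib = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
    if (!lib) {
        // No driver installed: fall back gracefully (e.g. CPU backend).
        std::fprintf(stderr, "CUDA driver not found: %s\n", dlerror());
        return 1;
    }
    cuInit_t cuInit_ = (cuInit_t) dlsym(lib, "cuInit");
    if (cuInit_)
        std::fprintf(stderr, "cuInit returned %d\n", cuInit_(0));
    dlclose(lib);
    return 0;
}
```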