PTX jit spills registers in trivial programs

Robert_Crovella · February 3, 2024, 4:40pm

That’s not what I see. Leaving aside the fact that your code has no globally visible side effects and therefore compiles to a do-nothing-interesting kernel with -O3, with -O0 I see not one but two CALL instructions during kernel processing. Perhaps we have different definitions of “inlined”. Even in the -O3 case I see a CALL instruction to an “empty” routine.

However, after fiddling with the JIT cache to inspect the code produced by the “C++ driver” you created, I would say it appears that code was generated with -G. There is no optimization that I can see.

I seem to have better luck with cuModuleLoadDataEx:

# cat t124a.cu
#include <stdio.h>
#include <stdint.h>   // uintptr_t
#include <stdlib.h>   // abort
#include <string.h>
#include <cuda.h>

const char *source = R"(
.version 7.8
.target sm_89
.address_size 64

.func my_func() {
    .reg .b32 %active, %x;
    mov.b32 %x, 0;
    activemask.b32 %active;
    match.any.sync.b32 %x, %x, %active;
    ret;
}

.entry my_kernel(.param .align 4 .b8 params[4]) {
    call my_func, ();
}
)";

void checkCuda(CUresult rv) {
    if (rv != CUDA_SUCCESS) {
        printf("CUDA failure!\n");
        abort();
    }
}

int main(int argc, char **argv) {
    const int num_opt = 6;
    CUjit_option options[num_opt];
    void* optionVals[num_opt];
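    // Start with empty strings so the printf below is safe even if the
    // driver writes nothing into a log buffer.
    char error_log[8192] = "",
         info_log[8192] = "";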
    unsigned int logSize = 8192;

    options[0] = CU_JIT_LOG_VERBOSE;
    optionVals[0] = (void*) 1;
    options[1] = CU_JIT_INFO_LOG_BUFFER;
    optionVals[1] = (void*) info_log;
    options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    optionVals[2] = (void*) (uintptr_t) logSize;
    options[3] = CU_JIT_ERROR_LOG_BUFFER;
    optionVals[3] = (void*) error_log;
    options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
    optionVals[4] = (void*) (uintptr_t) logSize;
    options[5] = CU_JIT_OPTIMIZATION_LEVEL;
    optionVals[5] = (void*)4;

    CUcontext ctx = nullptr;
    checkCuda(cuInit(0));
    checkCuda(cuDevicePrimaryCtxRetain(&ctx, 0));
    checkCuda(cuCtxSetCurrent(ctx));
#if 0
    void *cuOut;
    size_t outSize;
    CUlinkState lState;
    checkCuda(cuLinkCreate(num_opt,options, optionVals, &lState));

    checkCuda(cuLinkAddData(lState, CU_JIT_INPUT_PTX, (void *) source,
                            strlen(source) + 1, 0, 0, 0, 0));
    checkCuda(cuLinkComplete(lState, &cuOut, &outSize));
#else
    CUmodule m;
    checkCuda(cuModuleLoadDataEx(&m, (void *)source, num_opt, options, optionVals));
#endif
    printf("%s\n%s", error_log, info_log);
    printf("\n");
    return 0;
}
# nvcc -o t124a t124a.cu -lcuda
# rm -Rf ~/.nv/ComputeCache/*
# ./t124a

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info    : Function properties for my_kernel
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 6 registers, 356 bytes cmem[0]
ptxas info    : Function properties for my_func
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
#

I confirmed this is an optimized “empty” kernel. There might still be differences between the JIT and offline compilation methods in general, but for the cuModuleLoadDataEx method I doubt there are any in this specific case: if I compile the code with ptxas and -O4, I get the same 6-register, empty kernel. I guess I would drop the notion that the two are comparable. However, I acknowledge there seems to be an unanswered question as to the optimization of the method you provided. I don’t have an answer to that.
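
If you want to compare the register and spill numbers from the JIT info log against offline ptxas output without reading them by eye, something like the following might do. This is only a sketch: PtxasStats and parse_ptxas_log are names I made up, and it assumes the log keeps the "Used N registers" and "N bytes spill stores, N bytes spill loads" wording shown in the output above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper type (not part of any CUDA API).
struct PtxasStats {
    int registers = -1;        // -1 means no "Used N registers" line was found
    int spill_store_bytes = 0; // summed over every function in the log
    int spill_load_bytes = 0;  // summed over every function in the log
};

// Parses ptxas-style info text, assuming the wording seen in the log above.
// Only the first "Used N registers" line is reported.
PtxasStats parse_ptxas_log(const std::string& log) {
    PtxasStats s;
    std::smatch m;
    const std::regex used(R"(Used (\d+) registers)");
    if (std::regex_search(log, m, used)) {
        s.registers = std::stoi(m[1].str());
    }
    const std::regex spill(R"((\d+) bytes spill stores, (\d+) bytes spill loads)");
    for (std::sregex_iterator it(log.begin(), log.end(), spill), end; it != end; ++it) {
        s.spill_store_bytes += std::stoi((*it)[1].str());
        s.spill_load_bytes += std::stoi((*it)[2].str());
    }
    return s;
}
```

Feeding it the info_log buffer from the program above, and separately the text captured from an offline ptxas run, gives two PtxasStats values that can be compared field by field.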

For reference, here is how I inspected the JIT-generated SASS code. Obviously all of this is subject to change; I was using CUDA 12.2/driver 535.86.10:

The JIT cache is located by default at ~/.nv/ComputeCache.
I don’t fully understand the directory or index structure, so my approach is to clear the JIT cache before doing an experiment. rm -Rf ~/.nv/ComputeCache/* serves that purpose.
Generate a JIT-ed object, perhaps by running your “C++ driver” as depicted above.
The JIT cache will now have a single relevant directory chain in it, containing a single file with a numerical name. In my case the generated file was in the directory ~/.nv/ComputeCache/2/0 and the file name was 35efeb93e6f6cc. This file is not directly consumable by the cuobjdump tool. As njuffa indicates in the previously linked article, there is a “preamble” followed by an ELF-formatted object. The ELF-formatted object is understandable by cuobjdump, so the objective is to remove the preamble. The preamble seems to consist (mainly) of the PTX code in text form, eventually followed by the ELF header, the first several characters of which are (as viewed in the vi editor): ^?ELF (note that the ? character seems to vary, such that it may be a space, or perhaps some other character).
Using vi, I was able to delete the lines and characters prior to that header, then save the file. At that point you can use cuobjdump -sass ... on the file to see the JIT-ed SASS code.
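
The manual vi editing above could probably be scripted. Here is a minimal sketch, assuming the ELF object starts at the first occurrence of the standard ELF magic bytes (0x7F followed by "ELF") and that nothing earlier in the preamble happens to contain that byte sequence. The function names find_elf_offset and strip_preamble are my own, and the cache file layout is undocumented, so treat this as a starting point rather than a reliable tool.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Returns the offset of the first ELF magic (0x7F 'E' 'L' 'F') in `blob`,
// or std::string::npos if it is not present.
std::size_t find_elf_offset(const std::string& blob) {
    // The literal is split so that 'E' is not parsed as part of the hex escape.
    static const std::string magic("\x7f" "ELF", 4);
    return blob.find(magic);
}

// Copies everything from the ELF magic onward from `in_path` to `out_path`.
// Returns false if a file cannot be opened or no ELF magic is found.
bool strip_preamble(const std::string& in_path, const std::string& out_path) {
    std::ifstream in(in_path, std::ios::binary);
    if (!in) return false;
    const std::string blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    const std::size_t off = find_elf_offset(blob);
    if (off == std::string::npos) return false;
    assert(off <= blob.size());
    std::ofstream out(out_path, std::ios::binary);
    if (!out) return false;
    out.write(blob.data() + off,
              static_cast<std::streamsize>(blob.size() - off));
    return static_cast<bool>(out);
}
```

Calling strip_preamble on a copy of the cache file and then running cuobjdump -sass on the output file would correspond to the last step above. Working on a copy keeps the cache entry itself untouched.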

(There are other items in the JIT object that I am curious about. When I use cuModuleLoadDataEx, immediately prior to the beginning of the ELF formatted section, I see -O4 in plain text. When I use the cuLinkAddData method, I see -c in the same spot.)
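
To look at that spot without opening the file in an editor, a small helper can return the printable text sitting directly in front of the ELF header. Again a sketch under assumptions: text_before_elf is a made-up name, the ELF object is assumed to begin at the first 0x7F "ELF" sequence, and the scan simply stops at the first non-printable byte (or after max_len characters), which may or may not line up with how the cache file actually delimits its fields.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns up to `max_len` printable characters that immediately precede the
// first ELF magic in `blob`. Returns an empty string if no magic is found.
std::string text_before_elf(const std::string& blob, std::size_t max_len) {
    const std::size_t off = blob.find(std::string("\x7f" "ELF", 4));
    if (off == std::string::npos) return "";
    std::size_t start = off;
    // Walk backwards while the previous byte is printable and the limit allows.
    while (start > 0 && off - start < max_len &&
           std::isprint(static_cast<unsigned char>(blob[start - 1]))) {
        --start;
    }
    return blob.substr(start, off - start);
}
```

Given the contents of the cache file read into a std::string, text_before_elf(contents, 16) would be expected to surface strings like the -O4 or -c mentioned above, provided they are preceded by a non-printable byte.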
