Miscompilation of indirect call with an explicitly specified list of call targets (PTX)

(This issue is unrelated to my previous post, which was more of a general question)

Dear all,

I’ve run into a miscompilation in the PTX → SASS JIT compiler when using indirect calls with an explicitly specified list of call targets. I’ve attached a sample PTX file along with the generated SASS output demonstrating the issue: the indirectly called functions clobber registers that are live in the caller. I suspect a bug in the compiler’s inference of custom calling conventions, which is the main benefit of specifying a list of call targets in the first place.

repro.cubin (48.1 KB) repro.ptx (29.3 KB) repro.sass (283.5 KB)

This is the smallest reproducer I was able to find – in my experiments, a program of sufficient size that uses a sufficiently large number of registers seems to be a prerequisite for triggering the problem.

The reproducer contains two callees (“func_…”) and one kernel (“enoki_…”) with a single grid-stride loop. The details of this particular code don’t matter much – I want to direct your attention to one specific variable in the attached sample: the counter %r0 in “enoki_…”, which implements the grid-stride loop and steps through the list of elements to be processed. In the PTX code, this variable is used purely for counting and should not be affected by any other code.

When compiled to SASS via NVRTC, this program crashes because of garbage values in the counter. In the attached SASS dump, variable %r0 maps to R51. Everything is fine in the kernel body, but you can see that the other functions overwrite this register with invalid contents – there is even an instruction “FADD.FTZ R51, -R88, R51” that deposits a floating point value there. Importantly, the callees never save the counter to their stack frames.
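
To make the role of this counter concrete, here is the generic grid-stride loop pattern written as CUDA C++ (a sketch with placeholder names, not the actual “enoki_…” kernel):

__global__ void process(float *data, unsigned int n) {
    // 'i' plays the role of the PTX counter %r0 (R51 in the SASS dump):
    // pure loop bookkeeping that should survive any call made in the body
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= 2.0f; // the reproducer performs an indirect call here
    }
}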

I can work around the miscompilation by using .callprototype instead of a list of call targets, but that is of course less satisfying, since the explicit target list is what enables optimizations that reduce the amount of stack memory. Finally: PTX offers two different ways of specifying call targets – via a .global list of function pointers, or via a .calltargets directive. Both appear to lead to incorrect code.
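
For concreteness, the two forms look roughly as follows – a schematic fragment loosely based on the examples in the PTX ISA documentation, with hypothetical function names (this is not the code from repro.ptx), written as the kind of C++ string one would hand to cuLinkAddData:

// Schematic PTX contrasting the two annotation styles (hypothetical names)
const char *ptx_sketch = R"(
    .func (.param .b32 rv) func_a (.param .b32 a);
    .func (.param .b32 rv) func_b (.param .b32 a);

    // Variant 1: explicit target list. This lets the compiler infer a custom
    // calling convention for func_a/func_b, and is what miscompiles for me:
    tgt: .calltargets func_a, func_b;
    //   call (out), %rd2, (in), tgt;

    // Variant 2: opaque prototype. Forces a conservative ABI; this is the
    // workaround mentioned above:
    proto: .callprototype (.param .b32 _) _ (.param .b32 _);
    //   call (out), %rd2, (in), proto;
)";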

Best,
Wenzel

PS: This is on Ubuntu 20.04.1 LTS with CUDA 11.1, driver version 455.45.01, and a Titan RTX card.

P.P.S: “ptxas” and NVRTC appear to generate different code given the same PTX input. It is not 100% clear to me why. For reference, this is how I invoked NVRTC to produce the supplied CUBIN file:

CUjit_option arg[] = {
    CU_JIT_OPTIMIZATION_LEVEL,
    CU_JIT_INFO_LOG_BUFFER,
    CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,
    CU_JIT_ERROR_LOG_BUFFER,
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
    CU_JIT_LOG_VERBOSE,
    CU_JIT_GENERATE_LINE_INFO
};

void *argv[] = {
    (void *) 4,          // optimization level 4
    (void *) info_log,   // info log buffer (caller-provided)
    (void *) log_size,   // size of the info log buffer
    (void *) error_log,  // error log buffer (caller-provided)
    (void *) log_size,   // size of the error log buffer
    (void *) 1,          // enable verbose logging
    (void *) 1           // generate line info
};

CUlinkState link_state;
cuda_check(cuLinkCreate(sizeof(argv) / sizeof(void *), arg, argv, &link_state));
cuda_check(cuLinkAddData(link_state, CU_JIT_INPUT_PTX, (void *) buffer,
                         buffer_size, nullptr, 0, nullptr, nullptr));
cuda_check(cuLinkComplete(link_state, &link_output, &link_output_size));

You may wish to file a bug. Also, CUDA 11.2 came out recently; while it seems unlikely that the issue would manifest any differently there, it may be worth a check.

To be clear, I’m not confirming a defect here. I haven’t studied or run your test case.

Filed under bug 3209799.

The issue persists with CUDA 11.2 and the latest driver (460.27.04).

The QA team handling your bug will likely ask you for a complete repro case, soup to nuts: a complete application that loads the PTX and calls a function from it, along with a description of how to determine pass/fail.

The principal purpose of NVRTC is run-time compilation of CUDA C++ source code to PTX. At that point (with PTX in hand) the driver API functions are used to continue the process of compile and run. NVRTC does not generate SASS. NVRTC API functions all begin with nvrtc. CUDA driver API functions begin with cu (but not cuda - that is the runtime API). The functions you’ve shown are all driver API functions. This isn’t necessarily terribly important, but it may be confusing to refer to NVRTC here.
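
In code, the two stages look roughly like this (a minimal sketch with all error checking omitted, assuming a CUDA context has already been created):

#include <cuda.h>
#include <nvrtc.h>
#include <string>

void compile_and_load(const char *cuda_source) {
    // Stage 1: NVRTC compiles CUDA C++ source to PTX (entry points: nvrtc*)
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, cuda_source, "kernel.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    // Stage 2: the driver API JIT-compiles the PTX to SASS (entry points: cu*,
    // as opposed to cuda*, which is the runtime API)
    CUmodule module;
    cuModuleLoadData(&module, ptx.c_str());
}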

Note that the compiler backend is unlikely to match between online and offline compilation. The JIT version is baked into the driver, and the driver packages get updated every few months. The offline version in ptxas only gets updated whenever there is a CUDA release, so less frequently. In general the release schedules for CUDA and drivers are not in sync, except possibly when support is added for a new architecture.

Since the compiler backend contains machine-specific optimizations including instruction selection, instruction scheduling, and register allocation, differences in the respective backend versions can easily lead to differences in the generated code, though I would not expect them to be dramatic.

@Robert_Crovella: right, my bad – I used “NVRTC” as a synonym for the PTX → SASS transformation, but I realize now that these are completely different things.

It might be better for the QA team to first try to reproduce the issue with the provided files. The kernel is relatively small, and I believe the generated code is obviously wrong here. The issue appears in a large application that JIT-compiles kernels on the fly, so adding all of that complexity may make things harder rather than easier to debug. That said, all the relevant code is on GitHub and compiles out of the box on Linux (Ubuntu) with CMake, so let me know if you need instructions.