(This issue is unrelated to my previous post, which was more of a general question)
Dear all,
I’ve run into a miscompilation in the PTX → SASS JIT compiler when using indirect calls with an explicitly specified list of call targets. I’ve attached a sample PTX file with generated SASS output demonstrating the issue, which is that the indirectly called functions clobber registers used by the caller. I suspect that there is a bug in the compiler’s inference of calling conventions (which is the main reason for specifying a list of call targets).
repro.cubin (48.1 KB) repro.ptx (29.3 KB) repro.sass (283.5 KB)
This is the smallest reproducer I was able to find in my experiments – in fact, a sufficiently large program with a sufficiently high register count appears to be a prerequisite for triggering the problem.
The reproducer contains two callees (“func_…”) and one kernel (“enoki_…”) with a single grid-stride loop. The details of this particular code don’t matter much – I want to draw your attention to one specific variable: the counter %r0 in “enoki_…”, which implements the grid-stride loop and steps through the list of elements to be processed. In the PTX code, this variable is used purely for counting and should not be affected by any other code.
When compiled to SASS via NVRTC, the program crashes with garbage values in the counter. In the attached SASS dump, %r0 maps to R51. Everything is fine within the kernel body itself, but the indirectly called functions overwrite this register with invalid contents – there is even an instruction “FADD.FTZ R51, -R88, R51” that puts a floating point value into it. Crucially, the callees do not save the counter to their stack frames either.
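For orientation, the loop structure in question looks roughly like this in PTX (register names and labels are simplified by hand here, not copied from the attached file):

```
// %r0: loop counter, %r1: element count, %r2: grid stride
loop_begin:
    setp.ge.u32 %p0, %r0, %r1;   // all elements processed?
    @%p0 bra loop_end;
    // ... loop body with indirect calls; nothing here writes %r0 ...
    add.u32 %r0, %r0, %r2;       // advance counter by the grid stride
    bra loop_begin;
loop_end:
    ret;
```

The miscompilation means that R51 (the SASS register holding %r0) is live across the indirect calls in the loop body, yet the callees treat it as scratch.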
I can work around the miscompilation by using .callprototype instead of a list of call targets, but that is of course less desirable: the whole point of specifying explicit targets is to benefit from optimizations that reduce the amount of stack memory. Finally, PTX offers two different ways of specifying call targets: via a .global list of function pointers, or via a .calltargets directive. Both appear to lead to incorrect code.
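For reference, the constructs in question look like this (function names, registers, and signatures are illustrative, not taken from the attached repro):

```
// Variant 1: an initialized .global array of function pointers.
// ptxas can infer the set of possible indirect-call targets from it.
.global .u64 ftbl[2] = { func_a, func_b };

// Variant 2: a .calltargets directive naming the targets explicitly.
tgt: .calltargets func_a, func_b;
call (out), %rd_fn, (in), tgt;

// Workaround: .callprototype only specifies the signature, which forces
// a conservative ABI-compliant calling convention for the indirect call.
proto: .callprototype (.param .b32 _) _ (.param .b32 _);
call (out), %rd_fn, (in), proto;
```

Both of the first two variants reproduce the register clobbering; only the .callprototype form yields correct code.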
Best,
Wenzel
PS: This is on Ubuntu 20.04.1 LTS with CUDA 11.1, driver version 455.45.01, and a Titan RTX card.
P.P.S: “ptxas” and NVRTC appear to generate different code given the same PTX input. It is not 100% clear to me why. For reference, this is how I invoked NVRTC to produce the supplied CUBIN file:
CUjit_option arg[] = {
    CU_JIT_OPTIMIZATION_LEVEL,
    CU_JIT_INFO_LOG_BUFFER,
    CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,
    CU_JIT_ERROR_LOG_BUFFER,
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
    CU_JIT_LOG_VERBOSE,
    CU_JIT_GENERATE_LINE_INFO
};

void *argv[] = {
    (void *) 4,           // optimization level -O4
    (void *) info_log,
    (void *) log_size,
    (void *) error_log,
    (void *) log_size,
    (void *) 1,           // verbose logging
    (void *) 1            // generate line info
};

CUlinkState link_state;
cuda_check(cuLinkCreate(sizeof(argv) / sizeof(void *), arg, argv, &link_state));
cuda_check(cuLinkAddData(link_state, CU_JIT_INPUT_PTX, (void *) buffer,
                         buffer_size, nullptr, 0, nullptr, nullptr));
cuda_check(cuLinkComplete(link_state, &link_output, &link_output_size));