Dear NVIDIA team,
I’ve been chasing down bad performance in one of my CUDA kernels and tracked it to a really odd behavior of the PTX JIT compared to ptxas. I know much has been said about how ptxas and the PTX JIT are usually similar but not guaranteed to be the same, and that is all well. However, in this case, I feel that the JIT is behaving in a really bizarre way, and something is deeply wrong:
I was able to crunch down the example to a setup where a 1-line kernel calls a single function once.
.version 7.8
.target sm_89
.address_size 64
.func my_func() {
.reg .b32 %active, %x;
mov.b32 %x, 0;
activemask.b32 %active;
match.any.sync.b32 %x, %x, %active;
ret;
}
.entry my_kernel(.param .align 4 .b8 params[4]) {
call.uni my_func, ();
}
Should be an easy case: the function will be inlined. And indeed, ptxas on the command line does that basically regardless of optimization parameters, even at -O0.
$ /usr/local/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:13:51_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
$ /usr/local/cuda/bin/ptxas --gpu-name sm_89 test.ptx -O0 -v
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info : Function properties for my_kernel
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 7 registers, 356 bytes cmem[0]
ptxas info : Function properties for my_func
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Here is a simple C++ driver that instead tries to do the same thing with the JIT:
#include <stdio.h>
#include <stdlib.h>  // abort
#include <stdint.h>  // uintptr_t
#include <string.h>
#include <cuda.h>
const char *source = R"(
.version 7.8
.target sm_89
.address_size 64
.func my_func() {
.reg .b32 %active, %x;
mov.b32 %x, 0;
activemask.b32 %active;
match.any.sync.b32 %x, %x, %active;
ret;
}
.entry my_kernel(.param .align 4 .b8 params[4]) {
call my_func, ();
}
)";
void checkCuda(CUresult rv) {
    if (rv != CUDA_SUCCESS) {
        printf("CUDA failure!\n");
        abort();
    }
}
int main(int argc, char **argv) {
    CUjit_option options[5];
    void* optionVals[5];
    // Zero-initialize the logs so printing them is safe even if the JIT writes nothing.
    char error_log[8192] = "",
         info_log[8192] = "";
    unsigned int logSize = 8192;
    void *cuOut;
    size_t outSize;
    options[0] = CU_JIT_LOG_VERBOSE;
    optionVals[0] = (void*) 1;
    options[1] = CU_JIT_INFO_LOG_BUFFER;
    optionVals[1] = (void*) info_log;
    options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    optionVals[2] = (void*) (uintptr_t) logSize;
    options[3] = CU_JIT_ERROR_LOG_BUFFER;
    optionVals[3] = (void*) error_log;
    options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
    optionVals[4] = (void*) (uintptr_t) logSize;
    CUcontext ctx = nullptr;
    checkCuda(cuInit(0));
    checkCuda(cuDevicePrimaryCtxRetain(&ctx, 0));
    checkCuda(cuCtxSetCurrent(ctx));
    CUlinkState lState;
    checkCuda(cuLinkCreate(5, options, optionVals, &lState));
    checkCuda(cuLinkAddData(lState, CU_JIT_INPUT_PTX, (void *) source,
                            strlen(source) + 1, 0, 0, 0, 0));
    checkCuda(cuLinkComplete(lState, &cuOut, &outSize));
    printf("%s\n%s", error_log, info_log);
    return 0;
}
This instead prints:
ptxas info : 0 bytes gmem
ptxas info : Function properties for my_func
ptxas . 8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info : Function properties for my_kernel
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 356 bytes cmem[0]
info : 0 bytes gmem
info : Function properties for 'my_kernel':
info : used 24 registers, 8 stack, 0 bytes smem, 356 bytes cmem[0], 0 bytes lmem
In particular, note the line "8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads" reported for my_func, and the jump from 7 to 24 registers.
What gives?! Is it a bug, or am I doing something else wrong? I would be grateful for any advice.
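In case it helps anyone reproduce the comparison, here is a minimal sketch of a helper that scans a verbose log (from ptxas -v or from the JIT info log buffer) and flags any function with a non-zero stack frame or spills. It assumes the "N bytes stack frame, N bytes spill stores, N bytes spill loads" line format shown in the outputs above; the names SpillStats, parse_spill_stats and has_spills are my own and not part of any CUDA API.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one entry per "stack frame" line found in the log.
struct SpillStats {
    int stack_frame;
    int spill_stores;
    int spill_loads;
};

// Extract the stack/spill numbers from every matching line of a verbose log.
// Assumes the line format printed by ptxas -v and the JIT info log above.
std::vector<SpillStats> parse_spill_stats(const std::string &log) {
    std::vector<SpillStats> out;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line)) {
        std::size_t pos = line.find(" bytes stack frame");
        if (pos == std::string::npos)
            continue;
        // Walk back over the digits that precede " bytes stack frame".
        std::size_t start = pos;
        while (start > 0 &&
               std::isdigit(static_cast<unsigned char>(line[start - 1])))
            --start;
        SpillStats s{};
        int n = std::sscanf(line.c_str() + start,
                            "%d bytes stack frame, %d bytes spill stores, "
                            "%d bytes spill loads",
                            &s.stack_frame, &s.spill_stores, &s.spill_loads);
        if (n == 3)
            out.push_back(s);
    }
    return out;
}

// True if any function in the log uses a stack frame or spills registers.
bool has_spills(const std::string &log) {
    for (const SpillStats &s : parse_spill_stats(log))
        if (s.stack_frame > 0 || s.spill_stores > 0 || s.spill_loads > 0)
            return true;
    return false;
}
```

In the driver above, calling has_spills(info_log) after cuLinkComplete would be one way to turn this into an automated check.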
Here is the output of nvidia-smi in case this is relevant:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 0% 47C P8 28W / 450W | 3MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+