That’s not what I see. Leaving aside the fact that your code has no globally visible side effects and therefore compiles to a do-nothing-interesting kernel with -O3, with -O0 I see not one but two CALL instructions during kernel processing. Perhaps we have different definitions of “inlined”. Even in the -O3 case I see a CALL instruction to an “empty” routine.
However, after fiddling with the JIT cache to inspect the code produced by the “C++ driver” you created, I would say it appears that code was generated with -G. There is no optimization that I can see.
I seem to have better luck with cuModuleLoadDataEx:
# cat t124a.cu
#include <stdio.h>
#include <stdlib.h>   // abort
#include <stdint.h>   // uintptr_t
#include <cuda.h>
#include <string.h>
const char *source = R"(
.version 7.8
.target sm_89
.address_size 64
.func my_func() {
.reg .b32 %active, %x;
mov.b32 %x, 0;
activemask.b32 %active;
match.any.sync.b32 %x, %x, %active;
ret;
}
.entry my_kernel(.param .align 4 .b8 params[4]) {
call my_func, ();
}
)";
void checkCuda(CUresult rv) {
  if (rv != CUDA_SUCCESS) {
    printf("CUDA failure!\n");
    abort();
  }
}
int main(int argc, char **argv) {
  const int num_opt = 6;
  CUjit_option options[num_opt];
  void* optionVals[num_opt];
  // start with empty strings so the printf below is safe even if nothing is logged
  char error_log[8192] = "",
       info_log[8192] = "";
  unsigned int logSize = 8192;
  options[0] = CU_JIT_LOG_VERBOSE;
  optionVals[0] = (void*) 1;
  options[1] = CU_JIT_INFO_LOG_BUFFER;
  optionVals[1] = (void*) info_log;
  options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
  optionVals[2] = (void*) (uintptr_t) logSize;
  options[3] = CU_JIT_ERROR_LOG_BUFFER;
  optionVals[3] = (void*) error_log;
  options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
  optionVals[4] = (void*) (uintptr_t) logSize;
  options[5] = CU_JIT_OPTIMIZATION_LEVEL;
  optionVals[5] = (void*)4;
  CUcontext ctx = nullptr;
  checkCuda(cuInit(0));
  checkCuda(cuDevicePrimaryCtxRetain(&ctx, 0));
  checkCuda(cuCtxSetCurrent(ctx));
#if 0
  // cuLinkAddData path
  void *cuOut;
  size_t outSize;
  CUlinkState lState;
  checkCuda(cuLinkCreate(num_opt, options, optionVals, &lState));
  checkCuda(cuLinkAddData(lState, CU_JIT_INPUT_PTX, (void *) source,
                          strlen(source) + 1, 0, 0, 0, 0));
  checkCuda(cuLinkComplete(lState, &cuOut, &outSize));
#else
  // cuModuleLoadDataEx path
  CUmodule m;
  checkCuda(cuModuleLoadDataEx(&m, (void *)source, num_opt, options, optionVals));
#endif
  printf("%s\n%s", error_log, info_log);
  printf("\n");
  return 0;
}
# nvcc -o t124a t124a.cu -lcuda
# rm -Rf ~/.nv/ComputeCache/*
# ./t124a
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'my_kernel' for 'sm_89'
ptxas info : Function properties for my_kernel
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 6 registers, 356 bytes cmem[0]
ptxas info : Function properties for my_func
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
#
I confirmed that this is an optimized “empty” kernel. There might still be differences between the JIT and offline compilation methods in general, so I guess I would drop the notion that the two are comparable. For the cuModuleLoadDataEx method, though, I doubt there are differences in this specific case: if I compile the code with ptxas and -O4, I get the same empty kernel using 6 registers. However, I acknowledge there seems to be an unanswered question as to the optimization of the method you provided. I don’t have an answer to that.
For reference, here is how I inspected the JIT-generated SASS code. Obviously all of this is subject to change; I was using CUDA 12.2/driver 535.86.10:
- The jit cache by default is located at ~/.nv/ComputeCache.
- I don’t fully understand the directory or index structure, so my approach is to clear the jit cache before doing an experiment. rm -Rf ~/.nv/ComputeCache/* serves that purpose.
- Generate a JIT-ed object, perhaps via running your “C++ driver” as depicted above.
- The jit cache will now have a single relevant directory chain in it, that has a single file in it with a numerical name. In my case the generated file was in the directory ~/.nv/ComputeCache/2/0 and the file name was 35efeb93e6f6cc. This file is not directly consumable by the cuobjdump tool. As njuffa indicates in the previous linked article, there is a “preamble” followed by an ELF-formatted object. The ELF-formatted object is understandable by cuobjdump, so the objective is to remove the preamble. The preamble seems to consist (mainly) of the PTX code in text form, eventually followed by the ELF header, the first several characters of which are (as viewed in the vi editor): ^?ELF (note that the ? character seems to vary, such as it may be a space, or perhaps some other character).
- Using vi, I was able to delete lines and characters prior to that header, then save the file. At that point you can use cuobjdump -sass ... on the file, to see the JIT-ed SASS code.
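If hand-editing in vi gets tedious, the preamble removal can probably be scripted. What follows is only a sketch under assumptions, not a documented procedure: it assumes the cache file is "preamble, then one ELF object", and that the first occurrence of the standard ELF magic (the byte 0x7f followed by the letters ELF, which vi typically shows as ^?ELF) marks where the object starts. The function names find_elf_offset, strip_preamble and extract_elf are made up for this example, and the cache layout itself is undocumented and may change.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Offset of the first ELF magic (0x7f 'E' 'L' 'F') in blob, or npos if absent.
// The literal is split so that "\x7f" is not parsed together with the 'E'.
std::size_t find_elf_offset(const std::string& blob) {
    static const std::string magic("\x7f" "ELF", 4);
    return blob.find(magic);
}

// Everything from the first ELF magic onward; empty if no magic was found.
// ASSUMPTION: the JIT cache file is "preamble + one ELF object".
std::string strip_preamble(const std::string& blob) {
    const std::size_t off = find_elf_offset(blob);
    return off == std::string::npos ? std::string() : blob.substr(off);
}

// Reads a cache file and writes its ELF part to out_path.
// Returns false if the input cannot be read, contains no ELF magic,
// or the output cannot be written.
bool extract_elf(const std::string& in_path, const std::string& out_path) {
    std::ifstream in(in_path, std::ios::binary);
    if (!in) return false;
    const std::string blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    const std::string elf = strip_preamble(blob);
    if (elf.empty()) return false;
    std::ofstream out(out_path, std::ios::binary);
    out.write(elf.data(), static_cast<std::streamsize>(elf.size()));
    return static_cast<bool>(out);
}
```

Calling extract_elf on the cached file with some output name of your choosing, then running cuobjdump -sass on that output, should be equivalent to the manual edit, provided the assumptions above hold for your driver version.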
(There are other items in the JIT object that I am curious about. When I use cuModuleLoadDataEx, immediately prior to the beginning of the ELF formatted section, I see -O4 in plain text. When I use the cuLinkAddData method, I see -c in the same spot.)
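To compare that text between the two methods without opening an editor, something like the following might help. It is a rough sketch: text_before_elf is a hypothetical helper, and it assumes the option text sits as a run of printable ASCII characters ending directly at the ELF magic, which is only what I observed in my files and may not hold in general.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the run of printable ASCII characters that ends immediately before
// the first ELF magic (0x7f 'E' 'L' 'F') in blob. Returns an empty string if
// there is no magic or the byte just before it is not printable.
// ASSUMPTION: the option text (e.g. -O4 or -c) directly precedes the ELF header.
std::string text_before_elf(const std::string& blob) {
    const std::string magic("\x7f" "ELF", 4);
    const std::size_t elf = blob.find(magic);
    if (elf == std::string::npos) return std::string();
    std::size_t start = elf;
    // Walk backwards while the preceding byte is printable (space included).
    while (start > 0 &&
           std::isprint(static_cast<unsigned char>(blob[start - 1]))) {
        --start;
    }
    return blob.substr(start, elf - start);
}
```

Feeding it the raw contents of the cache file from each experiment would show whether the two runs recorded different option strings.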