cuda.jit in Numba faster on first-time startup than jit

Hello,

I have written two accelerated versions of a function, myfunc_jit(), in Numba: one using jit for the CPU and the other using cuda.jit to target the GPU. When I run myfunc_jit() under jit, the first run takes far longer to complete than later runs, which I assume is due to just-in-time compilation happening on the first call. However, when I run myfunc_jit() under cuda.jit, there is no such discrepancy between the initial and later runs. Both versions produce the correct result, and the cuda.jit version is consistently faster by orders of magnitude than the CPU version. Since the first-time startup cost for the cuda.jit function seems negligible, does that mean some compilation has already happened in the background before the function is called, in contrast to the jit version for the CPU?

For the cuda.jit functionality, Numba supports an on-disk JIT cache. When caching is enabled with cache=True on the decorator (it is not on by default), Numba checks the cache for a matching compilation signature before compiling a decorated function. If it finds one, it loads that instead of recompiling, even on the first encounter of the function in your program. So if you've made no code changes, you won't hit JIT compilation again after the very first run of your program.

I’m less familiar with the CPU jit case. It appears that Numba maintains an on-disk cache there as well, also enabled with cache=True, but I had trouble locating up-to-date docs on this topic.

OK, that makes sense. Many thanks for the explanation!