How to speed up JIT compilation?

My CUDA application has a very long load time in the beginning, due to JIT compilation.

And I have restrictions that don’t allow me to build the GPU binary code with NVCC, and I also cannot depend on a JIT cache.

Is there a way to speed up JIT compilation itself? I noticed that it doesn’t use all the CPU cores. Is there a way to make it use them?

That is called an overconstrained problem in engineering terms. I am curious: What is the particular reason for avoiding use of the JIT cache for this use case?

JIT compilation happens via the ptxas functionality incorporated into the CUDA driver. Pretty much everything that happens in the CUDA driver runs single-threaded. The performance is dominated primarily by single-thread CPU performance and secondarily by system memory performance. Use of a CPU with a base frequency >= 3.5 GHz is highly recommended. With the JIT cache I would have said using an NVMe SSD should also help, but whether that makes any difference when the JIT cache is not in use I do not know.

ptxas is an optimizing compiler and therefore compilation time generally increases with optimization level (0 through 3). I am assuming you want the resulting machine code to be fully optimized. If you are willing to dial down optimization level the question is: can optimization level be controlled for JIT compilation (for offline compilation this is simply a flag nvcc passes to ptxas)? Probably so, but I don’t know how that works.

Write your kernels in PTX instead of CUDA-C++. That should give you a very nice speedup. Of course, that’s also difficult and inconvenient…

I don’t think that will obviate the need for JIT compilation. The JIT mechanism converts PTX into SASS.

Here is another thought: The execution time of ptxas is also a function of the amount of code that needs to be handled. The LLVM-derived part of the CUDA toolchain normally inlines functions and unrolls loops aggressively. This can lead to very voluminous PTX code: I have encountered files with 60,000 lines of PTX code when looking into reports of excessive amounts of time being spent at the ptxas stage of compilation. The use of __noinline__ and #pragma unroll 1 may help with cutting down PTX code size, which may benefit compilation times in exchange for some amount of potential performance loss.
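
To make the last suggestion concrete, here is a minimal sketch of where the two annotations would go. The names horner, eval_kernel, kDegree and the HELPER macro are hypothetical and chosen only for illustration; the macro exists solely so the same file also builds with a plain host compiler. Whether this actually reduces PTX size or JIT time for a given code base is something you would have to measure.

```cpp
// HELPER expands to the CUDA qualifiers under nvcc and to nothing otherwise,
// so the helper can also be compiled and checked on the host.
#ifdef __CUDACC__
#define HELPER __host__ __device__ __noinline__
#else
#define HELPER
#endif

constexpr int kDegree = 7;  // hypothetical polynomial degree

// Horner evaluation of c[0] + c[1]*x + ... + c[kDegree]*x^kDegree.
// __noinline__ asks the compiler to keep one copy of this function
// instead of duplicating its body at every call site.
HELPER float horner(const float* c, float x)
{
    float acc = c[kDegree];
    // "unroll 1" asks the compiler not to unroll this loop, so the loop body
    // appears once in the generated code. Host compilers that do not know
    // this pragma ignore it.
#pragma unroll 1
    for (int i = kDegree - 1; i >= 0; --i) {
        acc = acc * x + c[i];
    }
    return acc;
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread evaluates the polynomial at one point.
__global__ void eval_kernel(const float* c, const float* x, float* y, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] = horner(c, x[idx]);
    }
}
#endif
```

Comparing the line count of the PTX produced with and without the two annotations (for example via nvcc -ptx) is a quick way to see whether they make a difference for your kernels.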