How to speed up JIT compilation?

eifjccfuuehfn · December 23, 2021, 8:18pm

My CUDA application has a very long load time in the beginning, due to JIT compilation.

And I have restrictions that don’t allow me to build the GPU binary code with NVCC, and I also cannot depend on a JIT cache.

Is there a way to speed up JIT compilation itself? I noticed that it doesn’t use all the CPU cores. Is there a way to make it use them?

njuffa · December 23, 2021, 9:17pm

That is called an overconstrained problem in engineering terms. I am curious: What is the particular reason for avoiding use of the JIT cache for this use case?

JIT compilation happens via the ptxas functionality incorporated into the CUDA driver. Pretty much everything that happens in the CUDA driver is running single threaded. The performance is dominated primarily by single-thread CPU performance and secondarily by system memory performance. Use of a CPU with a base frequency >= 3.5 GHz is highly recommended. With JIT cache I would have said using an NVMe SSD should also help, but whether that makes any difference when the JIT cache is not in use I do not know.

ptxas is an optimizing compiler and therefore compilation time generally increases with optimization level (0 through 3). I am assuming you want the resulting machine code to be fully optimized. If you are willing to dial down optimization level the question is: can optimization level be controlled for JIT compilation (for offline compilation this is simply a flag nvcc passes to ptxas)? Probably so, but I don’t know how that works.
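
For what it is worth, the driver API documents a JIT option named CU_JIT_OPTIMIZATION_LEVEL (range 0 through 4, with 4 being the default) that can be passed to cuModuleLoadDataEx. The following is a minimal sketch, assuming the application loads its PTX through the driver API. The embedded PTX string, the kernel name empty_kernel, and the choice of level 1 are placeholders for illustration only; whether a lower level noticeably shortens load time for a real code base would have to be measured.

```cuda
#include <cuda.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Placeholder PTX (an empty kernel). A real application would pass its own PTX here.
static const char *kPtx =
    ".version 6.0\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry empty_kernel()\n"
    "{\n"
    "    ret;\n"
    "}\n";

// Abort with a readable message if a driver API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult r_ = (call);                                         \
        if (r_ != CUDA_SUCCESS) {                                     \
            const char *msg_ = nullptr;                               \
            cuGetErrorString(r_, &msg_);                              \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    msg_ ? msg_ : "unknown error");                   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // Assumption for illustration: level 1 instead of the default 4.
    unsigned int optLevel = 1;
    CUjit_option options[] = { CU_JIT_OPTIMIZATION_LEVEL };
    // The option value is passed by value, cast into the void* slot.
    void *optionValues[] = { (void *)(size_t)optLevel };

    // Time the module load, which is where the PTX is JIT-compiled.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cuModuleLoadDataEx(&mod, kPtx, 1, options, optionValues));
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("module load at JIT optimization level %u took %.3f ms\n",
           optLevel, ms);

    CHECK(cuModuleUnload(mod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

This should build with something like nvcc jit_opt.cu -lcuda (the file name is arbitrary). When comparing levels, setting the environment variable CUDA_CACHE_DISABLE=1 keeps the JIT cache from hiding the compilation time. A trivial kernel like the placeholder will not show a meaningful difference; the comparison is only informative with the application's actual PTX.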

epk · December 23, 2021, 11:10pm

Write your kernels in PTX instead of CUDA-C++. That should give you a very nice speedup. Of course, that’s also difficult and inconvenient…

Robert_Crovella · December 23, 2021, 11:37pm

I don’t think that will obviate the need for JIT compilation. The JIT mechanism converts PTX into SASS.

njuffa · December 24, 2021, 12:49am

Here is another thought: The execution time of ptxas is also a function of the amount of code that needs to be handled. The LLVM-derived part of the CUDA toolchain normally inlines functions and unrolls loops aggressively. This can lead to very voluminous PTX code: I have encountered files with 60,000 lines of PTX code when looking into reports of excessive amounts of time being spent at the ptxas stage of compilation. The use of __noinline__ and #pragma unroll 1 may help with cutting down PTX code size, which may benefit compilation times in exchange for some amount of potential performance loss.
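
As a rough illustration of those two controls, here is a minimal sketch. The helper refine, the kernel refine_all, and the Newton iteration inside the loop are made up for the example; only __noinline__ and #pragma unroll 1 are the actual point. Note that __noinline__ is a hint, so the compiler may not honor it in every case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device helper. __noinline__ asks the compiler to emit one
// function body and call it, instead of copying the body into every call site.
__device__ __noinline__ float refine(float x, int steps)
{
    // "unroll 1" tells the compiler not to unroll this loop, so the loop body
    // appears only once in the generated PTX.
    #pragma unroll 1
    for (int i = 0; i < steps; ++i) {
        x = 0.5f * (x + 2.0f / x);  // Newton step toward sqrt(2)
    }
    return x;
}

__global__ void refine_all(float *out, int n, int steps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = refine(1.0f + (float)i, steps);
    }
}

int main()
{
    const int n = 256;
    float *d_out = nullptr;
    float h_out[n];

    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    refine_all<<<(n + 127) / 128, 128>>>(d_out, n, 20);

    // cudaMemcpy synchronizes with the kernel and reports launch/run errors.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("out[0] = %f, out[%d] = %f\n", h_out[0], n - 1, h_out[n - 1]);
    cudaFree(d_out);
    return 0;
}
```

One way to check whether this helps is to generate the PTX with and without the two annotations (for example with nvcc -ptx) and compare the line counts before measuring JIT time. In a toy case like this the difference is small; the effect is expected to matter mainly for large kernels with many call sites and deep loop nests.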
