Driver JIT compilation

Assuming a CUDA application is compiled so that only the PTX is generated (say for compute capability 3.0), i.e. by specifying:

-gencode arch=compute_30,code=compute_30

At runtime, the driver will JIT-compile the PTX into SASS for the actual architecture of the running GPU, say sm_61 for a GTX 1080.
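
For concreteness, the full command line I have in mind is along these lines (file names are just illustrative):

nvcc -gencode arch=compute_30,code=compute_30 -o app app.cu

cuobjdump can then confirm that the binary embeds PTX but no SASS:

cuobjdump -ptx app     # prints the embedded PTX
cuobjdump -sass app    # should show that no SASS is embedded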

My question is whether the JIT compilation will result in SASS that is identical (i.e. fully optimized) to the SASS that would have been produced if the application had been compiled using the following option:

-gencode arch=compute_30,code=sm_61

Not necessarily. There is no stated guarantee that they will be identical.

By what mechanism does the driver compile the PTX into SASS then?

The JIT process is covered in the nvcc documentation:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#just-in-time-compilation

I acknowledge that this does not provide a detailed answer to “By what mechanism does the driver compile the PTX into SASS then?”

The best I can offer is that there is a tool/codepath in the driver that detects at runtime that a particular GPU binary contains PTX but not SASS, and follows a codepath similar to what is contained in the ptxas tool to generate suitable SASS on the fly.
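
One way to observe that codepath is to drive it explicitly through the CUDA driver API. Here is a minimal sketch; the trivial embedded PTX string and the optimization-level option are purely illustrative, not a description of the driver's internals:

/* jit_demo.c - build with: gcc jit_demo.c -I/usr/local/cuda/include -lcuda */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    /* A trivial PTX module containing one empty kernel. */
    const char *ptx =
        ".version 4.2\n"
        ".target sm_30\n"
        ".address_size 64\n"
        ".visible .entry kernel()\n"
        "{\n"
        "    ret;\n"
        "}\n";

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Loading PTX (rather than SASS) triggers the driver's embedded
       JIT compiler; CU_JIT_OPTIMIZATION_LEVEL (0-4, default 4) is a
       hint to that compiler. */
    CUjit_option opt[] = { CU_JIT_OPTIMIZATION_LEVEL };
    void *val[] = { (void *)(size_t)4 };
    if (cuModuleLoadDataEx(&mod, ptx, 1, opt, val) != CUDA_SUCCESS) {
        printf("JIT compilation failed\n");
        return 1;
    }

    cuModuleGetFunction(&fn, mod, "kernel");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}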

If you require a more detailed description, I don’t have it and don’t know where to find it.

A few other comments not directly related to your question:

JIT compilation involves conversion of PTX into SASS, and it is always handled by the GPU driver.

Based on that statement, we can assume that different drivers may produce different SASS for otherwise identical inputs in a JIT-compilation scenario.

When we compare JIT-produced SASS vs. SASS produced by nvcc (as a result of specifying a real target architecture for code generation), note that nvcc produces its SASS using a separate tool called ptxas (you can find ptxas on your machine; it is a separate compiler/assembler in /usr/local/cuda/bin on an ordinary Linux install). The SASS produced by JIT compilation, in contrast, is generated by the driver itself, not by ptxas: ptxas is installed by the CUDA toolkit installer, and it need not be present on a machine to support JIT compilation; only the driver is required.
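
To make the offline half of that comparison concrete, the two stages can be run by hand (source file name and architectures are illustrative):

nvcc -arch=compute_61 -ptx kernel.cu -o kernel.ptx   # front end: CUDA C++ -> PTX
ptxas -arch=sm_61 -O3 kernel.ptx -o kernel.cubin     # ptxas: PTX -> SASS
cuobjdump -sass kernel.cubin                         # disassemble the result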

Therefore, since the SASS is generated by the driver in one case (via a ptxas-like codepath embedded in the driver) and by the separately installed ptxas tool in the other, we can assume that the generated SASS may differ, and, again, the CUDA documentation makes no claims otherwise.

I suppose my question is whether the codepath taken by the driver when assembling the PTX results in SASS that is at least as optimized as that produced by ptxas when -Xptxas=-O3 is passed to nvcc (the default). It could conceivably be even better optimized, if future driver versions improve the PTX->SASS stage and the application is not recompiled in the meantime.

Basically, should I be nervous about relying on JIT compilation for performance critical applications (assuming the initial JIT cost can be ignored)?

The only disadvantage listed by the documentation is: “The disadvantage of just in time compilation is increased application startup delay”. For cases where the JIT cache is not disabled, this can essentially be ignored.
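
For reference, the JIT cache is controlled through environment variables (documented in the CUDA programming guide); the values shown are illustrative:

CUDA_CACHE_DISABLE=1             # turn the JIT cache off entirely
CUDA_CACHE_PATH=/path/of/choice  # relocate the cache (default on Linux is ~/.nv/ComputeCache)
CUDA_CACHE_MAXSIZE=1073741824    # maximum cache size, in bytes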

As txbob says, there are no guarantees. Based on a limited number of observations, I would say:

[1] The PTXAS component of the driver appears to be updated more frequently than the PTXAS component of the offline compiler, which is updated only when a new CUDA version ships.

[2] Newer PTXAS versions found in a driver can contain bug fixes and performance enhancements not found in the offline compiler’s PTXAS component; however, the introduction of new bugs and performance regressions is also possible, as with any revision to complex software.

[3] In general, the PTXAS component of the driver and the PTXAS component of the offline compiler seem to generate very similar or identical code, based on visual comparison of the machine code (extracted from the JIT cache and the executable, respectively); see the commands below for an easy way to exercise both paths.
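
For anyone who wants to exercise both paths with the same fat binary, the documented environment variable CUDA_FORCE_PTX_JIT makes that easy (it does not print the JIT-generated SASS, but it allows a direct comparison of the two paths; the JIT result lands in the cache):

cuobjdump -sass ./app         # SASS generated offline by ptxas and embedded in the executable
CUDA_FORCE_PTX_JIT=1 ./app    # ignore embedded SASS; force the driver to JIT the embedded PTX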

In addition to JIT compilation from PTX, CUDA also supports JIT compilation from high-level language (HLL) source via NVRTC, and we recently discussed a case in these forums where the code generated by NVRTC seemed to miss optimizations performed by the offline compiler. This was just a single observation, and it wasn’t immediately clear whether the differences were due to different compiler configurations or to the compilers themselves.
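
For reference, a minimal sketch of that NVRTC path (the kernel source and option string are illustrative); the resulting PTX would then typically be handed to cuModuleLoadDataEx, i.e. to the same driver JIT discussed above:

/* nvrtc_demo.c - build with: gcc nvrtc_demo.c -I/usr/local/cuda/include -lnvrtc */
#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *src = "__global__ void kernel() { }\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "kernel.cu", 0, NULL, NULL);

    /* Compile to PTX for a virtual architecture; options mirror nvcc's. */
    const char *opts[] = { "--gpu-architecture=compute_30" };
    nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

    if (rc == NVRTC_SUCCESS) {
        size_t n = 0;
        nvrtcGetPTXSize(prog, &n);
        char *ptx = (char *)malloc(n);
        nvrtcGetPTX(prog, ptx);
        printf("%s", ptx);   /* this PTX would then go to the driver JIT */
        free(ptx);
    }

    nvrtcDestroyProgram(&prog);
    return 0;
}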

In practical terms, I would claim that it is a best practice to ship CUDA applications with machine code for all intended target architectures baked into the executable (i.e. a fat binary), and include PTX only for the most recent architecture, to be used for JIT compilation on future architectures not yet available at the time the software ships.
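
In nvcc terms, that best practice looks something like this (the architecture list is illustrative; tailor it to your intended targets):

nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61 \
     -o app app.cu

The first three -gencode clauses bake SASS for each intended target into the fat binary; the last clause embeds PTX for the newest architecture, to be JIT-compiled on GPUs that do not exist yet.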

Using JIT compilation from either PTX or HLL source for run-time code creation is a separate use case of course. There are plenty of applications that make use of that to dynamically compile user queries or formulas.

Ok great, thanks for the info :)