Driver JIT compilation

Assuming a CUDA application is compiled so that only the PTX is generated (say for compute capability 3.0), i.e. by specifying:

-gencode arch=compute_30,code=compute_30

At runtime, the driver will JIT-compile the PTX into SASS for the actual architecture of the running GPU, say sm_61 for a GTX 1080.
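
For concreteness, the full command line I have in mind is along these lines (file names are just illustrative):

nvcc -gencode arch=compute_30,code=compute_30 -o app app.cu

cuobjdump can then confirm that the binary embeds PTX but no SASS:

cuobjdump -ptx app     # prints the embedded PTX
cuobjdump -sass app    # should show that no SASS is embedded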

My question is whether the JIT compilation will result in SASS that is identical (i.e. fully optimized) to the SASS that would have been produced if the application had been compiled using the following option:

-gencode arch=compute_30,code=sm_61

Not necessarily. There is no stated guarantee that they will be identical.

By what mechanism does the driver compile the PTX into SASS then?

The JIT process is covered in the nvcc documentation:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#just-in-time-compilation

I acknowledge that this does not provide a detailed answer to “By what mechanism does the driver compile the PTX into SASS then?”

The best I can offer is that there is a tool/codepath in the driver that detects at runtime that a particular GPU binary contains PTX but not SASS, and follows a codepath similar to what is contained in the ptxas tool to generate suitable SASS on the fly.
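
One way to observe that codepath is to drive it explicitly through the CUDA driver API. Here is a minimal sketch; the trivial embedded PTX string and the optimization-level option are purely illustrative, not a description of the driver's internals:

/* jit_demo.c - build with: gcc jit_demo.c -I/usr/local/cuda/include -lcuda */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    /* A trivial PTX module containing one empty kernel. */
    const char *ptx =
        ".version 4.2\n"
        ".target sm_30\n"
        ".address_size 64\n"
        ".visible .entry kernel()\n"
        "{\n"
        "    ret;\n"
        "}\n";

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Loading PTX (rather than SASS) triggers the driver's embedded
       JIT compiler; CU_JIT_OPTIMIZATION_LEVEL (0-4, default 4) is a
       hint to that compiler. */
    CUjit_option opt[] = { CU_JIT_OPTIMIZATION_LEVEL };
    void *val[] = { (void *)(size_t)4 };
    if (cuModuleLoadDataEx(&mod, ptx, 1, opt, val) != CUDA_SUCCESS) {
        printf("JIT compilation failed\n");
        return 1;
    }

    cuModuleGetFunction(&fn, mod, "kernel");
    cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}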

If you require a more detailed description, I don’t have it and don’t know where to find it.

A few other comments not directly related to your question:

JIT compilation involves conversion of PTX into SASS, and it is always handled by the GPU driver.

Based on that statement, we can assume that different drivers may produce different SASS for otherwise identical inputs in a JIT-compilation scenario.

When we compare JIT-produced SASS vs. SASS produced by nvcc (as a result of specifying a real target architecture for code generation), note that nvcc produces its SASS using a separate tool called ptxas (you can find ptxas on your machine; it is a separate compiler/assembler in /usr/local/cuda/bin on an ordinary Linux install). The SASS produced by JIT compilation, in contrast, is generated by the driver itself, not by ptxas: ptxas is installed by the CUDA toolkit installer, and it need not be present on a machine to support JIT compilation; only the driver is required.
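
To make the offline half of that comparison concrete, the two stages can be run by hand (source file name and architectures are illustrative):

nvcc -arch=compute_61 -ptx kernel.cu -o kernel.ptx   # front end: CUDA C++ -> PTX
ptxas -arch=sm_61 -O3 kernel.ptx -o kernel.cubin     # ptxas: PTX -> SASS
cuobjdump -sass kernel.cubin                         # disassemble the result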

Therefore, since the SASS is generated by the driver in one case (via a ptxas-like codepath embedded in the driver) and by the separately installed ptxas tool in the other, we can assume that the generated SASS may differ, and, again, the CUDA documentation makes no claims otherwise.

I suppose my question is whether the codepath taken by the driver when assembling the PTX results in SASS that is at least as optimized as that produced by ptxas when -Xptxas=-O3 is passed to nvcc (the default). It could conceivably be even better optimized, if future driver versions improve the PTX->SASS stage and the application is not recompiled in the meantime.

Basically, should I be nervous about relying on JIT compilation for performance critical applications (assuming the initial JIT cost can be ignored)?

The only disadvantage listed by the documentation is: “The disadvantage of just in time compilation is increased application startup delay”. For cases where the JIT cache is not disabled, this can essentially be ignored.
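
For reference, the JIT cache is controlled through environment variables (documented in the CUDA programming guide); the values shown are illustrative:

CUDA_CACHE_DISABLE=1             # turn the JIT cache off entirely
CUDA_CACHE_PATH=/path/of/choice  # relocate the cache (default on Linux is ~/.nv/ComputeCache)
CUDA_CACHE_MAXSIZE=1073741824    # maximum cache size, in bytes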

As txbob says, there are no guarantees. Based on a limited number of observations, I would say:

[1] The PTXAS component of the driver appears to be updated more frequently than the PTXAS component of the offline compiler, which is updated only when a new CUDA version ships.

[2] Newer PTXAS versions found in a driver can contain bug fixes and performance enhancements not found in the offline compiler’s PTXAS component; however, the introduction of new bugs and performance regressions is also possible, as with any revision to complex software.

[3] In general, the PTXAS component of the driver and the PTXAS component of the offline compiler seem to generate very similar or identical code, based on visual comparison of the machine code (extracted from the JIT cache and the executable, respectively); see the commands below for an easy way to exercise both paths.
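
For anyone who wants to exercise both paths with the same fat binary, the documented environment variable CUDA_FORCE_PTX_JIT makes that easy (it does not print the JIT-generated SASS, but it allows a direct comparison of the two paths; the JIT result lands in the cache):

cuobjdump -sass ./app         # SASS generated offline by ptxas and embedded in the executable
CUDA_FORCE_PTX_JIT=1 ./app    # ignore embedded SASS; force the driver to JIT the embedded PTX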

In addition to JIT compilation from PTX, CUDA also supports JIT compilation from high-level language (HLL) source via NVRTC, and we recently discussed a case in these forums where the code generated by NVRTC seemed to miss optimizations performed by the offline compiler. This was just a single observation, and it wasn’t immediately clear whether the differences were due to different compiler configurations or to the compilers themselves.
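
For reference, a minimal sketch of that NVRTC path (the kernel source and option string are illustrative); the resulting PTX would then typically be handed to cuModuleLoadDataEx, i.e. to the same driver JIT discussed above:

/* nvrtc_demo.c - build with: gcc nvrtc_demo.c -I/usr/local/cuda/include -lnvrtc */
#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *src = "__global__ void kernel() { }\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "kernel.cu", 0, NULL, NULL);

    /* Compile to PTX for a virtual architecture; options mirror nvcc's. */
    const char *opts[] = { "--gpu-architecture=compute_30" };
    nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

    if (rc == NVRTC_SUCCESS) {
        size_t n = 0;
        nvrtcGetPTXSize(prog, &n);
        char *ptx = (char *)malloc(n);
        nvrtcGetPTX(prog, ptx);
        printf("%s", ptx);   /* this PTX would then go to the driver JIT */
        free(ptx);
    }

    nvrtcDestroyProgram(&prog);
    return 0;
}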

In practical terms, I would claim that it is a best practice to ship CUDA applications with machine code for all intended target architectures baked into the executable (i.e. a fat binary), and include PTX only for the most recent architecture, to be used for JIT compilation on future architectures not yet available at the time the software ships.
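
In nvcc terms, that best practice looks something like this (the architecture list is illustrative; tailor it to your intended targets):

nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61 \
     -o app app.cu

The first three -gencode clauses bake SASS for each intended target into the fat binary; the last clause embeds PTX for the newest architecture, to be JIT-compiled on GPUs that do not exist yet.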

Using JIT compilation from either PTX or HLL source for run-time code creation is a separate use case of course. There are plenty of applications that make use of that to dynamically compile user queries or formulas.

Ok great, thanks for the info :)