Does the JIT compiler perform device link-time optimization?

phw89 · May 5, 2021, 1:24pm

Before device link-time optimization (DLTO) was introduced in CUDA 11.2, it was relatively easy to ensure forward compatibility without worrying too much about differences in performance. You would typically just create a fatbinary containing PTX for the lowest possible arch and SASS for the specific architectures you would normally target. For any future GPU architectures, the JIT compiler would then assemble the PTX into SASS optimized for that specific GPU arch.

Now, however, with DLTO, it is less clear to me how to ensure forward compatibility and maintain performance on those future architectures.

Let’s say I compile/link an application using nvcc with the following options:

Compile
-gencode=arch=compute_52,code=[compute_52,lto_52,lto_61]

Link
-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto

This will create a fatbinary containing PTX for sm_52, LTO intermediaries for sm_52 and sm_61, and link-time optimized SASS for sm_52 and sm_61 (at least, this is what I see when dumping the resulting fatbin sections using cuobjdump -all).
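
For reference, the options above could be used in a full build roughly as follows. This is only a minimal sketch: the file names a.cu, b.cu and app are hypothetical placeholders, and the build is skipped when nvcc or the sources are not present.

```shell
#!/bin/sh
# Hypothetical source and output names; adjust to your project.
SRCS="a.cu b.cu"
APP="app"

# Compile step: PTX for compute_52 plus LTO intermediates for sm_52 and sm_61.
COMPILE_GENCODE="-gencode=arch=compute_52,code=[compute_52,lto_52,lto_61]"
# Link step: link-time optimized SASS for sm_52 and sm_61.
LINK_GENCODE="-gencode=arch=compute_52,code=[sm_52,sm_61]"

build_with_dlto() {
    objs=""
    for src in $SRCS; do
        obj="${src%.cu}.o"
        # -dc produces relocatable device code, which device linking requires.
        nvcc -dc "$COMPILE_GENCODE" "$src" -o "$obj" || return 1
        objs="$objs $obj"
    done
    # -dlto asks the device linker to perform link-time optimization.
    # $objs is left unquoted on purpose so it splits into separate arguments.
    nvcc "$LINK_GENCODE" -dlto $objs -o "$APP" || return 1
    # Dump all fatbin sections to see which PTX/LTO/SASS entries were embedded.
    cuobjdump -all "$APP"
}

# Only attempt the build if the toolchain and the placeholder sources exist.
if command -v nvcc >/dev/null 2>&1 && [ -f a.cu ] && [ -f b.cu ]; then
    build_with_dlto
else
    echo "nvcc or sources not found; skipping build"
fi
```

The gencode values are quoted so the shell never treats the square brackets as a glob pattern.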

Assuming the above is correct, what happens when the application is run on a later GPU architecture (e.g. sm_70)? Does the JIT compiler just assemble the sm_52 PTX without using link-time optimization (resulting in less optimal code)? Or does it somehow link the LTO intermediaries using link-time optimization? Is there a way to determine/guide what the JIT compiler is doing?

phw89 · May 5, 2021, 1:54pm

Possibly related?

phw89 · November 23, 2022, 10:14am

I have now heard back from a member of the NVIDIA driver team:

Prior to the driver version released with CUDA Toolkit 12.0, the driver would JIT the highest arch available, regardless of whether it was PTX or LTO NVVM-IR. However, JIT compilation of NVVM was not guaranteed to be forward compatible with later architectures (this could cause applications to fail with a “device kernel image is invalid” CUDA error).

Therefore, starting with the CUDA 12.0 driver, the driver will only JIT the highest PTX available, i.e. it will not JIT NVVM code.
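
Given that behaviour, one way to get a feel for how an application might behave on a future architecture is to make the driver ignore the embedded SASS and JIT the PTX on a GPU you already have. The following is a minimal sketch, assuming a hypothetical executable ./app produced by a DLTO build. CUDA_FORCE_PTX_JIT and CUDA_CACHE_DISABLE are documented CUDA environment variables; the helper name run_ptx_only is made up for illustration.

```shell
#!/bin/sh
APP="./app"   # hypothetical executable name

# Run a command with the driver forced onto the PTX JIT path.
run_ptx_only() {
    # CUDA_FORCE_PTX_JIT=1: ignore embedded SASS and JIT-compile the PTX instead.
    # CUDA_CACHE_DISABLE=1: bypass the JIT cache so compilation really happens.
    CUDA_FORCE_PTX_JIT=1 CUDA_CACHE_DISABLE=1 "$@"
}

if [ -x "$APP" ]; then
    echo "Run using embedded SASS (if it matches the GPU):"
    "$APP"
    echo "Run using JIT-compiled PTX only:"
    run_ptx_only "$APP"
else
    echo "$APP not found; skipping"
fi
```

Comparing the two runs (for example with a profiler or the application's own timers) gives a rough idea of what is lost when the non-LTO PTX path is taken. If the second run fails to load its kernels, the binary does not contain usable PTX.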

