Originally published at: https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/
CUDA 11.2 features the powerful link time optimization (LTO) feature for device code in GPU-accelerated applications. Device LTO brings the performance advantages of device code optimization that were only possible in the nvcc whole program compilation mode to the nvcc separate compilation mode, which was introduced in CUDA 5.0. Separate compilation mode allows CUDA device kernel…
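As a sketch of how device LTO slots into separate compilation (file names here are hypothetical), the only change is passing -dlto at both compile time and device-link time:

```shell
# Separate compilation with device LTO (CUDA 11.2+).
# Compile each translation unit to an object carrying LTO-IR
# instead of fully lowered device code:
nvcc -dc -dlto kernel_a.cu -o kernel_a.o
nvcc -dc -dlto kernel_b.cu -o kernel_b.o

# Pass -dlto at the link step too, so cross-file optimization
# (e.g. inlining across kernel_a.cu and kernel_b.cu) happens
# during the device link:
nvcc -dlto kernel_a.o kernel_b.o -o app
```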
Figure 2 seems to be wrong, it’s the same as Figure 1. Also it would be nice to get the figures in a higher resolution.
@rkobus – Sorry about that! It’s fixed now. Hope the larger size helps as well. Thanks for the feedback!
Is the MonteCarlo benchmark in the CUDA 11.2 sample code?
No. It was used for some internal benchmarking. Unfortunately, most of the sample code does not involve separate compilation, so it does not make a good test for LTO.
Good question. We are working on support for JIT LTO, but it is not supported in 11.2. So in the example you give, at JIT time it will JIT each individual PTX to a cubin and then do a cubin link. This is the same as we have always done for JIT linking, but we should have more support for JIT LTO in future releases.
@mmurphy1 Thanks for the reply - I look forward to seeing more information in the future :)
Would you also be able to shed any light on the following: “Using device link-time optimization results in much larger fatbinaries”
@mmurphy1 Are there any reasons that DLTO cannot achieve the same runtime performance as whole-program compilation? Performing DLTO should be able to inline and optimize all functions, and thus generate the same code as whole-program compilation, unless the linker does not always inline and optimize the code (for example, because DLTO doesn’t have enough memory to perform the linking?).
DLTO should provide the same runtime performance as whole-program compilation. If doing “partial LTO,” where some objects were not compiled with -dlto, then the scope of optimization will be smaller.
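For instance, a “partial LTO” build mixing LTO and non-LTO objects might look like this (file names hypothetical):

```shell
# a.o carries LTO-IR and can participate in link-time optimization:
nvcc -dc -dlto a.cu -o a.o

# b.o was compiled without -dlto, so it is already lowered and
# sits outside the LTO optimization scope:
nvcc -dc b.cu -o b.o

# The device link still succeeds, but cross-module optimization
# cannot reach into code that came from b.o:
nvcc -dlto a.o b.o -o app
```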
“but in 11.2 it is not supported.”
Ok, is it supported in 11.7?
JIT LTO is supported as of 11.4, but only as a preview feature. There will be a change to the interface in 12.0 to better support our compatibility guarantees.
Thanks! However, judging by the release notes for 11.4, it looks to be more for manual JIT (i.e. explicitly invoking nvcc), whereas I was thinking more about the “automatic” JIT that the NVIDIA GPU driver performs if a fatbinary doesn’t include SASS for the target GPU arch. Do you know how/if the driver handles DLTO JIT?
That is correct: JIT LTO is only supported manually at this time, not as part of the automatic or implicit runtime JIT. JIT linking at the ELF level is supported in the runtime. By default, when you compile with -dlto -dc, both LTO-IR and PTX are stored in the fatbinary, so if you move to a newer chip the driver will JIT compile and link the PTX and your code will still work functionally, but you won’t get the LTO optimization from that. This is something that we may release later, depending on customer feedback.
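You can check what actually landed in a fatbinary with cuobjdump; for a hypothetical object a.o compiled with -dc -dlto:

```shell
# List the PTX entries embedded in the fatbinary (these are what
# the driver JITs for forward compatibility on newer chips):
cuobjdump -lptx a.o

# List the ELF (cubin) entries, if SASS was also generated for
# specific architectures:
cuobjdump -lelf a.o
```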
Thanks for the clarification :) Having the driver automatically perform LTO when JIT compiling/linking from PTX/LTO-IR would be a great feature from our point of view, so fingers crossed!
@mmurphy1 apologies for digging up an old thread, just checking whether LTO optimization ever made its way into the driver JIT, or if there are plans to do so? I checked the release notes up to CUDA 12.6 (currently the latest release), but didn’t see anything immediately obvious.
Sorry for the late reply; I was on vacation. What we did was provide a new runtime library, libnvJitLink, for doing runtime LTO. This is supported as of 12.0. With the library you can do JIT linking of LTO-IR (or PTX or cubins). See the nvJitLink documentation in the 12.x CUDA toolkits.
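A minimal sketch of that flow with libnvJitLink (CUDA 12.0+), assuming you already have an LTO-IR blob in memory (produced by nvcc -dc -dlto or NVRTC with -dlto); error handling is abbreviated, and names like ltoir_data are placeholders:

```cpp
#include <nvJitLink.h>
#include <cstddef>
#include <vector>

// Links an LTO-IR module into a cubin at runtime, performing
// link-time optimization as part of the JIT link.
std::vector<char> link_ltoir(const void* ltoir_data, size_t ltoir_size) {
    nvJitLinkHandle handle;
    // -lto enables link-time optimization; -arch picks the target
    // architecture (sm_80 here is just an example).
    const char* opts[] = {"-lto", "-arch=sm_80"};
    nvJitLinkCreate(&handle, 2, opts);

    // Add the LTO-IR input; "user_module" is an arbitrary label
    // used in diagnostics.
    nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR,
                     ltoir_data, ltoir_size, "user_module");

    // Run the optimizing link.
    nvJitLinkComplete(handle);

    // Retrieve the linked cubin.
    size_t cubin_size = 0;
    nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
    std::vector<char> cubin(cubin_size);
    nvJitLinkGetLinkedCubin(handle, cubin.data());

    nvJitLinkDestroy(&handle);
    return cubin;  // load with cuModuleLoadData or cuLibraryLoadData
}
```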