Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization

jwitsoe · February 13, 2021, 1:27am

Originally published at: https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/

CUDA 11.2 introduces link time optimization (LTO) for device code in GPU-accelerated applications. Device LTO brings the performance advantages of device code optimization that were only possible in the nvcc whole program compilation mode to the nvcc separate compilation mode, which was introduced in CUDA 5.0. Separate compilation mode allows CUDA device kernel…

rkobus · February 15, 2021, 10:08am

Figure 2 seems to be wrong; it’s the same as Figure 1. Also, it would be nice to get the figures in a higher resolution.

jwitsoe · February 17, 2021, 9:41pm

@rkobus – Sorry about that! It’s fixed now. Hope the larger size helps as well. Thanks for the feedback!

phw89 · May 5, 2021, 1:29pm

echobrad · May 14, 2021, 3:28pm

Is the MonteCarlo benchmark in the CUDA 11.2 sample code?

mmurphy1 · May 15, 2021, 5:06am

No. It was used for some internal benchmarking. Unfortunately, most of the sample code does not involve separate compilation, so it is not a good test for LTO.

mmurphy1 · May 15, 2021, 5:12am

Good question. We are working on support for JIT LTO, but in 11.2 it is not supported. So in the example you give, at JIT time each individual PTX is JIT-compiled to a cubin, and the cubins are then linked. This is the same as we have always done for JIT linking. But we should have more support for JIT LTO in future releases.
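For readers who want to see what PTX JIT linking looks like when driven by hand, here is a minimal sketch using the CUDA driver API linker calls (`cuLinkCreate`, `cuLinkAddFile`, `cuLinkComplete`). The file name `jit_link.cu`, the build commands in the comments, and the idea of passing PTX paths on the command line are assumptions for illustration. The thread does not say whether the implicit runtime path uses these same calls.

```cuda
// jit_link.cu: hypothetical helper that JIT-links the PTX files named on the command line.
// Assumed build:           nvcc -o jit_link jit_link.cu -lcuda
// Assumed PTX generation:  nvcc -rdc=true -ptx -o a.ptx a.cu
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a driver API call fails.
static void check(CUresult result, const char* what) {
    if (result != CUDA_SUCCESS) {
        const char* msg = nullptr;
        cuGetErrorString(result, &msg);
        std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
        std::exit(EXIT_FAILURE);
    }
}

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s a.ptx [b.ptx ...]\n", argv[0]);
        return EXIT_FAILURE;
    }

    // Set up a context on device 0 via the primary context.
    check(cuInit(0), "cuInit");
    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");
    CUcontext ctx;
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");
    check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

    // Add every PTX file to one link state; no JIT options are passed.
    CUlinkState link;
    check(cuLinkCreate(0, nullptr, nullptr, &link), "cuLinkCreate");
    for (int i = 1; i < argc; ++i) {
        check(cuLinkAddFile(link, CU_JIT_INPUT_PTX, argv[i], 0, nullptr, nullptr),
              "cuLinkAddFile");
    }

    // Finish the link. The cubin buffer is owned by the link state,
    // so the module is loaded before the link state is destroyed.
    void* cubin = nullptr;
    size_t cubinSize = 0;
    check(cuLinkComplete(link, &cubin, &cubinSize), "cuLinkComplete");

    CUmodule mod;
    check(cuModuleLoadData(&mod, cubin), "cuModuleLoadData");
    std::printf("linked %d PTX file(s) into a %zu-byte cubin\n", argc - 1, cubinSize);

    check(cuModuleUnload(mod), "cuModuleUnload");
    check(cuLinkDestroy(link), "cuLinkDestroy");
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return EXIT_SUCCESS;
}
```

Nothing in this sketch requests link time optimization, which fits the point above that a plain PTX JIT link works functionally but is not LTO.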

phw89 · May 20, 2021, 2:24pm

@mmurphy1 Thanks for the reply - I look forward to seeing more information in the future :)

Would you also be able to shed any light on the following: Using device link-time optimization results in much larger fatbinaries

echobrad · September 13, 2021, 8:52pm

@mmurphy1 Is there any reason DLTO cannot achieve the same runtime performance as whole program compilation? DLTO should be able to inline and optimize all functions, and thus generate the same code as whole program compilation, unless the linker does not always inline and optimize the code (perhaps because DLTO doesn’t have enough memory to perform the linking?).

mmurphy1 · September 21, 2021, 6:05pm

DLTO should provide the same runtime performance as whole program compilation. With “partial LTO”, where some objects were not compiled with -dlto, the scope of optimization will be smaller.
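To make the difference between full and partial LTO concrete, here is a minimal sketch of a separate-compilation build. It uses one source file compiled twice with different macros so that the device function and the kernel that calls it end up in different objects. The file name `lto_demo.cu`, the macros `UNIT_A` and `UNIT_B`, and the `sm_70` architecture are all assumptions for illustration.

```cuda
// lto_demo.cu: one source file built as two translation units (hypothetical names).
// Assumed build commands (sm_70 is a placeholder architecture):
//   nvcc -dc -dlto -arch=sm_70 -DUNIT_A -o unit_a.o lto_demo.cu
//   nvcc -dc -dlto -arch=sm_70 -DUNIT_B -o unit_b.o lto_demo.cu
//   nvcc -dlto -arch=sm_70 unit_a.o unit_b.o -o lto_demo
// For "partial LTO", drop -dlto from one of the two compile commands.
#include <cstdio>
#include <cuda_runtime.h>

// Declared in both units, defined only in unit A.
__device__ float scale(float x);

#ifdef UNIT_A
// Small device function that the kernel in unit B calls across the object boundary.
__device__ float scale(float x) { return 2.0f * x; }
#endif

#ifdef UNIT_B
// The kernel cannot see the body of scale() at compile time; only the
// device link step (with -dlto) has both pieces available for inlining.
__global__ void apply(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i]);
}

int main() {
    const int n = 4;
    float host[n] = {1.0f, 2.0f, 3.0f, 4.0f};

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    apply<<<1, n>>>(dev, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Each input value should come back doubled.
    for (int i = 0; i < n; ++i) std::printf("%g\n", host[i]);
    return 0;
}
#endif
```

With `-dlto` on both compile commands, the link step sees intermediate code for both `scale` and `apply`. If one object is built without it, that object falls outside the scope of optimization, which is the partial case described above.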

epk · May 16, 2022, 8:28pm

“but in 11.2 it is not supported.”

Ok, is it supported in 11.7?

mmurphy1 · May 17, 2022, 8:10pm

JIT LTO is supported as of 11.4, but only as a preview feature. There will be a change to the interface in 12.0 to better support our compatibility guarantees.

phw89 · November 15, 2022, 11:45am

Thanks! However, judging by the release notes for 11.4, it looks to be more for manual JIT (i.e. explicitly invoking nvcc), whereas I was thinking more about the “automatic” JIT that the NVIDIA GPU driver performs if a fatbinary doesn’t include SASS for the target GPU arch. Do you know how/if the driver handles DLTO JIT?

mmurphy1 · November 15, 2022, 5:38pm

That is correct: JIT LTO is only supported manually at this time, not as part of the automatic or implicit runtime. JIT linking at the ELF level is supported in the runtime. By default, when you compile with -dlto -dc, both LTO-IR and PTX are stored in the fatbinary, so if you update your chip, the PTX will be JIT compiled and linked. That works functionally, but you won’t get the LTO optimization from it. This is something that we may release later, depending on customer feedback.

phw89 · November 16, 2022, 1:00pm

Thanks for the clarification :) Having the driver automatically perform LTO when JIT compiling/linking from PTX/LTO-IR would be a great feature from our point of view, so fingers crossed!