Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization

Originally published at: Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization | NVIDIA Developer Blog

CUDA 11.2 introduces link time optimization (LTO) for device code in GPU-accelerated applications. Device LTO brings the performance advantages of device code optimization that were previously possible only in the nvcc whole program compilation mode to the nvcc separate compilation mode, which was introduced in CUDA 5.0. Separate compilation mode allows CUDA device kernel…

Figure 2 seems to be wrong; it’s the same as Figure 1. Also, it would be nice to get the figures in a higher resolution.

@rkobus – Sorry about that! It’s fixed now. Hope the larger size helps as well. Thanks for the feedback!

Is the MonteCarlo benchmark in the CUDA 11.2 sample code?

No, it was used for some internal benchmarking. Unfortunately, most of the sample code does not involve separate compilation, so it is not a good test for LTO.

Good question. We are working on support for JIT LTO, but it is not supported in 11.2. So in the example you give, at JIT time each individual PTX will be JIT-compiled to a cubin, followed by a cubin link. This is the same as we have always done for JIT linking. We should have more support for JIT LTO in future releases.
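For anyone who wants to see what that JIT link path looks like from the host side, here is a minimal sketch using the driver API linker calls. The file names a.ptx and b.ptx, the compute_70 target, and the build lines in the comments are assumptions for illustration, not taken from the thread; the PTX-to-cubin step happens inside the link calls, as described above.

```cuda
// jit_link.cpp: host-only sketch of JIT linking two PTX files (no LTO involved).
// Assumed build line:   nvcc jit_link.cpp -o jit_link -lcuda
// Assumed inputs (hypothetical names), e.g. produced with:
//   nvcc -ptx -rdc=true -arch=compute_70 a.cu
//   nvcc -ptx -rdc=true -arch=compute_70 b.cu
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a driver API call fails.
static void check(CUresult r, const char* what) {
    if (r != CUDA_SUCCESS) {
        const char* msg = nullptr;
        cuGetErrorString(r, &msg);
        std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    check(cuInit(0), "cuInit");

    CUdevice dev;
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");

    // Use the primary context of device 0.
    CUcontext ctx;
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");
    check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

    // Create a linker invocation with default options.
    CUlinkState link;
    check(cuLinkCreate(0, nullptr, nullptr, &link), "cuLinkCreate");

    // Each PTX input is compiled for the current device as it is added.
    check(cuLinkAddFile(link, CU_JIT_INPUT_PTX, "a.ptx", 0, nullptr, nullptr),
          "cuLinkAddFile(a.ptx)");
    check(cuLinkAddFile(link, CU_JIT_INPUT_PTX, "b.ptx", 0, nullptr, nullptr),
          "cuLinkAddFile(b.ptx)");

    // Link the compiled pieces into a single cubin image.
    void* cubin = nullptr;
    size_t cubinSize = 0;
    check(cuLinkComplete(link, &cubin, &cubinSize), "cuLinkComplete");

    // The image is owned by the link state, so load it before destroying the state.
    CUmodule mod;
    check(cuModuleLoadData(&mod, cubin), "cuModuleLoadData");
    std::printf("linked cubin size: %zu bytes\n", cubinSize);
    check(cuLinkDestroy(link), "cuLinkDestroy");

    check(cuModuleUnload(mod), "cuModuleUnload");
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return 0;
}
```

Because each PTX is compiled on its own before the link, no cross-file inlining happens on this path, which is the gap JIT LTO would be meant to close.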

@mmurphy1 Thanks for the reply - I look forward to seeing more information in the future :)

Would you also be able to shed any light on the following: Using device link-time optimization results in much larger fatbinaries

@mmurphy1 Are there any reasons why DLTO cannot achieve the same runtime performance as whole program compilation? DLTO should be able to inline and optimize all functions and thus generate the same code as whole program compilation, unless the linker does not always inline and optimize the code (perhaps because DLTO doesn’t have enough memory to perform the linking?).

DLTO should provide the same runtime performance as whole program compilation. With “partial LTO”, where some objects were not compiled with -dlto, the scope of optimization will be smaller.
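To make the full versus partial distinction concrete, here is a small sketch with a device function in one translation unit and a kernel that calls it in another. The file names, the sm_70 target, and the helper run_scale_demo are assumptions for illustration, and the build lines in the comments follow the usual pattern of passing -dlto at both compile and link time; adjust them for your toolkit and GPU.

```cuda
// Full device LTO (every object carries LTO intermediate code):
//   nvcc -dc -dlto -arch=sm_70 scale.cu main.cu
//   nvcc -dlto -arch=sm_70 scale.o main.o -o demo
// Partial LTO (scale.o built without -dlto, so scale() is outside the
// scope of link-time optimization and cannot be inlined into the kernel):
//   nvcc -dc -arch=sm_70 scale.cu
//   nvcc -dc -dlto -arch=sm_70 main.cu
//   nvcc -dlto -arch=sm_70 scale.o main.o -o demo

// ---- scale.cu (hypothetical file name) ----
// Device function defined in its own translation unit.
__device__ float scale(float x, float factor) {
    return x * factor;
}

// ---- main.cu (hypothetical file name) ----
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Defined in scale.cu; resolved at device link time.
extern __device__ float scale(float x, float factor);

// Each thread scales one element through the cross-file device call.
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i], factor);
}

// Host helper: fills 0..n-1, scales the values on the GPU, returns the result.
// Returns an empty vector if a CUDA call fails.
std::vector<float> run_scale_demo(int n, float factor) {
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return {};
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(dev, n, factor);

    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}

int main() {
    std::vector<float> out = run_scale_demo(8, 2.0f);
    for (float v : out) std::printf("%g ", v);
    std::printf("\n");
    return out.empty() ? 1 : 0;
}
```

Split the listing at the two file markers to build it in separate compilation mode. Pasted into a single file and built without -dc, the same code should also compile in whole program mode, which gives a baseline to compare against the two LTO builds.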