CUDA 12.0 Compiler Support for Runtime LTO Using nvJitLink Library

Originally published at:

CUDA Toolkit 12.0 introduces a new nvJitLink library for Just-in-Time Link Time Optimization (JIT LTO) support.

One statement in the post confuses me:
"JIT LTO minimizes the impact on binary size by enabling the cuFFT library to build LTO optimized speed-of-light (SOL) kernels for any parameter combination, at runtime. This is achieved by shipping the building blocks of FFT kernels instead of specialized FFT kernels."
Can you explain what "the building blocks of FFT kernels" means?


Thanks for the question. I am not the FFT developer, but in general what they have done is decompose their algorithm into individual pieces. Previously the library was very large because they provided all permutations of an algorithm. Now they just have a handful of building blocks which they can combine into a specific permutation at runtime. They gave a GTC talk about the work they have done which has some more details.
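To make the idea concrete, here is a rough host-side sketch in plain C++. It is not cuFFT's code and does not use nvJitLink. The function names (`bit_reverse_permute`, `radix2_stage`, `fft_compose`) are made up for illustration, and a textbook radix-2 FFT stands in for whatever decomposition cuFFT really uses. The point is only that three small pieces, combined at runtime, cover every power-of-two size and both directions without shipping one specialized routine per combination.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cplx = std::complex<double>;

// Building block 1 (hypothetical name): reorder into bit-reversed index order.
void bit_reverse_permute(std::vector<cplx>& a) {
    const std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;  // clear the run of high set bits
        j ^= bit;                             // set the next bit
        if (i < j) std::swap(a[i], a[j]);
    }
}

// Building block 2 (hypothetical name): one radix-2 butterfly stage.
// len is the butterfly span; sign is -1 for forward, +1 for inverse.
void radix2_stage(std::vector<cplx>& a, std::size_t len, double sign) {
    const double pi = std::acos(-1.0);
    const double angle = sign * 2.0 * pi / static_cast<double>(len);
    for (std::size_t start = 0; start < a.size(); start += len) {
        for (std::size_t k = 0; k < len / 2; ++k) {
            const cplx w = std::polar(1.0, angle * static_cast<double>(k));
            const cplx u = a[start + k];
            const cplx v = a[start + k + len / 2] * w;
            a[start + k] = u + v;
            a[start + k + len / 2] = u - v;
        }
    }
}

// Runtime composition: the size and direction are only known here,
// yet the same blocks serve every power-of-two size.
std::vector<cplx> fft_compose(std::vector<cplx> a, bool inverse) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // this sketch handles powers of two only
    bit_reverse_permute(a);
    const double sign = inverse ? 1.0 : -1.0;
    for (std::size_t len = 2; len <= n; len <<= 1) {
        radix2_stage(a, len, sign);
    }
    if (inverse) {
        // Building block 3: normalization, needed only for the inverse transform.
        for (cplx& x : a) x /= static_cast<double>(n);
    }
    return a;
}
```

Calling `fft_compose(x, false)` followed by `fft_compose(result, true)` should return the original data up to rounding. In the JIT LTO setting the pieces would presumably be device code linked and optimized together at runtime rather than ordinary function calls, so the composed kernel does not pay the call overhead this host sketch has.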

Looking at the cuFFTDx library (header-only C++) gives good insight into what can be considered FFT building blocks. For a more condensed view from a different angle, see the SIAM PP22 presentation (slide 10).