PTX is a virtual ISA doubling as a compiler intermediate format. All the registers used in PTX are virtual registers, and if you look at the code generated by the CUDA toolchain, you will find that it generates this code in SSA (static single assignment) form, i.e. each virtual register is written exactly once.
Allocation of physical registers is performed by the second compilation stage, which compiles PTX to SASS (machine code) for a specific GPU. This is done by ptxas, which (despite what the name may suggest) is an optimizing compiler. NVIDIA does not document the details of SASS. If you want to generate your own, you will need to reverse engineer the relevant details. Some people have done this and published some of their findings. This work needs to be repeated for each GPU architecture. In practice, companies that have gone down this path have also hired at least one compiler engineer.
Because the CUDA toolchain performs aggressive function inlining and loop unrolling, PTX can be voluminous. The largest kernels I have seen so far in real-life use were a few hundred KLOC of PTX, which would compile in a few minutes. Yours seem to be larger than that by a decimal order of magnitude. I would say that is uncharted territory.
Depending on what you do in the source code, you could try (1) reducing the size of the PTX by inhibiting loop unrolling with #pragma unroll 1 and function inlining with the __noinline__ attribute; (2) reducing the time spent optimizing the SASS by lowering the ptxas optimization level, e.g. with -Xptxas -O1 (the default is -O3); (3) using separate compilation (as has already been suggested) to parallelize compilation, although this may lead to lengthy device link times, in particular when link-time optimization is enabled. Needless to say, measures (1) and (2) will likely reduce the performance of the generated code.
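To make measure (1) concrete, here is a minimal sketch of how the two annotations might be applied. The kernel, the helper function, and all names in it (horner_step, eval_poly, eval_poly_on_gpu, kCoeff) are made up for illustration and have nothing to do with your actual code; error checking of the CUDA API calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical polynomial coefficients, highest power first:
// p(x) = x^3 + 2x^2 + 3x + 4
__constant__ float kCoeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};

// __noinline__ asks the compiler not to inline this function at its
// call sites, so its body is not replicated throughout the PTX.
__device__ __noinline__ float horner_step(float acc, float x, float coeff)
{
    return acc * x + coeff;
}

__global__ void eval_poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    // An unroll factor of 1 tells the compiler not to unroll this loop.
    #pragma unroll 1
    for (int k = 0; k < 4; ++k) {
        acc = horner_step(acc, x[i], kCoeff[k]);
    }
    y[i] = acc;
}

// Host-side helper (illustrative only, no error checking):
// copies the input to the GPU, runs the kernel, copies the result back.
std::vector<float> eval_poly_on_gpu(const std::vector<float>& x)
{
    std::vector<float> y(x.size(), 0.0f);
    int n = static_cast<int>(x.size());
    if (n == 0) return y;

    size_t bytes = x.size() * sizeof(float);
    float* d_x = nullptr;
    float* d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    eval_poly<<<blocks, threads>>>(d_x, d_y, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Assuming the file is called poly.cu (a made-up name), you could generate the PTX with nvcc -ptx poly.cu to check whether the loop and the function call survive, and try measure (2) by compiling with nvcc -Xptxas -O1 -c poly.cu. How much either measure helps will depend entirely on the structure of your real kernels.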
Other than that, you could invest in a very fast compilation platform. Based on the gcc portion of the SPEC CPU benchmarks, the fastest CPUs for compilation tasks are in the AMD EPYC 9005 series (Turin), in particular the models with an ‘F’ suffix indicating a frequency optimized CPU: 9175F, 9275F, 9375F, 9475F, 9575F. These parts were introduced in the fall of 2024; I think all models are shipping at this time. You would want large and very high throughput system memory. Since these CPUs all use a 12-channel DDR5 memory subsystem, you can simply populate all 12 channels with the fastest speed grade supported, which I think is DDR5-6000. The CUDA toolchain uses a number of intermediate files, all of which will be large given the size of the intermediate PTX code, so it seems advisable to use a PCIe4 or PCIe5 NVMe SSD for fast mass storage.
Realistically, with millions of lines of code your compilation times will continue to be lengthy even when attacking the problem from all angles, just somewhat less lengthy than before.
Depending on what kind of organization you work for, you may want to try to establish a closer engagement with NVIDIA, maybe by getting in touch with NVIDIA’s DevTech engineers at a GTC (GPU Technology Conference), or through an existing liaison your organization has with NVIDIA. Generally speaking, technology companies are excited to hear about the details of challenging use cases.