Suggestions to decrease compilation time

Hi All,

I am doing IC simulation. The circuit RTL model is turned into equivalent C code (and, or, not, … a sequence of logic operations that computes the state of the outputs from the state of the inputs). We have to perform this computation many times (a kind of fault injection analysis).
The sequence of ops might be very long (>10 billion ops, related to the IC complexity) and nvcc compilation takes a very long time (hours).
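
To give an idea, the generated code is essentially a flat sequence of Boolean operations on net values, something like this (a simplified illustration, not our actual generator output):

__global__ void eval_netlist(const unsigned *in, unsigned *out)
{
    unsigned n0 = in[0] & in[1];   // AND gate
    unsigned n1 = ~in[2];          // NOT gate
    unsigned n2 = n0 | n1;         // OR gate
    // ... billions of similar lines ...
    out[0] = n2;
}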

Any suggestion to decrease the compilation overhead would be welcome.

One option would be to generate PTX assembly code directly from the circuit model, but efficient register allocation might be cumbersome… any advice on this topic?

regards

The nvcc compiler first uses cicc to compile the kernel C code to PTX, then compiles the PTX code to SASS code with ptxas. Do you know which of the compilation steps takes the most time?
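
If you are not sure, recent CUDA toolkits can report where the time goes; something along these lines should work (the --time option appends a per-phase timing table to the given file, and -v prints each tool invocation, including cicc and ptxas):

nvcc --time phases.csv -c kernel.cu -o kernel.o
nvcc -v -c kernel.cu -o kernel.o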

A faster CPU would obviously decrease the compilation time.

You could try to split the long sequence of operations into multiple functions which could then be compiled in parallel.
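
As a rough sketch of that idea (file names hypothetical), relocatable device code lets each piece compile independently, e.g. driven by make -j, with a device link step at the end:

nvcc -dc part1.cu -o part1.o
nvcc -dc part2.cu -o part2.o
nvcc -dlink part1.o part2.o -o gpucode.o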

ptxas takes the most time. I have already split the code into multiple subsequences… It is a really huge file:

(du -h output for 1/10 of the full design)

341M /tmp/tmpxft_00113093_00000000-6_kernel.ptx

PTX is a virtual ISA doubling as a compiler intermediate format. All the registers used in PTX are virtual registers, and if you look at the code generated by the CUDA toolchain, you will find that it generates this code in SSA (static single assignment) form, i.e. each virtual register is written exactly once.
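
As an illustration (not actual compiler output), a C statement like d = (a & b) | c comes out of cicc in roughly this shape, with each virtual register written exactly once:

.reg .b32 %r<5>;
and.b32 %r3, %r1, %r2;   // t = a & b
or.b32  %r4, %r3, %r0;   // d = t | c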

Allocation of physical registers is performed by the second compilation stage, which compiles PTX to SASS (machine code) for a specific GPU. This is done by ptxas, which (despite what the name may suggest) is an optimizing compiler. NVIDIA does not document the details of SASS. If you wanted to generate your own SASS, you would need to reverse engineer the relevant details. Some people have done this and published some of their findings. This work needs to be repeated for each GPU architecture. In practice, companies that have gone down this path have also hired at least one compiler engineer.

Because the CUDA toolchain performs aggressive function inlining and loop unrolling, PTX can be voluminous. The largest kernels I have seen so far in real-life use were a few hundred KLOC of PTX, which would compile in a few minutes. Yours seems to be larger than that by a decimal order of magnitude. I would say that is uncharted territory.

Depending on what you do in the source code, you could try:

(1) reducing the size of the PTX by inhibiting loop unrolling with #pragma unroll 1 and function inlining with the __noinline__ attribute;

(2) reducing the time spent optimizing the SASS by lowering the ptxas optimization level, e.g. with -Xptxas -O1 (the default is -O3); needless to say, measures (1) and (2) will likely reduce the performance of the generated code;

(3) using separate compilation (as has already been suggested) to parallelize compilation; however, this may lead to lengthy device link times, in particular when link-time optimization is enabled.
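
As a minimal sketch of (1) and (2), with hypothetical names, assuming the generated code can be carved into out-of-line chunks:

// (1) keep the chunk out of line and keep any loops rolled
__noinline__ __device__ void eval_chunk(unsigned *state, int n)
{
    #pragma unroll 1
    for (int i = 0; i < n; ++i)
        state[i + 1] = state[i] & state[i + 1];   // placeholder logic op
}

and for (2), compile with a reduced ptxas optimization level:

nvcc -Xptxas -O1 -c kernel.cu -o kernel.o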

Other than that, you could invest in a very fast compilation platform. Based on the gcc portion of the SPEC CPU benchmarks, the fastest CPUs for compilation tasks are in the AMD EPYC 9005 series (Turin), in particular the models with an ‘F’ suffix indicating a frequency-optimized CPU: 9175F, 9275F, 9375F, 9475F, 9575F. These parts were introduced in the fall of 2024; I think all models are shipping at this time. You would want a large amount of very high-throughput system memory. Since these CPUs all use a 12-channel DDR5 memory subsystem, you can simply populate all 12 channels with the fastest speed grade supported, which I think is DDR5-6000. The CUDA toolchain uses a number of intermediate files, all of which will be large given the size of the intermediate PTX code, so it seems advisable to use a PCIe4 or PCIe5 NVMe SSD for fast mass storage.

Realistically, with millions of lines of code your compilation times will continue to be lengthy even when attacking the problem from all angles, just somewhat less lengthy than before.

Depending on what kind of organization you work for, you may want to try to establish a closer engagement with NVIDIA, maybe by getting in touch with NVIDIA’s DevTech engineers at a GTC (GPU Technology Conference), or through an existing liaison your organization has with NVIDIA. Generally speaking, technology companies are typically excited to hear the details of challenging use cases.

I suspect it’s not applicable in your case, but if you’re compiling for multiple GPU architectures, this may be worthwhile.
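
Assuming this refers to nvcc’s --threads option (my assumption), it parallelizes the per-architecture compilation steps, e.g.:

nvcc --threads 0 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -c kernel.cu

With --threads 0, nvcc uses as many threads as there are CPUs.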

Overly large kernels are typically rather slow, as the instruction caches cannot handle the program size. Your approach of using bytecode and an interpreter is probably the way to go.
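
For readers following along, a minimal sketch of what such an interpreter kernel could look like, assuming a simple three-address encoding of the logic ops (the encoding and all names are made up for illustration):

enum Op { OP_AND, OP_OR, OP_NOT };
struct Instr { unsigned op, dst, a, b; };

__global__ void interpret(const Instr *prog, int n_ops,
                          unsigned *net, int nets_per_scenario)
{
    // one fault scenario per thread; each thread owns a slice of net values
    int scen = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned *v = net + (long long)scen * nets_per_scenario;
    for (int i = 0; i < n_ops; ++i) {
        Instr ins = prog[i];
        unsigned r = 0;
        switch (ins.op) {
            case OP_AND: r = v[ins.a] & v[ins.b]; break;
            case OP_OR:  r = v[ins.a] | v[ins.b]; break;
            case OP_NOT: r = ~v[ins.a];           break;
        }
        v[ins.dst] = r;
    }
}

The circuit then becomes data rather than code, so nvcc compilation time no longer depends on its size.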

Yes, thanks for your advice. Is your work on this topic (the bytecode interpreter) publicly available?

It was @hugo32 who did previous work on bytecodes.
