Suggestions to decrease compilation time

Hi All,

I am doing IC simulation. The circuit RTL model is turned into equivalent C code (and, or, not, … a sequence of logic operations that computes the state of the outputs from the state of the inputs). We have to perform this computation many times (a kind of fault injection analysis).
The sequence of ops might be very long (>10 billion ops, related to the IC complexity) and nvcc compilation takes a very long time (hours).
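
To give an idea, the generated code is essentially a flat sequence of Boolean operations on net values, something like this (a simplified illustration, not our actual generator output):

__global__ void eval_netlist(const unsigned *in, unsigned *out)
{
    unsigned n0 = in[0] & in[1];   // AND gate
    unsigned n1 = ~in[2];          // NOT gate
    unsigned n2 = n0 | n1;         // OR gate
    // ... billions of similar lines ...
    out[0] = n2;
}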

Any suggestion to decrease the compilation overhead would be welcome.

One option would be to generate PTX assembly code directly from the circuit model, but efficient register allocation might be cumbersome… any advice on this topic?

regards

The nvcc compiler first uses cicc to compile the kernel C code to PTX, then compiles the PTX code to SASS code with ptxas. Do you know which of the compilation steps takes the most time?
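
If you are not sure, recent CUDA toolkits can report where the time goes; something along these lines should work (the --time option appends a per-phase timing table to the given file, and -v prints each tool invocation, including cicc and ptxas):

nvcc --time phases.csv -c kernel.cu -o kernel.o
nvcc -v -c kernel.cu -o kernel.o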

A faster CPU would obviously decrease the compilation time.

You could try to split the long sequence of operations into multiple functions which could then be compiled in parallel.
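
As a rough sketch of that idea (file names hypothetical), relocatable device code lets each piece compile independently, e.g. driven by make -j, with a device link step at the end:

nvcc -dc part1.cu -o part1.o
nvcc -dc part2.cu -o part2.o
nvcc -dlink part1.o part2.o -o gpucode.o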

ptxas takes the most time. I have already split the code into multiple subsequences… It is a really huge file:

(du -h output for 1/10 of the full design)

341M /tmp/tmpxft_00113093_00000000-6_kernel.ptx

PTX is a virtual ISA doubling as a compiler intermediate format. All the registers used in PTX are virtual registers, and if you look at the code generated by the CUDA toolchain, you will find that it generates this code in SSA (static single assignment) form, i.e. each virtual register is written exactly once.
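
As an illustration (not actual compiler output), a C statement like d = (a & b) | c comes out of cicc in roughly this shape, with each virtual register written exactly once:

.reg .b32 %r<5>;
and.b32 %r3, %r1, %r2;   // t = a & b
or.b32  %r4, %r3, %r0;   // d = t | c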

Allocation of physical registers is performed by the second compilation stage, which compiles PTX to SASS (machine code) for a specific GPU. This is done by ptxas, which (despite what the name may suggest) is an optimizing compiler. NVIDIA does not document the details of SASS. If you wanted to generate your own SASS, you would need to reverse engineer the relevant details. Some people have done this and published some of their findings. This work needs to be repeated for each GPU architecture. In practice, companies that have gone down this path have also hired at least one compiler engineer.

Because the CUDA toolchain performs aggressive function inlining and loop unrolling, PTX can be voluminous. The largest kernels I have seen so far in real-life use were a few hundred KLOC of PTX, which would compile in a few minutes. Yours seems to be larger than that by a decimal order of magnitude. I would say that is uncharted territory.

Depending on what you do in the source code, you could try:

(1) reducing the size of the PTX by inhibiting loop unrolling with #pragma unroll 1 and function inlining with the __noinline__ attribute;

(2) reducing the time spent optimizing the SASS by lowering the ptxas optimization level, e.g. with -Xptxas -O1 (the default is -O3); needless to say, measures (1) and (2) will likely reduce the performance of the generated code;

(3) using separate compilation (as has already been suggested) to parallelize compilation; however, this may lead to lengthy device link times, in particular when link-time optimization is enabled.
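
As a minimal sketch of (1) and (2), with hypothetical names, assuming the generated code can be carved into out-of-line chunks:

// (1) keep the chunk out of line and keep any loops rolled
__noinline__ __device__ void eval_chunk(unsigned *state, int n)
{
    #pragma unroll 1
    for (int i = 0; i < n; ++i)
        state[i + 1] = state[i] & state[i + 1];   // placeholder logic op
}

and for (2), compile with a reduced ptxas optimization level:

nvcc -Xptxas -O1 -c kernel.cu -o kernel.o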

Other than that, you could invest in a very fast compilation platform. Based on the gcc portion of the SPEC CPU benchmarks, the fastest CPUs for compilation tasks are in the AMD EPYC 9005 series (Turin), in particular the models with an ‘F’ suffix indicating a frequency-optimized CPU: 9175F, 9275F, 9375F, 9475F, 9575F. These parts were introduced in the fall of 2024; I think all models are shipping at this time. You would want a large amount of very high-throughput system memory. Since these CPUs all use a 12-channel DDR5 memory subsystem, you can simply populate all 12 channels with the fastest speed grade supported, which I think is DDR5-6000. The CUDA toolchain uses a number of intermediate files, all of which will be large given the size of the intermediate PTX code, so it seems advisable to use a PCIe4 or PCIe5 NVMe SSD for fast mass storage.

Realistically, with millions of lines of code your compilation times will continue to be lengthy even when attacking the problem from all angles, just somewhat less lengthy than before.

Depending on what kind of organization you work for, you may want to try to establish a closer engagement with NVIDIA, maybe by getting in touch with NVIDIA’s DevTech engineers at a GTC (GPU Technology Conference), or through an existing liaison your organization has with NVIDIA. Generally speaking, technology companies are typically excited to hear the details of challenging use cases.

I suspect it’s not applicable in your case, but if you’re compiling for multiple GPU architectures, this may be worthwhile.
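
Assuming this refers to nvcc’s --threads option (my assumption), it parallelizes the per-architecture compilation steps, e.g.:

nvcc --threads 0 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -c kernel.cu

With --threads 0, nvcc uses as many threads as there are CPUs.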

Overly large kernels are typically rather slow, as the instruction caches cannot handle the program size. Your approach of using bytecode and an interpreter is probably the way to go.
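
For readers following along, a minimal sketch of what such an interpreter kernel could look like, assuming a simple three-address encoding of the logic ops (the encoding and all names are made up for illustration):

enum Op { OP_AND, OP_OR, OP_NOT };
struct Instr { unsigned op, dst, a, b; };

__global__ void interpret(const Instr *prog, int n_ops,
                          unsigned *net, int nets_per_scenario)
{
    // one fault scenario per thread; each thread owns a slice of net values
    int scen = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned *v = net + (long long)scen * nets_per_scenario;
    for (int i = 0; i < n_ops; ++i) {
        Instr ins = prog[i];
        unsigned r = 0;
        switch (ins.op) {
            case OP_AND: r = v[ins.a] & v[ins.b]; break;
            case OP_OR:  r = v[ins.a] | v[ins.b]; break;
            case OP_NOT: r = ~v[ins.a];           break;
        }
        v[ins.dst] = r;
    }
}

The circuit then becomes data rather than code, so nvcc compilation time no longer depends on its size.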

Yes, thanks for your advice. Is your work on this topic (the bytecode interpreter) publicly available?

It was @hugo32 who did previous work on bytecodes.
