Pragma/Attribute to Avoid Unnecessary Template Instantiations and/or Function Analysis in the Compiler Front-End

As far as I understand from the CUDA compilation trajectory, part of the work of the compiler involves checking all functions to separate CUDA-related code (and in particular kernel invocations) from normal host code. In particular, if I understand correctly, the CUDA Front-End (cudafe++) employs source-to-source translation, for instance, to replace the kernel call syntax with the appropriate stubs for the host compiler to handle, while the proper CUDA-to-PTX compiler (cicc) will obviously also need to know which kernel it needs to compile.

For reasons that would probably be too long to explain here, I am very interested in compiling code that includes some heavy metaprogramming (using templates, constexpr functions, SFINAE and the like) to provide the desired behaviour and interfaces to classes that are meant to be used both on the host and the device, and which I can ensure include no CUDA-related calls that need to be translated (at most, and obviously only for the host, some CUDA Runtime API calls that as far as I understand should be perfectly fine to compile with the host compiler).

This does work and we do get the results we expect, but the problem is that, for a semi-realistic use case (that is still less complex than the one we are interested in), NVCC 12.4 compilation times can go up to 25 minutes, whereas compiling the same code with Clang for the GPU takes around 16 seconds. When I benchmark the stages of the compilation separately for a simpler use case that only takes about one minute on NVCC (and around 8 seconds with Clang), both cudafe++ and cicc take essentially the same time, almost all of which is spent in the front end (as reported by running with -Xcudafe --timing), suggesting that it is related to the process of figuring out which CUDA calls and/or kernels exist. Further investigation by enabling debug outputs with -Xcudafe -d1 leads me to suspect that at least some of that time may be spent dealing with several template function specializations (as evidenced by many "Finishing function body processing" printouts), even constexpr ones that are used solely at compile time for SFINAE or to provide types via decltype.
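To make the pattern concrete, here is a minimal sketch of a constexpr function template that is only ever named inside decltype and SFINAE conditions. All names (pick_tag, tag_t, describe, Particle) are hypothetical and not taken from the library. Because the helper has a deduced return type, the compiler still has to instantiate its body to learn that type, even though it is never called at run time. Whether this is what dominates the NVCC front-end time is only my suspicion.

```cpp
#include <type_traits>
#include <utility>

// All names below are hypothetical; they only illustrate the pattern.

// constexpr helper with a deduced return type. Naming it inside decltype
// still requires instantiating its body so the return type can be deduced.
template <class T>
constexpr auto pick_tag(const T&) {
    if constexpr (std::is_arithmetic_v<T>) {
        return std::integral_constant<int, 0>{};
    } else {
        return std::integral_constant<int, 1>{};
    }
}

// Type obtained purely via decltype; pick_tag is never called at run time.
template <class T>
using tag_t = decltype(pick_tag(std::declval<const T&>()));

// SFINAE-selected overloads driven by the compile-time tag.
template <class T, std::enable_if_t<tag_t<T>::value == 0, int> = 0>
constexpr int describe(const T&) { return 0; }  // arithmetic types

template <class T, std::enable_if_t<tag_t<T>::value == 1, int> = 0>
constexpr int describe(const T&) { return 1; }  // everything else

struct Particle { float pt; };

// Everything is resolved at compile time (requires C++17).
static_assert(std::is_same_v<tag_t<double>, std::integral_constant<int, 0>>);
static_assert(describe(3.0) == 0);
static_assert(describe(Particle{1.f}) == 1);
```

Each distinct T used with tag_t or describe produces a separate specialization of pick_tag, so a library built on this idiom can easily generate a very large number of function bodies for the front end to process.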

Of course, what I am trying to do would always risk straining the compiler through the sheer complexity of the metaprogramming involved, but the significant differences in compilation times between NVCC and Clang (and the reasonable compilation times that GCC and MSVC also give for equivalent CPU-only code) give me hope that some mitigation is possible. In particular: is there, or could there be, an attribute or a pragma we could use to let NVCC (or, more accurately, the front-end) know it doesn’t need to fully instantiate the class and/or parse the function, since we ensure there is nothing CUDA-related to be separated there? I am aware of the nv_exec_check_disable pragma, but as I understand this only disables certain warnings. Of course, this would always be a borderline hack meant to cater for a very particular use case, but for what I am trying to achieve it would be beyond useful to reduce at least in part the large compilation times I see with NVCC.
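For reference, this is roughly how I understand the existing pragma to be used. The sketch below is an assumption-laden illustration: HD, HostOnlyCounter, apply and demo are hypothetical names, and the HD macro is only there so that the same file also builds with a plain host compiler. As far as I know, the pragma only silences the diagnostic about calling a host-only function from a host/device function; it does not change what is instantiated or analysed.

```cpp
#include <cassert>

// Hypothetical portability macro: expands to the CUDA execution space
// specifiers under NVCC and to nothing for a plain host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// A functor whose call operator is host-only (no execution space specifier).
struct HostOnlyCounter {
    int calls = 0;
    void operator()(int) { ++calls; }
};

// Under NVCC, the pragma applies to the function that follows it and
// suppresses the warning about calling a __host__ function from a
// __host__ __device__ function. It is skipped for other compilers.
#ifdef __CUDACC__
#pragma nv_exec_check_disable
#endif
template <class F>
HD void apply(F& f, int x) {
    f(x);
}

// Host-side usage: call apply twice and report how often the functor ran.
inline int demo() {
    HostOnlyCounter counter;
    apply(counter, 1);
    apply(counter, 2);
    assert(counter.calls == 2);
    return counter.calls;
}
```

What I am asking for would be something analogous in spirit, but acting on the amount of work the front end does for a given class or function rather than on the warnings it emits.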

My apologies for wasting your time if this is simply not something that NVIDIA is willing to support. I am also regrettably unable to provide a short repro of the long compilation times without replicating a significant portion of the code I am trying to use, but I could give more details, provide a link to the repository (as it is a header-only library) or undertake any additional benchmarking of the compilation process that might be useful, if that would help.

Thanks in advance.

Probably best to file a bug. Mark the bug as an RFE (request for enhancement).

I’m fairly certain that providing an example, if possible, would help. If it’s a header-only library, perhaps you could provide a source code example that uses those headers to demonstrate the long compile time.

Thank you for your reply!

The 25+ minute compile times are seen for the semi-realistic test included with the library:

The simpler version with around 1 minute compile time would be this:

To compile it, one only has to provide the appropriate include directory that contains the library headers and subfolders and ensure support for C++17.

Of course, the repository can be found here: Nuno Dos Santos Fernandes / EDM Overhaul · GitLab

I will try to file the bug now, thank you again.