Clang vs. NVCC vs. NVRTC: which one to use?

We are currently making some decisions about our code generation. One question that came up is whether we could get rid of NVCC completely and use clang instead, just to create the PTX code. That raised a few further questions:

  1. Does this make sense at all?
  2. Does clang create the same PTX code as NVCC does?
  3. Would we be better off using NVRTC?

As a side note: we prefer tooling that can be called through an API rather than via the command line.
clang offers such an API, and NVRTC would be the alternative. So maybe NVCC is out of the game (?).

I would be glad to get some information on this.

Could you explain how you envision a third party giving useful advice without knowledge of the relevant context, i.e. your use case?

The only potentially actionable information I see is that you “prefer tooling that doesn’t need to be called via command-line but an API instead”. Which makes me wonder whether we are presented with an XY problem here. It’s preferred, but not required. What factors drive the preference?

Me personally, I have been using command-line compilers for forty years, and CUDA for 15 years, and nvcc serves me just fine. There may well be reasons for you to find it lacking in some regard, but it is not clear what exactly from the posted question. Consider adding that information.

I have not used clang to generate PTX code, but I would be extremely surprised if it spits out PTX code that is identical to that generated by nvcc. On the other hand, compiler technology has advanced to the point where it would be reasonable to assume that the generated code is quite similar.

Keep in mind, though, that PTX is an intermediate representation which gets compiled by an optimizing compiler (the ptxas component) into machine code (SASS). Given that, why would it matter how different (or not) the generated PTX is? It is not clear from the question.

@njuffa. Thanks for the feedback.

You are right, a bit more context would help.

We are running CUDA in C# with many different kernels. The kernels contain device code only and we are using the driver API.

Since C# cannot directly embed CUDA we needed to create a bridge between C# and CUDA. This bridge is created by using clang to extract the syntax tree and to create the necessary C# wrappers.

We currently use NVCC to create the PTX code. The PTX code is delivered as an embedded resource in our assemblies and loaded at runtime. The reason for PTX is that we have to support a wide variety of GPUs, ranging from a GTX 980 to an RTX 3090, but also Quadro cards etc. PTX also lets us ship without knowing about newer GPUs in advance, so we don’t need to update our applications for new GPUs, which is sufficient for us. Most of what NVCC can do is not needed by us; we only need the PTX generation.
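To make the forward-compatibility point concrete: PTX built for a given virtual architecture can be JIT-compiled by the driver for GPUs of that compute capability or newer, so a single PTX file would have to target the oldest supported GPU. The following is only a minimal sketch of that selection, not the actual build task. The type and function names are hypothetical; the compute capabilities assumed in the usage note are 5.2 for a GTX 980 and 8.6 for an RTX 3090. The resulting string has the form of the `--gpu-architecture` option accepted by NVRTC and nvcc; whether a particular toolkit version still supports a given architecture needs to be checked separately.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Compute capability of a GPU, e.g. {5, 2} for a GTX 980.
// (Hypothetical helper type, not part of any CUDA header.)
struct ComputeCapability {
    int major_version;
    int minor_version;
};

// Orders capabilities so that the oldest architecture compares lowest.
inline bool operator<(const ComputeCapability& a, const ComputeCapability& b) {
    return std::tie(a.major_version, a.minor_version) <
           std::tie(b.major_version, b.minor_version);
}

// Returns the lowest capability in the list. PTX built for it can be
// JIT-compiled by the driver for every other GPU in the list.
inline ComputeCapability lowestCapability(
        const std::vector<ComputeCapability>& gpus) {
    assert(!gpus.empty() && "need at least one target GPU");
    return *std::min_element(gpus.begin(), gpus.end());
}

// Formats the virtual-architecture option, e.g.
// "--gpu-architecture=compute_52" for capability 5.2.
inline std::string archOption(const ComputeCapability& cc) {
    return "--gpu-architecture=compute_" +
           std::to_string(cc.major_version) +
           std::to_string(cc.minor_version);
}
```

Called with the capabilities 5.2 and 8.6, archOption(lowestCapability(...)) yields "--gpu-architecture=compute_52". The trade-off is that code compiled this way cannot use PTX features introduced after the oldest target.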

The code generation is done in an MSBuild design-time task that creates the C# wrappers and also the PTX code using clang and NVCC.

All this is then delivered as a NuGet package to our internal developers. So clang is currently delivered via NuGet, but the CUDA toolkit still needs to be installed. Installing the toolkit is a problem because every developer has to do it, and since there is a wide range of toolkit versions out there, we had to pin one dedicated version for several reasons; one example was numerical problems that forced us to stay with a specific version. Another problem is the strong dependency on Visual Studio and the existence of cl.exe. This has always been a problem, since newer VS versions didn’t support older CUDA toolkits.

All these requirements led us to a solution where we deliver the necessary developer environment and prerequisites ourselves.

So our current solution aims at delivering clang (libclang and the necessary CUDA wrappers) and a small subset of the toolkit (incl. headers and NVRTC) in our required version. The MSBuild task then calls clang to create the C# wrappers, and we create the PTX code with NVRTC.

So when someone on our team wants to compile CUDA, the only thing they have to do now is include our CUDA NuGet package, and everything runs.

The question that has now come up is whether there are any differences between clang, NVRTC, and NVCC in terms of PTX generation. NVCC and NVRTC are said to use LLVM/clang internally, and I don’t know exactly what that means. Are they using clang to create the PTX, or only the syntax tree? LLVM is reportedly also able to create CUBIN, and the same is said of NVVM.

If NVCC and NVRTC simply use clang, all three should produce the same PTX.

You are right that if the PTX gets compiled at the end anyway, the differences shouldn’t matter much, but whether the optimizer “sees” potential optimizations surely depends on the generated PTX code, right?

In the end I’m only interested in whether the PTX from clang/LLVM, NVRTC, or NVCC results in the same performance. I honestly have never tried it, and I just wanted to ask whether anyone has experience with this.
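A cheap first check before any benchmarking would be to see whether two compilers emit textually identical PTX for the same kernel once comments and blank lines are ignored. The sketch below is one way to do that; the function names are hypothetical, and it makes simplifying assumptions: only `//` line comments are stripped (not `/* */` blocks), and a `//` inside a quoted string, such as a path in a `.file` directive, would be cut off as well. Differing text does not imply differing performance, since ptxas still optimizes the PTX into SASS, so this can only confirm sameness, never rule it out.

```cpp
#include <sstream>
#include <string>

// Removes '//' line comments, trailing whitespace, and blank lines so that
// compiler banners and formatting noise do not affect the comparison.
// Assumption: no '//' occurs inside quoted strings in the PTX text.
inline std::string normalizePtx(const std::string& ptx) {
    std::istringstream in(ptx);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        // Cut off everything from the first '//' to the end of the line.
        const std::string::size_type comment = line.find("//");
        if (comment != std::string::npos) {
            line.erase(comment);
        }
        // Trim trailing spaces, tabs, and a '\r' left by Windows line ends.
        const std::string::size_type last = line.find_last_not_of(" \t\r");
        if (last == std::string::npos) {
            continue;  // blank or comment-only line
        }
        line.erase(last + 1);
        out += line;
        out += '\n';
    }
    return out;
}

// True if both PTX texts are identical after normalization.
inline bool samePtxText(const std::string& a, const std::string& b) {
    return normalizePtx(a) == normalizePtx(b);
}
```

If samePtxText returns false for the output of two toolchains, the next step would be to compare the SASS or, better, to time the kernels on the actual target GPUs.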

That is a lot of good information for forum participants to base their answers on.

The CUDA compiler uses NVVM, which is an NVIDIA derivative of LLVM. When I grep through cicc.exe from CUDA 11.1 I see numerous symbols related to EDG, and none related to clang, which suggests the EDG C++ frontend is being used. To my knowledge, an alternative compilation path using clang + LLVM has been useable with CUDA since around 2015 / 2016, but I have never tried it, so I cannot speak to that.
