@njuffa. Thanks for the feedback.
You are right, a bit more context would help.
We are running CUDA in C# with many different kernels. The kernels contain device code only and we are using the driver API.
Since C# cannot directly embed CUDA we needed to create a bridge between C# and CUDA. This bridge is created by using clang to extract the syntax tree and to create the necessary C# wrappers.
We currently use NVCC to create the PTX code. The PTX is delivered as an embedded resource in our assemblies and loaded at runtime. The reason for PTX is that we have a wide variety of GPUs, ranging from a GTX 980 to an RTX 3090, but also Quadro cards etc. PTX also lets us ship without knowing newer GPUs in advance, so we don’t need to update our applications for new GPUs, which is sufficient for us. Most of what NVCC can do is not needed by us; we only need the PTX generation.
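To make the "one PTX for all GPUs" idea concrete, here is a minimal sketch of how the target could be picked: take the lowest compute capability in the fleet (5.2 for a GTX 980) and compile for the matching virtual architecture, so the driver can JIT the PTX for anything newer. The helper name and the fleet list are hypothetical; `--gpu-architecture=compute_XX` is the option form accepted by both NVCC and NVRTC.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// (major, minor) compute capability of a GPU, e.g. {5, 2} for a GTX 980.
using ComputeCapability = std::pair<int, int>;

// Hypothetical helper: builds the virtual-architecture option for the
// oldest GPU in the list. PTX compiled for that target can be JIT-compiled
// by the driver for the same or any newer architecture.
// Returns an empty string if the list is empty.
std::string lowest_virtual_arch_option(const std::vector<ComputeCapability>& fleet) {
    if (fleet.empty()) {
        return "";
    }
    // std::pair compares lexicographically: major first, then minor.
    const ComputeCapability lowest = *std::min_element(fleet.begin(), fleet.end());
    return "--gpu-architecture=compute_" + std::to_string(lowest.first) +
           std::to_string(lowest.second);
}
```

For a fleet of {8,6}, {5,2} and {7,5} this yields `--gpu-architecture=compute_52`.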
The code generation is done in an MSBuild design-time task that creates the C# wrappers and also the PTX code using clang and NVCC.
All this is then delivered as a NuGet package to our internal developers. So clang is currently delivered with NuGet, but the toolkit still needs to be installed. That installation is a problem because every developer has to do it, and since there is a wide range of toolkit versions out there, we had to pin one dedicated version for a few reasons. One example is numeric problems that forced us to stay with a specific version. Another problem is the strong dependency on Visual Studio and the existence of cl.exe. This has always been a problem, since newer VS versions didn’t support older CUDA toolkits.
All these requirements led us to a solution where we deliver the necessary developer environment and prerequisites ourselves.
So our current solution aims at delivering clang (libclang and the necessary CUDA wrappers) and a small subset of the toolkit (incl. headers and NVRTC) in our required version. The MSBuild task then calls clang to create the C# wrappers, and with NVRTC we create the PTX code.
So when someone in our team wants to compile CUDA the only thing that has to be done now is to include our cuda nuget package and everything runs.
The question that now came up is whether there are any differences between clang / NVRTC / NVCC in terms of PTX generation. NVCC and NVRTC are said to use LLVM/clang internally, and I don’t know exactly what that means. Are they using clang to create the PTX, or only the syntax tree? LLVM is reportedly also able to create CUBIN, and NVVM is said to create CUBIN as well.
If NVCC and NVRTC are simply using clang, they should produce the same PTX.
You are right that if it gets compiled at the end it shouldn’t matter, but whether the optimizer “sees” potential optimizations surely depends on the generated PTX code, right?
In the end I’m only interested in whether the PTX from clang/LLVM, NVRTC, or NVCC results in the same performance. I honestly have never tried it and just wanted to ask if anyone has experience with this.
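As a first check before timing anything, the PTX from two compilers could be compared as text. Below is a minimal sketch, assuming both PTX files have already been read into strings; the function names are hypothetical. It only strips `//` comments, surrounding whitespace, and blank lines, so header comments such as the generator banner don’t count as differences. Identical output means the driver JIT gets the same input; different output does not by itself say anything about performance, so it would still need to be measured.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reduces PTX text to its non-empty lines with
// `//` comments and leading/trailing whitespace removed.
// Limitation: `/* ... */` comments are not handled.
std::vector<std::string> normalize_ptx(const std::string& ptx) {
    std::vector<std::string> lines;
    std::istringstream in(ptx);
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing `//` comment, if any.
        const auto comment = line.find("//");
        if (comment != std::string::npos) {
            line.erase(comment);
        }
        // Trim whitespace; skip lines that end up empty.
        const auto first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) {
            continue;
        }
        const auto last = line.find_last_not_of(" \t\r");
        lines.push_back(line.substr(first, last - first + 1));
    }
    return lines;
}

// True if both PTX texts are identical after normalization.
bool same_ptx_body(const std::string& a, const std::string& b) {
    return normalize_ptx(a) == normalize_ptx(b);
}
```

The `.version` and `.target` directives are deliberately kept, since a difference there is a real difference in what the driver receives.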