Building CUDA Code with Clang

I need dynamic runtime compilation for my project, but CUDA requires Visual Studio to compile anything on Windows, and OpenCL has proven too buggy and unstable. DirectCompute isn’t portable. So… I need an alternative.

In theory, I can use the Clang/LLVM toolchain to target CUDA directly. Looking at the source, I see the NVPTX target in the LLVM source tree and CodeGenCUDA in the Clang source tree.

Thus my question: How do I configure Clang to compile code to a valid PTX kernel? Does it support things like CUDA intrinsics, unified memory, cudaMemcpyToSymbol, or textures? What are the current limitations?

Any information would be quite helpful, since the only things I can find are a few mentions in various blogs, mailing lists, and PowerPoint presentations.

It might be easier to keep using nvcc but only compile device code with it (nvcc --gpu …)?

You’d still need to provide nvcc with a C preprocessor (from Clang or gcc, since those are the supported host compilers on Mac and Linux respectively). And you’d need to use the driver API to load the compiled device code. But it still sounds a hell of a lot easier than compiling your own device toolchain from unsupported sources and weeding out all the bugs.

Is this even possible? Last time I checked, nvcc refused to do anything without cl.exe from VC being in the path. Has that been changed?

Well, I guess it would not be too hard to rename another C compiler to cl.exe, or to write a little wrapper that converts the command line arguments (which will most likely be necessary). Certainly easier than building your own compiler.
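To make the wrapper idea concrete, here is a minimal sketch in C++ of the argument-translation part. It maps a handful of common cl.exe options (/c, /E, /I, /D, /Fo, /O2, /nologo) to their gcc/Clang equivalents. Which options nvcc actually passes to the host compiler is an assumption on my part, and the function name translate_cl_args is made up for illustration; a real wrapper would also have to exec the target compiler with the translated arguments.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: translate a few cl.exe-style options into
// gcc/Clang-style options. The set of options handled here is an
// assumption, not a list of what nvcc really emits.
std::vector<std::string> translate_cl_args(const std::vector<std::string>& in) {
    std::vector<std::string> out;
    for (const std::string& arg : in) {
        // cl.exe accepts both '/' and '-' as the option prefix.
        // (This sketch assumes Windows-style input paths, which do not
        // start with '/'.)
        const bool is_option =
            arg.size() > 1 && (arg[0] == '/' || arg[0] == '-');
        if (!is_option) {
            out.push_back(arg);  // input file: pass through unchanged
            continue;
        }
        const std::string body = arg.substr(1);
        if (body == "c" || body == "E" || body == "O2") {
            out.push_back("-" + body);           // same spelling in gcc
        } else if (body == "nologo") {
            // No gcc equivalent; drop it.
        } else if (body.rfind("Fo", 0) == 0) {
            out.push_back("-o");                 // /Fofile -> -o file
            out.push_back(body.substr(2));
        } else if (body[0] == 'I' || body[0] == 'D') {
            out.push_back("-" + body);           // /Idir -> -Idir, /DX -> -DX
        }
        // Anything else (e.g. /EHsc) is skipped in this sketch.
    }
    return out;
}
```

For example, the input {"/nologo", "/c", "/Iinclude", "/DNDEBUG", "/Fokernel.obj", "kernel.cpp"} comes out as {"-c", "-Iinclude", "-DNDEBUG", "-o", "kernel.obj", "kernel.cpp"}.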

I don’t know about the Windows platform toolset, but I did just finish creating a simple C++ application using OpenCV that launches a CUDA kernel. Compiling the application is done in the following steps:

  1. Compile the main application into object files with g++
  2. Compile the kernel with nvcc into an object file
  3. Link all of the object files with g++ (including the CUDA runtime and OpenCV libraries)

I would assume that you could follow a similar workflow with a Clang/LLVM-based toolchain.
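The three build steps listed above can be sketched as a tiny build driver that assembles the command lines. This is only an illustration: the function name build_commands, the object file names, and the CUDA library directory passed in are assumptions, and the OpenCV link flags are left out because they depend on the installation. The flags themselves (-c, -o, -L, -lcudart) are the standard g++ and nvcc ones.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the three commands for the
// compile / compile / link workflow. It only builds the strings;
// running them is left to the caller.
std::vector<std::string> build_commands(const std::string& host_src,
                                        const std::string& kernel_src,
                                        const std::string& app,
                                        const std::string& cuda_lib_dir) {
    return {
        // 1. Host code to an object file with g++.
        "g++ -c " + host_src + " -o host.o",
        // 2. Kernel to an object file with nvcc.
        "nvcc -c " + kernel_src + " -o kernel.o",
        // 3. Link with g++ against the CUDA runtime (OpenCV flags omitted).
        "g++ host.o kernel.o -o " + app + " -L" + cuda_lib_dir + " -lcudart",
    };
}
```

Swapping "g++" for "clang++" in steps 1 and 3 would give the Clang-based variant of the same workflow, assuming the object files are link-compatible.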