NVCC at Runtime - End-User-Friendly Configuration: Compiling GPU code without requiring Visual Studio

What’s the best way to compile a kernel at runtime, assuming that I want to minimize what the end user must install to get it working? Is there a way to run NVCC to compile GPU-only code without requiring cl.exe?

I know that pyCUDA offers runtime compilation of kernels, but getting it installed and running is far more involved than I can reasonably expect the end user to figure out. What I need is a self-contained, user-friendly way to compile and execute user-customized kernels on the fly.

My application is a flame fractal viewer, and one of my goals is to let the user enter custom formulas and explore the fractals they produce.

You must have another compiler/linker to generate the host code: NVCC generates only the GPU code and invokes the other compiler to build the whole program. At the moment, CUDA on Windows supports only the Visual C++ compiler, so there is no way to avoid it.

You could let the users compile the code on Linux, or run a dedicated server that receives the customized code, compiles it, and sends the result back to the users.

Generate PTX and JIT?

You’ll need to ship nvcc with your application and call it from your program, but I’m not sure the EULA allows that.

You could also switch to OpenCL, which supports dynamic compilation.

Is that what this post is talking about?

That means in order to “allow the user to enter in custom formulas”, Keldor314 would have to dynamically turn the user input into PTX and insert it into the file resulting from nvcc --ptx?

OpenCL - when is it expected to be released? I registered as a developer but still have not received any response.

Don’t know; hopefully not too far off.

Some of the Microsoft SDKs come with a compiler and linker (notably lacking any GUI IDE) - but it is a version of the Visual C++ compiler.

Basically, I just want to compile a kernel - pure GPU code - so why is cl.exe needed? Running nvcc --ptx still makes it complain about cl.exe missing from the path. It would be nice to be able to invoke NVCC just to produce some PTX, without touching the CPU side at all.

What would really be nice is a cuCompileModule() function that takes a .cu file and compiles all the GPU code. We have this for DirectX shaders and for OpenCL (when it is released), so why not for CUDA?

Because the front end still depends on a number of compiler-specific options to guarantee correctness (e.g., sizeof(type) on the device == sizeof(type) on the host). OpenCL doesn’t necessarily make these guarantees because of how it handles memory, and they don’t make sense for shaders, so CUDA is a special case.

Hmm, I see. Is there a way to work around that - embed the necessary data in the host exe and have the compiler use it? As long as the compiler-specific information is known at runtime, correctness wouldn’t be a problem, assuming the correctness-dependent data was passed along.

The compiler is only used as a preprocessor if you generate device code only (i.e., with the -ptx or -cubin flag). You can bundle a preprocessor (named cl.exe) and nvcc.exe with your app and do the compilation at runtime.

What does cl.exe actually do in the preprocessing stage?


It would be possible to create a wrapper around GCC named "cl.exe" that translates Visual Studio's command-line options into GCC compiler options. One would only need to support the options used by the CUDA frontend.

A Windows-native GCC implementation is available in a package called MinGW; it does not have a lot of extra dependencies.

Do any of you happen to know which options are used?

When I compile a file with "nvcc --cubin kernels.cu" I get the following calls to the compiler:

cl -D__CUDA_ARCH__=100 -nologo -E -TP -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS "-Ic:\Program Files\CUDA\bin/../include" "-Ic:\Program Files\CUDA\bin/../include/cudart" -I. -D__CUDACC__ -C -FI cuda_runtime.h kernels.cu

cl -D__CUDA_ARCH__=100 -nologo -E -TC -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS "-Ic:\Program Files\CUDA\bin/../include" "-Ic:\Program Files\CUDA\bin/../include/cudart" -I. -D__CUDACC__ -C C:\Users\HomeUse\AppData\Local\Temp/tmpxft_00000c14_00000000-3_kernels.cudafe1.gpu

cl -D__CUDA_ARCH__=100 -nologo -E -TC -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS "-Ic:\Program Files\CUDA\bin/../include" "-Ic:\Program Files\CUDA\bin/../include/cudart" -I. -D__GNUC__ -D__CUDABE__ C:\Users\HomeUse\AppData\Local\Temp/tmpxft_00000c14_00000000-8_kernels.cudafe2.gpu