CUDA + user scripting (e.g. Lua)

I have a graphics project in mind which may not be possible yet, or may at least be difficult. I’d like to be able to use an interpreted language such as Lua to let users create their own scripts which run inside my main program/GUI (the scripts have to communicate with the main program btw). However, the catch is, their code has to use CUDA for that extra speed.

What are the kinds of possibilities open to me here if any? Here are some potential leads with my limited amount of knowledge:

1: From what I understand, a language such as C# has some kind of ‘reflection’ feature which afaik can allow scripts within programs, but I’m not sure if those parts of the code can use CUDA (or indeed how much C# supports CUDA full stop).

2: CUDA uses a compiler to create its code AFAIK. The problem I have there is that it takes a certain amount of time to compile, and I’d like an ‘instant feedback’ environment so that the user can experiment with different functions immediately. It’s still a possibility though, especially if a given user kernel is only 10 lines of code or fewer and that turns out to be quick for CUDA to compile. However, I’m not even sure whether Nvidia will allow me to include their compiler in my potentially commercial, closed source (for now) app.

3: Some kind of JIT combination with C++? This one might be completely off track since I’m unsure of how, where and when JIT can be used with what, where and how.

I’m not sure yet whether I want to use Qt, or .NET (Winforms or maybe WPF) for the GUI, though I’m not sure that has much bearing on the above problem. I’d like to also use SDL, and maybe make the app 64 bit too if possible.

Take a look at HOOMD, which does molecular simulations with CUDA. It has an integrated Python scripting language.

Python isn’t something I considered, but that seems well suited to the task like Lua, considering it’s an interpreted language. I suppose you’re saying that HOOMD does what I would want to do?

Would it be as fast though as using the standard C/C++ CUDA toolkit libraries as given by NVidia? I have heard rumours elsewhere that even using C/C++ I can allow the user to create arbitrary code/functions and for these to be executed on the GPU at runtime. Is this true? It had something to do with using the “driver API” instead of the “runtime API” whatever that entails.

Also, I’m still considering supplying the CUDA compiler with my program so that RTCG isn’t needed. What do NVidia think about this, and would arbitrary code take a while to compile (is there a fixed time cost, on top of a cost proportional to the number of lines, which would create lag)? If the user writes a 10 line piece of code and it compiles in under, say, a quarter of a second, then this will essentially feel ‘interpreted’ to the user anyway.
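One way to settle the fixed-cost question is simply to measure it. The sketch below times nvcc on an empty kernel (roughly the fixed start-up cost) and on a small kernel. The kernel text, the file names and the `time_nvcc` helper are made up for illustration; `--ptx` and `-o` are ordinary nvcc options. It returns None when nvcc is not installed or the compile fails, so treat it as a starting point rather than a benchmark.

```python
import os
import shutil
import subprocess
import tempfile
import time

# Hypothetical user kernel, roughly the "10 lines" scale discussed above.
KERNEL = """
extern "C" __global__ void user_fn(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i] + 1.0f;
}
"""

# An empty kernel approximates the fixed cost of starting the compiler.
EMPTY = 'extern "C" __global__ void user_fn() {}\n'


def time_nvcc(source, repeats=3):
    """Best wall-clock time in seconds for nvcc to turn `source` into PTX.

    Returns None if nvcc is not on PATH or the compilation fails.
    """
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        cu_path = os.path.join(tmp, "user.cu")
        ptx_path = os.path.join(tmp, "user.ptx")
        with open(cu_path, "w") as f:
            f.write(source)
        best = None
        for _ in range(repeats):
            start = time.perf_counter()
            try:
                subprocess.run([nvcc, "--ptx", cu_path, "-o", ptx_path],
                               check=True, capture_output=True)
            except (subprocess.CalledProcessError, OSError):
                return None
            elapsed = time.perf_counter() - start
            best = elapsed if best is None else min(best, elapsed)
        return best


if __name__ == "__main__":
    fixed = time_nvcc(EMPTY)
    total = time_nvcc(KERNEL)
    if fixed is None or total is None:
        print("nvcc not available; nothing measured")
    else:
        print(f"empty kernel: {fixed:.3f} s, small kernel: {total:.3f} s")
```

If the two numbers come out close, the lag is dominated by the fixed cost and the length of the user's function hardly matters.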

Everything is still very foggy in my mind about how to go about all this. I’ll try in the meantime to get HOOMD to work (it’s throwing an error at the moment when I try to boot it).

I’ve also enjoyed using PyCUDA, which supports compiling kernels on the fly:

http://mathema.tician.de/software/pycuda

Seibert, thanks too. I’ve seen PyCUDA crop up a few times in my research with all this. Even though no mention is made of user generated code, it would seem that PyCUDA supports that implicitly. Well, it must do if it allows metaprogramming.

I don’t know Python at all, but am prepared to learn it just for this. I think my main concern with using Python/PyCUDA is that, well, it’s an interpreted language, and so speed may suffer. But then maybe programming for the GPU is a different ball game, and the differences between compiled and interpreted code pale into insignificance with this new paradigm? What do you think? How would the speed of generated code with PyCUDA compare to something like, say, JIT, or where the user compiles their own C/C++ code with nvcc each time?

Speed is incredibly crucial in what I’m hoping to achieve (essentially building and then ray-tracing arbitrary user functions). Are there any disadvantages to using PyCUDA rather than the CUDA C/C++ libraries directly in a C/C++ environment?

It also makes choosing a GUI interface trickier. I had pinned my hopes on .NET or Qt, but I’m not sure if they support Python very well. Unless I somehow mix Python and C/C++ code which could be messy (I have little experience in mixing languages, but it’s an idea).

Going back to using C/C++, does anyone reading this know much about using the driver API instead of the runtime API? Apparently, this can be used to allow dynamically generated code to run on the fly, and this may very well be an option for me. See the thread below, on these very forums, which explores this idea:

http://forums.nvidia.com/index.php?showtopic=50325

If that’s doable, how would that approach compare to using PyCUDA?

I think you might misunderstand what PyCUDA provides. It consists of:

  • A Python wrapper around CUDA host functions (device initialization, memory allocation, etc).

  • A reimplementation of some parts of the numpy interface that allows common operations on arrays to be offloaded to the GPU almost transparently.

(Since you aren’t familiar with Python: Numpy is basically the standard python library for mathematical operations on n-dimensional arrays. It is similar in some respects to the toolbox provided by programs like Matlab. Numpy is based on an efficient, compiled core in C, so Python applications that use it run pretty fast because the most expensive operations don’t run in the interpreter.)

  • An interface for compiling and loading kernels at runtime that have been written in standard CUDA C (well, more like C++ these days). PyCUDA takes kernel source in a string and calls nvcc on your behalf and then loads the resulting module onto the GPU. This is the origin of the “metaprogramming”, whereby you can construct custom kernels at runtime and load them dynamically.
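To make the “metaprogramming” point concrete, here is a minimal sketch of building a kernel from a user-typed expression and handing it to PyCUDA. The template, the `user_fn` name, the token whitelist and both helper functions are my own illustration, not part of PyCUDA; the only PyCUDA calls are `pycuda.autoinit`, `SourceModule` and `get_function`. The whitelist only restricts the vocabulary of the expression; it does not guarantee the result is valid C.

```python
import re

# Kernel skeleton; the user's expression is dropped into the marked slot.
# SourceModule wraps the source in extern "C" by default, so the kernel
# can be looked up by its plain name.
KERNEL_TEMPLATE = """
__global__ void user_fn(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = %(expr)s;
    }
}
"""

# Assumed vocabulary: numbers, arithmetic, parentheses and a few names.
_TOKEN = re.compile(r"\s*(\d+(?:\.\d*)?f?|[A-Za-z_]\w*|[-+*/(),])")
_NAMES = {"x", "sinf", "cosf", "sqrtf", "expf"}


def build_kernel_source(expr):
    """Return CUDA C source for `expr`, or raise ValueError on odd input."""
    expr = expr.strip()
    pos = 0
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        token = match.group(1)
        if (token[0].isalpha() or token[0] == "_") and token not in _NAMES:
            raise ValueError("name not allowed: %s" % token)
        pos = match.end()
    return KERNEL_TEMPLATE % {"expr": expr}


def compile_with_pycuda(source):
    """Compile and load `source`; needs PyCUDA, nvcc and a CUDA device."""
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    from pycuda.compiler import SourceModule
    return SourceModule(source).get_function("user_fn")


if __name__ == "__main__":
    print(build_kernel_source("sinf(x) * x + 1.0f"))
```

Only `build_kernel_source` runs without a GPU; `compile_with_pycuda` is where nvcc actually gets invoked on your behalf.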

PyCUDA is most productive when your time critical pieces are going onto the GPU. If you are going to have significant calculation on the host and device at the same time, you probably don’t want Python. That said, I’m maybe unclear on the goal here. You just want to embed a scripting language, not write your main program in a scripting language, right? In that case, I don’t know that it matters what you do, since you can just expose to the user scripts whatever interface you want, and then call the appropriate CUDA functions on their behalf in your compiled code.
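As a rough picture of that “expose an interface, act on the script's behalf” split, here is a toy sketch. `HostApi`, `add_surface` and `run_user_script` are all hypothetical names, and Python stands in for whatever scripting language ends up embedded (in a C++ host the same role would be played by the Lua C API or an embedded interpreter). The host only records requests here; a real one would compile and launch kernels at that point. Restricting the namespace limits what names a script can see, but it is not a security sandbox.

```python
class HostApi:
    """Hypothetical interface the host program exposes to user scripts."""

    def __init__(self):
        self.surfaces = []

    def add_surface(self, expr, color=(1.0, 1.0, 1.0)):
        # A real host would validate `expr`, build a kernel from it and
        # launch it on the GPU. This sketch only records the request.
        self.surfaces.append((expr, color))


def run_user_script(script_text, api):
    """Run a user script that can see only the functions listed here."""
    namespace = {"__builtins__": {}, "add_surface": api.add_surface}
    exec(script_text, namespace)


if __name__ == "__main__":
    api = HostApi()
    run_user_script(
        'add_surface("x * x")\n'
        'add_surface("sinf(x)", color=(1.0, 0.0, 0.0))\n',
        api,
    )
    print(api.surfaces)
```

The user never touches CUDA directly: the script talks to the host, and the host decides how and when the GPU gets involved.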

I’m starting to think Python is less of a good idea here. There are Python bindings for Qt (and they are used on Linux by a variety of people), but I have no idea what the issues are for writing a GUI Python application targeted at Windows. You’re now outside my realm of experience. :)

The driver API allows runtime compilation of PTX, which is basically a virtual machine assembly language. The advantage is that you have no dependency on nvcc (which means you don’t need to worry about the issues of shipping the CUDA toolkit to your users), but code generation at runtime becomes much harder for your main program. That said, if the kind of code generation you plan to do could be achieved with some relatively simple string manipulation of precompiled PTX chunks, then I think the driver API would be the way to go.
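To illustrate the “string manipulation of PTX chunks” idea, here is a small sketch that splices one instruction into a PTX template and loads the result through the driver API (via PyCUDA's `pycuda.driver.module_from_buffer`, used here only as a convenient wrapper around the driver's module-loading call). The PTX text is hand-written in the style of nvcc output and has not been validated against any particular driver; the `.version` and `.target` lines, the `apply` kernel and the operation table are all assumptions.

```python
# PTX skeleton with one slot (__OP__) for the arithmetic instruction.
# Illustrative only: not checked against a real driver.
PTX_TEMPLATE = """
.version 6.0
.target sm_30
.address_size 64

.visible .entry apply(
    .param .u64 apply_param_0,
    .param .f32 apply_param_1,
    .param .u32 apply_param_2
)
{
    .reg .pred %p<2>;
    .reg .f32 %f<4>;
    .reg .b32 %r<6>;
    .reg .b64 %rd<5>;

    ld.param.u64 %rd1, [apply_param_0];
    ld.param.f32 %f1, [apply_param_1];
    ld.param.u32 %r1, [apply_param_2];
    mov.u32 %r2, %ctaid.x;
    mov.u32 %r3, %ntid.x;
    mov.u32 %r4, %tid.x;
    mad.lo.s32 %r5, %r2, %r3, %r4;
    setp.ge.u32 %p1, %r5, %r1;
    @%p1 bra DONE;
    cvta.to.global.u64 %rd2, %rd1;
    mul.wide.u32 %rd3, %r5, 4;
    add.s64 %rd4, %rd2, %rd3;
    ld.global.f32 %f2, [%rd4];
    __OP__ %f3, %f2, %f1;
    st.global.f32 [%rd4], %f3;
DONE:
    ret;
}
"""

# Hypothetical user-facing operations mapped to PTX instructions.
_OPS = {"scale": "mul.f32", "offset": "add.f32"}


def splice_ptx(op_name):
    """Return PTX text with the chosen instruction spliced in."""
    if op_name not in _OPS:
        raise ValueError("unknown operation: %s" % op_name)
    return PTX_TEMPLATE.replace("__OP__", _OPS[op_name])


def load_ptx(ptx_text):
    """JIT the PTX in the driver; needs PyCUDA and a CUDA device."""
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    import pycuda.driver as drv
    return drv.module_from_buffer(ptx_text.encode("ascii")).get_function("apply")


if __name__ == "__main__":
    print(splice_ptx("scale"))
```

Note that nvcc is not needed anywhere at runtime here, which is the whole attraction; the price is that anything beyond swapping a few instructions gets painful quickly.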

I wrote the Kappa framework to do what you are asking for (and more than you are asking for, at least going by what you have mentioned so far; you did not mention it being free ;) ). Take a look at psilambda.com. If you want the fastest speed (as fast as or faster than straight CUDA API calls), use the Kappa framework: it has a scheduler that lets you define the processing flow so it can schedule more concurrent kernels onto a GPU than straight CUDA can; it gives you the flexibility you are looking for; it is 100% compiled (C++), multi-processor threaded code, so it will be significantly faster than interpreted languages; and it is Driver API based with JIT. If you still need an interpreted language for some reason, the Kappa framework can interact with Perl as the same type of coequal as CUDA or OpenMP C++, and it integrates with a lot of interpreted languages (including Lua).

That was the original idea yes. I’ve heard about ‘extending’ (so Python becomes the main program which calls C through DLLs etc.) instead of ‘embedding’ and I thought that was a possibility, but in hindsight probably less so if I want to use Qt etc., and it would mean more work, as all the code’s in C/C++ at the mo.

I think this is actually a big issue for me. I can’t legally redistribute the nvcc compiler as far as I know, so either I study the ‘driver API’ idea to see how easy it would be for me to implement, or I use something like OpenCL, which will allow me to distribute their compiler.

I’m also currently looking to find a decent C/C++ compiler so that I can at least compile code to the CPU (this also might make things easier to compile to GPU code, if I have both a GPU and CPU compiler?)

Thanks for the info.

That’s actually a very good point. While NVIDIA offers JIT compilation of PTX right inside the graphics driver, both AMD and NVIDIA have to offer JIT compilation of OpenCL code in their OpenCL drivers. Since OpenCL is much higher level than PTX (looks a lot like CUDA minus the C++ features), that would be much easier for your program to generate. Moreover, it would widen your user base to basically everyone with a high performance graphics card. As long as the more limited OpenCL compute model works for your application, I think this would be the way to go.

nvcc depends to some extent on the CPU compiler, but that should be only when compiling code with the Runtime API. I’ve never tried running nvcc without a CPU compiler present, so I have no idea what potential issues there are.

One additional thing to consider: OpenCL is supposed to target either the CPU or the GPU when the appropriate drivers are present. For example, the ATI Stream SDK lets you target both multicore CPUs with SSE and AMD (ATI) GPUs. If the implementations are good, this could be better than trying to write your own multithreaded SSE CPU implementation by hand.
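For a feel of what the OpenCL route looks like, here is a minimal sketch using PyOpenCL: enumerate devices, prefer a GPU but fall back to a CPU, then let the driver compile kernel source from a string. The kernel and the helper names (`pick_device`, `list_opencl_devices`, `build_for`) are made up for illustration; the PyOpenCL calls used are `get_platforms`, `get_devices`, `device_type`, `Context` and `Program(...).build()`.

```python
# Hypothetical kernel source; in practice this string would be generated.
KERNEL_SOURCE = """
__kernel void user_fn(__global float *dst, __global const float *src, int n)
{
    int i = get_global_id(0);
    if (i < n)
        dst[i] = src[i] * src[i] + 1.0f;
}
"""


def pick_device(devices):
    """Pick from (device, kind) pairs, kind being "gpu" or "cpu".

    Prefers the first GPU, falls back to the first CPU, else None.
    """
    for wanted in ("gpu", "cpu"):
        for device, kind in devices:
            if kind == wanted:
                return device, kind
    return None


def list_opencl_devices():
    """Enumerate OpenCL devices; needs PyOpenCL and an OpenCL driver."""
    import pyopencl as cl
    found = []
    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            if dev.type & cl.device_type.GPU:
                found.append((dev, "gpu"))
            elif dev.type & cl.device_type.CPU:
                found.append((dev, "cpu"))
    return found


def build_for(device, source=KERNEL_SOURCE):
    """Ask the OpenCL driver to JIT-compile `source` for one device."""
    import pyopencl as cl
    context = cl.Context(devices=[device])
    return cl.Program(context, source).build()


if __name__ == "__main__":
    choice = pick_device([("some CPU", "cpu"), ("some GPU", "gpu")])
    print(choice)
```

The same source string goes to whichever device gets picked, so the CPU fallback comes more or less for free as long as the vendor's CPU implementation is decent.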

I wasn’t going to post anything on the forums about it until Monday (since I’m still writing the documentation and code examples), but I posted the first beta version of GPU.NET on our website last night (see the link in my sig. below). It does RTCG from .NET, so you could have your fancy GUI and you don’t have to ship nvcc or the CUDA runtime with your app. Our nVidia plugin generates PTX, so the device code is pretty much the same as what you’d get out of nvcc.

As far as scripting languages go, what about F# interactive? You could have your customers write standard F# scripts against an API exposed by your program, then it’d all get compiled to .NET and run through GPU.NET. (NOTE: I’m still working on F# support, so the beta won’t let you do this, yet.) I’m also looking into what we need to do for DLR integration so we can offer IronPython support.

Does that seem like it would solve your problem?
