Interfaces for CUDA programs and portability.

Hello,

Is there any way to write a simple interface for CUDA programs?

My programs solve partial differential equations in a fairly standard way, but from time to time we change the model slightly. This does not change the general solution method, only small parts, such as replacing a function like a*x^2/2 + b*x^5/4 with c*x^2/2 + d*x^3/3 + f*x^5/4. Making these changes is now a little tedious for me and for the other people I give the code to, and I want to write it in such a way that a user can make small changes to the code without having to worry about the rest.

The ideal way would be to input our physics problem as a symbolic function, similar to the way it is done in Maple or Mathematica, and then just run the program.

If there is a fixed number of choices, the easiest way would be to pre-build all the functions and invoke the appropriate one via function pointers.
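For instance, a minimal sketch of this pattern on the host side, written in Python with PyCUDA (a dict plays the role of the function-pointer table; the kernel names and bodies are purely illustrative placeholders):

```
import pycuda.autoinit  # creates a CUDA context
from pycuda.compiler import SourceModule

# All model variants are compiled ahead of time; the desired one is
# selected at run time, analogous to dispatch through a function pointer.
mod = SourceModule("""
__global__ void model_a(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = 0.5f * v[i] * v[i];           /* x^2/2 variant */
}
__global__ void model_b(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = 0.5f * v[i] * v[i]
                    + v[i] * v[i] * v[i] / 3.0f;    /* x^2/2 + x^3/3 variant */
}
""")
models = {name: mod.get_function(name) for name in ("model_a", "model_b")}
step = models["model_b"]  # the selection could come from a config file
```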

If the choice of function is arbitrary, you could write a small interpreter that evaluates functions based on a representation of your choice. That is bound to be quite slow, though. The advantage is that application users do not have to build any code, and do not have to have access to the CUDA tool chain.
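As a rough illustration of the interpreter idea, here is a minimal Python sketch that evaluates a postfix (RPN) representation of an expression; the representation and token names are one arbitrary choice among many:

```
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(tokens, variables):
    """Evaluate a postfix expression, e.g. c*x*x/2 encoded as
    ["c", "x", "*", "x", "*", "2", "/"], one token at a time."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        elif tok in variables:
            stack.append(variables[tok])
        else:
            stack.append(float(tok))
    return stack[0]

# One interpreted evaluation per grid point is what makes this slow.
print(eval_rpn(["c", "x", "*", "x", "*", "2", "/"], {"c": 1.0, "x": 2.0}))  # 2.0
```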

If you need a high-performance solution, you could use separate compilation for your application, place all the user-defined functions in one file, and only re-build and re-link that file. However, this requires that users have access to the appropriate CUDA tool chain.
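Sketched from Python (the file names, targets, and architecture flag are all illustrative), the rebuild-and-relink step might look like:

```
import subprocess

# Recompile only the file holding the user-editable device functions
# (-dc produces relocatable device code for separate compilation),
# then relink the application. Requires nvcc on the user's machine.
subprocess.check_call(["nvcc", "-dc", "-arch=sm_35",
                       "user_functions.cu", "-o", "user_functions.o"])
subprocess.check_call(["nvcc", "-arch=sm_35",
                       "solver_main.o", "user_functions.o", "-o", "solver"])
```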

If you are willing to dynamically (at run time) build entire kernels, rather than just functions, you could compile your kernel to PTX using your own custom-tailored “compiler”, and load and JIT the resulting PTX code. Alternatively, shell out to run NVCC on CUDA source, then load and JIT the PTX. I have seen both approaches work in real-life customer code.
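For the shell-out variant, a minimal sketch using PyCUDA (the file and kernel names are illustrative):

```
import subprocess
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as drv

# Generate or edit kernel.cu, compile it to PTX with nvcc, then let the
# driver JIT-compile the PTX at load time.
subprocess.check_call(["nvcc", "-ptx", "kernel.cu", "-o", "kernel.ptx"])
mod = drv.module_from_file("kernel.ptx")
step_kernel = mod.get_function("step_kernel")
```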

Hello,

Thank you for your reply. This sounds very complicated. In practice the number of choices is limited, and I could reduce the number of functions that need to be edited to one.

I was thinking about the possibility of having a Python or C++ interface where the user would be able to edit some kind of template, and at run time the CUDA functions would be recompiled. So the user would write the function in some ‘template’, similar to the symbolic functions. The portability only refers to allowing a user to edit simple things without worrying about anything else.
Can #define be used to define CUDA functions?

Except for the variant using a custom compiler (e.g. one based on a domain-specific language), all these approaches seem straightforward, assuming the goal is to support simple, user-definable numerical expressions.

If the number of possible functions is limited, why not build all the versions, then invoke the desired one via function pointer? If you look at the batched solver code posted to the registered developer website, it builds tens of functions from a few templates, then invokes the appropriate one based on matrix dimensions, through a function pointer.

From what I know, the dynamic CUDA compilation functionality for Python is implemented along the lines of the last paragraph in my previous post. The reason such run-time compilation needs to occur with kernel rather than function granularity is that there is a JIT compiler for PTX but not a JIT linker.

Thanks. I will check what I can do. It would be very helpful to get some pointers on where I should start; this is my first time doing something like this, and I have no formal programming training.

Not knowing anything about your application, I am not sure which way to point. Since you mentioned Python, have you had a chance to look at PyCUDA:

https://developer.nvidia.com/pycuda

I don’t have any personal experience with it, but the list of features specifically calls out this:

Enables run-time code generation (RTCG) for flexible, fast, automatically tuned codes.

I have used PyCUDA a lot, and one thing to note is that you will need nvcc to be available on the user’s machine. PyCUDA does make it very easy to take CUDA code defined in a string variable, compile it, and load it onto the GPU in just a few lines of code.
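For example, the core of it looks roughly like this (a minimal sketch; the kernel and its name are just placeholders):

```
import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# SourceModule compiles the CUDA C held in the string with nvcc
# and loads the result onto the GPU.
mod = SourceModule("""
__global__ void scale(float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= a;
}
""")
scale = mod.get_function("scale")
y = np.ones(256, dtype=np.float32)
scale(drv.InOut(y), np.float32(2.0), np.int32(y.size),
      block=(128, 1, 1), grid=(2, 1))
```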

To combine user-provided code with existing kernels I have used simple string substitution in the past, though the PyCUDA author also describes constructing functions using codepy, which I have no experience with.
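A minimal sketch of the string-substitution approach (the template, placeholder name, and expression are all illustrative):

```
import pycuda.autoinit  # creates a CUDA context
from pycuda.compiler import SourceModule

# The user edits only the expression; the surrounding kernel is fixed.
template = """
__global__ void apply_model(float *v, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { float xi = x[i]; v[i] = %(EXPR)s; }
}
"""
user_expr = "0.5f*xi*xi + xi*xi*xi/3.0f"  # e.g. x^2/2 + x^3/3
mod = SourceModule(template % {"EXPR": user_expr})
apply_model = mod.get_function("apply_model")
```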

With a little bit of care, you can also make Numba (disclaimer: I just started working for this company) do what you are describing, but with functions written in Python syntax rather than CUDA C. It isn’t as flexible as writing CUDA C directly in PyCUDA, but avoids the need for the entire CUDA C toolchain. (Numba uses NVVM to generate PTX code which the GPU driver can load and JIT compile directly.)
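A minimal Numba sketch of this (the kernel name and model expression are illustrative):

```
import numpy as np
from numba import cuda

@cuda.jit
def apply_model(v, x):
    i = cuda.grid(1)
    if i < x.size:
        xi = x[i]
        v[i] = 0.5 * xi * xi + xi * xi * xi / 3.0  # the user edits this line

x = np.linspace(-1.0, 1.0, 256)
v = np.zeros_like(x)
apply_model[2, 128](v, x)  # compiled via NVVM to PTX on first call
```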

Hello,

Thank you both for the replies. I did not give more details because I was not sure myself what I needed, but I think I have figured it out now. As background: we have time-dependent partial differential equations on a finite rectangular domain, which we solve using a straightforward spectral method. The problem is iterative, and in each iteration (time step) we have a real-space calculation, followed by forward CUFFT calls, a convolution, and inverse CUFFT calls. The problem is that each model is slightly different, and I am now maintaining several codes, which makes everything very tedious.

I plan to take a three-level approach (a rough sketch follows below):
The first level is the CUDA functions, CUFFT calls, and memcpy objects (wrappers).
The second level (Python) uses these basic objects (wrappers) to construct the solvers.
The third level is the actual running script, which would be only a few lines, doing the actual job and at the end using Python to do the analysis as well.
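As a very rough sketch of how the levels could fit together, with numpy standing in for the level-1 CUDA/CUFFT wrappers (all names and the toy model are illustrative assumptions, not the actual code):

```
import numpy as np

class SpectralSolver:
    """Level-2 sketch: one time step = real-space term, forward FFT,
    multiplication in spectral space, inverse FFT."""
    def __init__(self, nonlinear_term, k2):
        self.nonlinear_term = nonlinear_term  # the user-edited model piece
        self.k2 = k2                          # squared wavenumbers of the grid

    def step(self, u, dt):
        rhs = self.nonlinear_term(u)          # real-space calculation
        spec = np.fft.fft2(u + dt * rhs)      # forward transform
        spec *= np.exp(-self.k2 * dt)         # diffusion in spectral space
        return np.fft.ifft2(spec).real        # inverse transform

# Level 3: the run script stays a few lines.
n = 64
k = 2.0 * np.pi * np.fft.fftfreq(n)
k2 = k[:, None] ** 2 + k[None, :] ** 2
solver = SpectralSolver(lambda u: 0.5 * u * u, k2)  # the model is edited here
u = np.random.rand(n, n)
for _ in range(100):
    u = solver.step(u, 1e-3)
```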

The symbolic part is a more distant goal. In the far future a graphical interface could also be implemented on top of the second and third levels. I need this because right now I have many codes. For each (physical) model I have one code for the phase diagram and another for other purposes. I have codes for different models, some of them binary, and I need to port several more. On top of that I want to add CPU-only codes for comparison, plus MPI+FFTW and MPI+CUFFT based functions. To make things worse, some of the codes are in Fortran (CPU only).
This would be a package for me and, if it turns out well, also for other people in the field, who would be happy to need relatively little time to set up a working solver and to switch easily between different models.