pyCUDA best practice to keep C++ kernels separate from Python code

Hi all!

Sorry if this is a common beginner’s question, but I’d love to get the community’s view on how to use pyCUDA in a context where writing the kernel source (C++) code in a Python string is not viable.

Basically, my team is looking for a clean way to migrate test cases and development flows to be Python-based, while still coding kernels in C++ for inclusion into production environments that are pure C++ CUDA/DPDK further down the CI/CD pipeline. (Also, we do love to see that C++ code in an IDE, as usual…)

Ideally, I’d like kernels in separate .cu source files that are built with nvcc prior to spawning the Python 3 runtime. I haven’t found an example of this setup – tutorials and examples all seem to have pycuda.compiler.SourceModule take a string and run nvcc from within the Python process, rather than taking a path to a compiled CUDA object file.

Now, I guess the other option is to put all kernels in headers and pass them via include-path to SourceModule… Is this the preferred option, or is there some other way?

Cheers! /Dag

Hi Dag,

If your kernels and C++ are precompiled into a shared library, you can write Python bindings to the C++ functions through a variety of different technologies. For example, RAPIDS cuDF (https://github.com/rapidsai/cudf) uses Cython to bind Python to its precompiled CUDA C++ library.

There are numerous alternatives to Cython, such as ctypes (https://docs.python.org/3/library/ctypes.html), cffi (https://cffi.readthedocs.io/en/latest/), pybind11 (https://pybind11.readthedocs.io/en/stable/), etc., that can all be used to bind Python to C/C++ functions.
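As a minimal sketch of that shared-library route (the library and function names here are made up for illustration; assume the C++ side exposes an extern "C" host-side launcher that handles the device transfers internally):

import ctypes
import numpy as np

# Hypothetical libmykernels.so exposing:
#   extern "C" void multiply_them(float *dest, const float *a,
#                                 const float *b, int n);
lib = ctypes.CDLL("./libmykernels.so")
lib.multiply_them.argtypes = [
    np.ctypeslib.ndpointer(np.float32),  # dest
    np.ctypeslib.ndpointer(np.float32),  # a
    np.ctypeslib.ndpointer(np.float32),  # b
    ctypes.c_int,
]
lib.multiply_them.restype = None

a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)
dest = np.empty_like(a)
lib.multiply_them(dest, a, b, a.size)  # launcher copies to/from the device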

Does that answer your question?

Cython is familiar and makes sense on a basic level. Thanks!

It’s maybe a bit secondary, but could Cython really be used with pyCUDA??

What I had in mind was instead of:

import pycuda.autoinit  # initializes the driver and creates a context
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

we would have something like

from pycuda.compiler import BinaryModule
mod = BinaryModule("foo.cubin")
multiply_them = mod.get_function("multiply_them")

Clearly, some parts are missing here, even though the basic linkage mechanism is the same …

If this isn’t under consideration we’ll write a custom wrapper with Cython for our testcase driver. No problem!

/Dag

No, you can’t use Cython to directly load and access a cubin. You would need some additional driver API code (at a minimum, calls into the CUDA driver API library, which could probably be done using Python ctypes or Cython). That is why the previous response said “If your kernels and C++ are precompiled into a shared library”: if you create a shared library that is used by a CUDA program, by definition there is no device code linking (or, basically, exposure) across that interface.
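For illustration, a rough ctypes sketch of those driver API calls (error checking omitted; real code must check every CUresult return value):

import ctypes

cu = ctypes.CDLL("libcuda.so")          # the driver API library; on some systems "libcuda.so.1"
cu.cuInit(0)

dev = ctypes.c_int()                    # CUdevice is an int
cu.cuDeviceGet(ctypes.byref(dev), 0)

ctx = ctypes.c_void_p()
cu.cuCtxCreate(ctypes.byref(ctx), 0, dev)

mod = ctypes.c_void_p()
cu.cuModuleLoad(ctypes.byref(mod), b"foo.cubin")

func = ctypes.c_void_p()
cu.cuModuleGetFunction(ctypes.byref(func), mod, b"multiply_them")
# ...launch via cuLaunchKernel, then cuCtxDestroy(ctx) when done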

pycuda is built on the CUDA driver API. With a bit of effort you should be able to intermix calls to the driver API for anything that pycuda can’t do directly. If you do this yourself, then of course you’ll need to integrate some C++ host code to call the CUDA driver API library, because that is a C++ (host) library; there are no official Python bindings for it that I know of. However, pycuda is pretty well engineered, in my opinion, and if you poke around in it, I wouldn’t be surprised if you find, effectively, bindings for the things you need, i.e. load a cubin, extract the kernel, etc. For instance, see here:

https://documen.tician.de/pycuda/driver.html

check the “Code on the Device: Modules and Functions” section.
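A minimal sketch of that flow using pycuda’s existing bindings (assuming foo.cubin was built with nvcc -cubin for the right sm arch and the kernel was declared extern "C"):

import numpy as np
import pycuda.autoinit              # creates a context on the default device
import pycuda.driver as cuda

mod = cuda.module_from_file("foo.cubin")           # cuModuleLoad underneath
multiply_them = mod.get_function("multiply_them")  # cuModuleGetFunction

a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)
dest = np.zeros_like(a)
multiply_them(cuda.Out(dest), cuda.In(a), cuda.In(b),
              block=(400, 1, 1), grid=(1, 1))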

pycuda isn’t an NVIDIA product. It’s created and maintained by someone else. So asking “if this isn’t under consideration” on these forums might not make sense if you are looking for changes to pycuda.


Ok, sounds good. Sorry for suggesting you consider this as a feature – I found pyCUDA via NVIDIA dev pages and didn’t really think about who maintains it. /d

As Robert Crovella mentioned, it seems pycuda has a nice interface to load a cubin.

For me, the refactoring from a Python string (pycuda compilation) to nvcc compilation and cubin loading went as follows:

  • copying the code to an external cuda_kernels.cu file (note: the kernel has to be declared extern "C", since pycuda looks it up by its unmangled name):

#include <cuda_fp16.h>

extern "C" {
    __global__ void my_kernel(/* ... */)
    {
        /* ... */
    }
}

  • compiling to a cubin with nvcc, specifying the -arch=sm_XX that matches your hardware (see the NVIDIA docs); if the cubin doesn’t match the GPU, pycuda raises an error when loading it:

nvcc -Xptxas -O3,-v -arch=sm_75 -cubin ./cuda_kernels.cu

  • loading the cubin and looking up the kernel from Python:

import pycuda.driver as cuda

    def load_kernels(self, cubin_file):
        # method on our test-driver class, e.g. cubin_file = "cuda_kernels.cubin"
        mod = cuda.module_from_file(cubin_file)
        func_my_kernel = mod.get_function("my_kernel")
        return func_my_kernel
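For completeness, roughly how the loaded kernel then gets launched (my_kernel’s real argument list is elided above, so the arguments below are placeholders):

import numpy as np
import pycuda.autoinit   # creates the context that module_from_file needs
import pycuda.driver as cuda

mod = cuda.module_from_file("cuda_kernels.cubin")
my_kernel = mod.get_function("my_kernel")

# Placeholder launch: substitute my_kernel's actual parameters.
data = np.zeros(1024, dtype=np.float32)
my_kernel(cuda.InOut(data), block=(256, 1, 1), grid=(4, 1))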