pyCUDA best practice to keep C++ kernels separate from python code

Hi all!

Sorry if this is a common beginner’s question, but I’d love to get the community’s view on how to use PyCUDA in a context where writing the kernel source (C++) in a Python string is not viable.

Basically, my team is looking for a clean way to migrate test cases and development flows to be python-based, but still code kernels in C++ for inclusion into production environments that are pure C++ cuda/dpdk further down the CI/CD pipeline. (Also, we do love to see that C++ code in an IDE, as usual…)

Ideally, I’d like kernels in separate .cu source files that are built with nvcc prior to spawning the python3 runtime. Haven’t found an example with this setup – tutorials and examples all seem to have pycuda.compiler.SourceModule take a string and run nvcc from within the python process, rather than taking a path to a compiled cuda object file.

Now, I guess the other option is to put all kernels in headers and pass them via include-path to SourceModule… Is this the preferred option, or is there some other way?
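A minimal sketch of that include-path idea, assuming the kernels live in a standalone .cu file that is read from disk and handed to SourceModule unchanged. The helper names are hypothetical; `include_dirs` and `no_extern_c` are SourceModule parameters. With `no_extern_c=True` the source is not wrapped in `extern "C"`, so the kernels are assumed to be declared `extern "C"` in the file itself so that `get_function` can find them by name.

```python
from pathlib import Path


def read_kernel_source(cu_path):
    """Return the text of a .cu file kept outside the Python code."""
    return Path(cu_path).read_text(encoding="utf-8")


def compile_kernel_file(cu_path, include_dirs=()):
    """Compile a .cu file with PyCUDA (hypothetical helper).

    PyCUDA is imported lazily so that this module can be imported on a
    machine without a GPU.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    from pycuda.compiler import SourceModule

    source = read_kernel_source(cu_path)
    # no_extern_c=True passes the file through as is; the kernels are
    # assumed to carry their own extern "C" declarations.
    return SourceModule(source, include_dirs=list(include_dirs), no_extern_c=True)
```

Note that nvcc still runs inside the Python process with this approach (PyCUDA caches the result), so it keeps the kernels in real source files but does not give a build step that is separate from the Python run.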

Cheers! /Dag

Hi Dag,

If your kernels and C++ are precompiled into a shared library, you can write Python bindings to the C++ functions through a variety of different technologies. For example, RAPIDS cudf uses Cython for this.

There are numerous alternatives to Cython, such as ctypes, cffi, and pybind11, that can all be used to bind Python to C/C++ functions.
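As a rough illustration of the ctypes route, the sketch below binds a host-side launcher exported from a precompiled shared library. The function name `launch_multiply_them`, its C signature, and its return convention (0 on success) are assumptions made for this example, not part of any real library; the launcher is assumed to handle device allocation, copies, and the kernel launch internally.

```python
import ctypes


def bind_launcher(lib):
    """Declare the C signature of a hypothetical host-side launcher.

    Assumed C declaration inside the library (hypothetical):
      extern "C" int launch_multiply_them(float *dest, const float *a,
                                          const float *b, int n);
    """
    fn = lib.launch_multiply_them
    float_ptr = ctypes.POINTER(ctypes.c_float)
    fn.argtypes = [float_ptr, float_ptr, float_ptr, ctypes.c_int]
    fn.restype = ctypes.c_int  # assumed: 0 means success
    return fn


def multiply(fn, a, b):
    """Call the launcher on two equally long sequences of floats."""
    n = len(a)
    if len(b) != n:
        raise ValueError("a and b must have the same length")
    FloatArray = ctypes.c_float * n
    a_buf, b_buf, dest = FloatArray(*a), FloatArray(*b), FloatArray()
    status = fn(dest, a_buf, b_buf, n)
    if status != 0:
        raise RuntimeError(f"launcher returned error code {status}")
    return list(dest)
```

With a real library this would be used roughly as `fn = bind_launcher(ctypes.CDLL("./libkernels.so"))` followed by `multiply(fn, a, b)`, where `libkernels.so` is again a placeholder name.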

Does that answer your question?

Cython is familiar and makes sense on a basic level. Thanks!

It’s maybe a bit secondary, but could Cython really be used with pyCUDA??

What I had in mind was instead of:

import pycuda.autoinit  # creates a CUDA context on the default device
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

we would have something like

from pycuda.compiler import BinaryModule
mod = BinaryModule("foo.cubin")
multiply_them = mod.get_function("multiply_them")

Clearly, some parts are missing here, even though the basic linkage mechanism is the same …

If this isn’t under consideration we’ll write a custom wrapper with Cython for our testcase driver. No problem!


No, you can’t use Cython to directly load and access a cubin. You would need some additional driver API code (at a minimum, calls into the CUDA driver API library, which could probably be done using python ctypes/cython). That is why the previous response said “If your kernels and C++ are precompiled into a shared library”. If you create a shared library that is used in a CUDA program, by definition there is no device code linking (or, basically, exposure) over the interface.
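To make the "additional driver API code" concrete, here is a hedged sketch of loading a cubin through the CUDA driver API from Python with ctypes. It assumes a Linux system where the driver library is available as `libcuda.so.1`, and it stops at obtaining the kernel handle; launching (via `cuLaunchKernel`) and cleanup are left out. The helper names are hypothetical.

```python
import ctypes

CUDA_SUCCESS = 0


def check(status, what):
    """Raise if a driver API call did not return CUDA_SUCCESS."""
    if status != CUDA_SUCCESS:
        raise RuntimeError(f"{what} failed with CUresult {status}")


def load_cubin_function(cubin_path, kernel_name, device_ordinal=0):
    """Load a cubin and return (module, function) handles (hypothetical helper)."""
    cuda = ctypes.CDLL("libcuda.so.1")  # assumption: Linux driver library name
    check(cuda.cuInit(0), "cuInit")

    device = ctypes.c_int()
    check(cuda.cuDeviceGet(ctypes.byref(device), device_ordinal), "cuDeviceGet")

    # Use the device's primary context and make it current on this thread.
    context = ctypes.c_void_p()
    check(cuda.cuDevicePrimaryCtxRetain(ctypes.byref(context), device),
          "cuDevicePrimaryCtxRetain")
    check(cuda.cuCtxSetCurrent(context), "cuCtxSetCurrent")

    module = ctypes.c_void_p()
    check(cuda.cuModuleLoad(ctypes.byref(module), cubin_path.encode()),
          "cuModuleLoad")

    # The name must match the symbol in the cubin, so the kernel is
    # assumed to be declared extern "C" to avoid C++ name mangling.
    function = ctypes.c_void_p()
    check(cuda.cuModuleGetFunction(ctypes.byref(function), module,
                                   kernel_name.encode()),
          "cuModuleGetFunction")
    return module, function
```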

pycuda is built on the CUDA driver API. With a bit of effort you should be able to intermix calls to the driver API for anything that pycuda can’t do directly. If you do this yourself, then of course you’ll need to integrate some C++ host code (calls) to call the CUDA driver API (library), because that is a C++ (host) library. There are no python bindings for it directly that I know of. However, pycuda is pretty well engineered, in my opinion, and if you poke around in pycuda, I wouldn’t be surprised if you find what are effectively bindings for the things you need, i.e. load a cubin, extract the kernel, etc. For instance, check the “Code on the Device: Modules and Functions” section of the pycuda documentation.
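A sketch of what that could look like, assuming the kernel is built to a cubin with nvcc ahead of time and then loaded with `pycuda.driver.module_from_file`. The helper names, the file names, and the default `sm_70` architecture are assumptions for illustration; the architecture has to match the target GPU, and the kernel is assumed to be declared `extern "C"` in the .cu file so that its name is not mangled.

```python
import subprocess


def nvcc_cubin_command(cu_path, cubin_path, arch="sm_70"):
    """Build the nvcc command line that compiles a .cu file to a cubin."""
    # arch is an assumption; pick the value that matches your GPU.
    return ["nvcc", "-cubin", f"-arch={arch}", "-o", cubin_path, cu_path]


def build_cubin(cu_path, cubin_path, arch="sm_70"):
    """Run nvcc as a separate build step, before any kernels are loaded."""
    subprocess.run(nvcc_cubin_command(cu_path, cubin_path, arch), check=True)


def load_kernel(cubin_path, kernel_name):
    """Load a prebuilt cubin and look up a kernel by name.

    PyCUDA is imported lazily so that this module can be imported on a
    machine without a GPU.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.driver as cuda

    module = cuda.module_from_file(cubin_path)
    # Return the module as well so the caller keeps it referenced.
    return module, module.get_function(kernel_name)
```

The returned function is then called like one obtained from SourceModule, for example `multiply_them(cuda.Out(dest), cuda.In(a), cuda.In(b), block=(400, 1, 1), grid=(1, 1))` with float32 numpy arrays of 400 elements.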

pycuda isn’t an NVIDIA product. It’s created and maintained by someone else. So asking “if this isn’t under consideration” on these forums might not make sense, if you are looking for changes to pycuda.

Ok, sounds good. Sorry for suggesting you consider this as a feature – I found pyCUDA via NVIDIA dev pages and didn’t really think about who maintains it. /d