Combine pycuda and cupy

Hi all,
I’m trying to run some operations with PyCUDA and CuPy. Separately, both work fine, but when I use PyCUDA after CuPy, I get the following error:

pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid resource handle

Do you know how I could fix it? Here is a simplified script that reproduces the error:

import numpy as np
import cupy as cp
from scipy.signal import butter
import cusignal

import pycuda.autoinit
import pycuda.gpuarray as gpu_array
from pycuda.compiler import SourceModule


mod = SourceModule("""
__global__ void test(float *data, int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < nb_points) {
        data[i] *= 2;
    }
}
""")


# --- Module
test = mod.get_function("test")


# --- Variables
nbe = 128
nbt = 3000
n_points = nbe * nbt
data = np.random.randn(nbe * nbt).reshape(nbe, nbt)

# --- GPU shapes
b_dim = (1024, 1, 1)
dx, mx = divmod(n_points, 1024)
g_dim = ((dx + (mx > 0)), 1)

# --- Filtering using Cupy
sos = butter(5, [0.1, 0.9], btype='band', output='sos')
cp_data = cp.asarray(data)
cp_sos = cp.asarray(sos)
cp_filtered = cusignal.sosfilt(cp_sos, cp_data)
filtered = cp.asnumpy(cp_filtered)

# --- Operation using pyCuda
gpu_data = gpu_array.to_gpu(filtered.ravel().astype(np.float32))
test(gpu_data, np.int32(n_points), block=b_dim, grid=g_dim)

Thanks!

I’ve had issues with CuPy and PyCUDA compatibility in the past.
Why don’t you try a CuPy Raw Kernel?

https://docs.cupy.dev/en/stable/user_guide/kernel.html

If I understand your code correctly, you could easily use a CuPy Elementwise Kernel
https://docs.cupy.dev/en/stable/reference/generated/cupy.ElementwiseKernel.html#cupy.ElementwiseKernel

More resources

Also, you could leave cp_filtered on the GPU, run cp.ravel on it, and then pass the result to a custom kernel.
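
A minimal sketch of that suggestion, assuming the custom kernel only needs to double each element, as the test kernel in the question does. The names double_flat and double_flat_reference are hypothetical helpers, not part of CuPy or cuSignal. CuPy is imported inside the function so that the CPU reference stays usable on a machine without a GPU.

```python
import numpy as np


def double_flat(cp_filtered):
    """Flatten a CuPy array and double every element, staying on the GPU."""
    import cupy as cp  # lazy import: only needed when a GPU is available

    # Element-wise kernel: CuPy generates the indexing and launch configuration.
    double_kernel = cp.ElementwiseKernel(
        "T x",          # input argument
        "T y",          # output argument
        "y = x * 2",    # per-element operation
        "double_kernel",
    )
    flat = cp.ravel(cp_filtered).astype(cp.float32)
    return double_kernel(flat)


def double_flat_reference(filtered):
    """CPU reference with the same semantics, useful for checking results."""
    return np.ravel(filtered).astype(np.float32) * 2
```

With the arrays from the script in the question, this would be called as result = double_flat(cp_filtered), and cp.asnumpy(result) could then be compared against double_flat_reference(filtered) with np.allclose.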

Thanks for your reply!
I am using PyCUDA SourceModules instead of CuPy RawKernels because they are much faster for me… With the simple code below, I get 8 ms with PyCUDA vs 1 s with CuPy… Maybe I’m doing something wrong, though?
However, the cuSignal library is really useful for me because it already implements FIR filters, which I need on the GPU alongside my PyCUDA algorithms…

Speed tests:

import pycuda.autoinit
import pycuda.gpuarray as gpu_array
import numpy as np
import cupy as cp
import time

from pycuda.compiler import SourceModule


pycuda_mod = SourceModule("""
__global__ void test(float *data, const int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < nb_points) {
        data[i] *= 2;
    }
}
""")

cp_test = cp.RawKernel(r'''
extern "C"
__global__ void test(float *data, const int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < nb_points) {
        data[i] *= 2;
    }
}''', 'test')


SIZE = 524288
NB_THREADS_PER_BLOCKS = 1024


# --- Numpy variables
data = np.random.randn(SIZE)
b_dim = (NB_THREADS_PER_BLOCKS, 1, 1)
dx, mx = divmod(SIZE, NB_THREADS_PER_BLOCKS)
g_dim = ((dx + (mx > 0)), 1)

# --- As GPU supported types
pycuda_data = gpu_array.to_gpu(data.astype(np.float32))
cupy_data = cp.asarray(data, dtype=cp.float32)

# --- Pycuda
pycuda_test = pycuda_mod.get_function("test")
start = time.time()
pycuda_test(pycuda_data, np.int32(SIZE),
            block=b_dim, grid=g_dim)
pycuda_output = pycuda_data.get()
print(time.time() - start)

# --- Cupy
start = time.time()
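# RawKernel is called as (grid, block, args); np.int32 matches the kernel's int parameter
cp_test(g_dim, b_dim, (cupy_data, np.int32(SIZE)))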
cp_output = cp.asnumpy(cupy_data)
print(time.time() - start)

assert ((pycuda_output - cp_output) == 0).all()

(Same here: I get the same error if I swap the CuPy and PyCUDA sections.)

For your PyCUDA timing, can you include pycuda_test = pycuda_mod.get_function("test") inside/after start = time.time()?

Remember that CUDA kernels (i.e., C/C++ code) must be compiled before they can be used in Python.
Therefore, you’re going to see a performance impact on the first call. You should time the first pass separately, then execute and average a couple of passes to get a true benchmark. Check out my presentation “GPU Acceleration in Python” for best practices.
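
A rough sketch of that timing pattern, assuming a hypothetical helper named benchmark (not part of CuPy or PyCUDA). It times the first call on its own, discards a few warm-up passes, and averages the rest. The optional sync callable is there because kernel launches are asynchronous.

```python
import time


def benchmark(fn, sync=None, n_warmup=2, n_runs=10):
    """Return (first_call_seconds, mean_seconds_after_warmup) for fn()."""

    def timed_call():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the GPU so asynchronous launches are fully timed
        return time.perf_counter() - start

    first = timed_call()              # includes any one-time kernel compilation
    for _ in range(n_warmup):         # extra warm-up passes, discarded
        timed_call()
    runs = [timed_call() for _ in range(n_runs)]
    return first, sum(runs) / len(runs)
```

For the CuPy kernel in the speed test, fn could be lambda: cp_test(g_dim, b_dim, (cupy_data, np.int32(SIZE))) with sync=cp.cuda.Stream.null.synchronize; for PyCUDA, pycuda.driver.Context.synchronize plays the same role. Note that the test kernel doubles the data in place, so repeated passes change the values but not the timing.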

Hm, actually that makes sense; I thought the compilation was done when calling the RawKernel class. I averaged over a couple of passes (skipping the first ones) and indeed, it gives similar performance.
Hmm. I might update all my code to work with CuPy then, thanks! Objectively, is there any drawback to using CuPy over PyCUDA?

I’ve never run into any drawbacks with CuPy and find their support to be exceptional!

Also, if you’re more of a Python developer, you can check out CuPy’s JIT kernels.

Thanks! I have some work to do then… :D

You can also check out some of the things we did with cuSignal (example: cusignal/_spectral_cuda.py on branch-21.08 of rapidsai/cusignal on GitHub).