Combine pycuda and cupy

pecarlat · July 22, 2021, 8:16am

Hi all,
I’m trying to run some operations with PyCUDA and CuPy. Separately, both work fine, but when I try to use PyCUDA after CuPy, I get the following error:

pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid resource handle

Do you know how I could fix it? Here is simplified code that reproduces the error:

import numpy as np
import cupy as cp
from scipy.signal import butter
import cusignal

import pycuda.autoinit
import pycuda.gpuarray as gpu_array
from pycuda.compiler import SourceModule


mod = SourceModule("""
__global__ void test(float *data, int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < nb_points) {
        data[i] *= 2;
    }
}
""")


# --- Module
test = mod.get_function("test")


# --- Variables
nbe = 128
nbt = 3000
n_points = nbe * nbt
data = np.random.randn(nbe * nbt).reshape(nbe, nbt)

# --- GPU shapes
b_dim = (1024, 1, 1)
dx, mx = divmod(n_points, 1024)
g_dim = ((dx + (mx > 0)), 1)

# --- Filtering using Cupy
sos = butter(5, [0.1, 0.9], btype='band', output='sos')
cp_data = cp.asarray(data)
cp_sos = cp.asarray(sos)
cp_filtered = cusignal.sosfilt(cp_sos, cp_data)
filtered = cp.asnumpy(cp_filtered)

# --- Operation using pyCuda
gpu_data = gpu_array.to_gpu(filtered.ravel().astype(np.float32))
test(gpu_data, np.int32(n_points), block=b_dim, grid=g_dim)

Thanks!

mnicely · July 22, 2021, 1:41pm

I’ve had issues with CuPy and PyCUDA compatibility in the past.
Why don’t you try a CuPy Raw Kernel?

https://docs.cupy.dev/en/stable/user_guide/kernel.html

If I understand your code correctly, you could easily use a CuPy Elementwise Kernel
https://docs.cupy.dev/en/stable/reference/generated/cupy.ElementwiseKernel.html#cupy.ElementwiseKernel

More resources

Also, you could leave cp_filtered on the GPU, run cp.ravel on it, and then pass the result to a custom kernel.
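As a rough illustration of that idea, here is a minimal sketch that doubles every element with a CuPy ElementwiseKernel while the data stays on the GPU. The names double_on_gpu, _double_kernel and double_elements are hypothetical and not from the original code; cupy is imported lazily only so the module can be loaded on a machine without a CUDA device.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def _double_kernel():
    """Build the kernel object once and reuse it on later calls."""
    import cupy as cp  # lazy import: requires a CUDA-capable machine

    return cp.ElementwiseKernel(
        "T x",              # input: one element of numeric type T
        "T y",              # output: same type as the input
        "y = x * 2",        # operation applied to each element
        "double_elements",  # hypothetical kernel name
    )


def double_on_gpu(gpu_array):
    """Return a flattened, doubled copy of a CuPy array, computed on the GPU."""
    flat = gpu_array.ravel()  # flatten without copying back to the host
    return _double_kernel()(flat)
```

In the original script this would be called as double_on_gpu(cp_filtered), with cp.asnumpy applied only to the final result, so the filtered data never makes a round trip through host memory. CuPy chooses the launch configuration itself here, so no block or grid sizes are needed.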

pecarlat · July 22, 2021, 5:27pm

Thanks for your reply!
I am using PyCUDA SourceModules instead of CuPy RawKernels because they are much faster for me. In the simple code below, I get 8 ms with PyCUDA vs 1 s with CuPy. Maybe I’m doing something wrong, though?
However, the cuSignal library is really useful to me because it already implements FIR filters, which I need on the GPU alongside my PyCUDA algorithms.

Speed tests:

import pycuda.autoinit
import pycuda.gpuarray as gpu_array
import numpy as np
import cupy as cp
import time

from pycuda.compiler import SourceModule


pycuda_mod = SourceModule("""
__global__ void test(float *data, const int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < nb_points) {
        data[i] *= 2;
    }
}
""")

cp_test = cp.RawKernel(r'''
extern "C"
__global__ void test(float *data, const int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < nb_points) {
        data[i] *= 2;
    }
}''', 'test')


SIZE = 524288
NB_THREADS_PER_BLOCKS = 1024


# --- Numpy variables
data = np.random.randn(SIZE)
b_dim = (NB_THREADS_PER_BLOCKS, 1, 1)
dx, mx = divmod(SIZE, NB_THREADS_PER_BLOCKS)
g_dim = ((dx + (mx > 0)), 1)

# --- As GPU supported types
pycuda_data = gpu_array.to_gpu(data.astype(np.float32))
cupy_data = cp.asarray(data, dtype=cp.float32)

# --- Pycuda
pycuda_test = pycuda_mod.get_function("test")
start = time.time()
pycuda_test(pycuda_data, np.int32(SIZE),
            block=b_dim, grid=g_dim)
pycuda_output = pycuda_data.get()
print(time.time() - start)

# --- Cupy
start = time.time()
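# RawKernel is called as (grid, block, args); the size is passed as a 32-bit int to match the kernel's `int` parameter
cp_test(g_dim, b_dim, (cupy_data, np.int32(SIZE)))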
cp_output = cp.asnumpy(cupy_data)
print(time.time() - start)

assert ((pycuda_output - cp_output) == 0).all()

(Same here: I get the same error if I swap the CuPy and PyCUDA sections.)

mnicely · July 22, 2021, 5:46pm

For your PyCUDA timing, can you include pycuda_test = pycuda_mod.get_function("test") inside/after start = time.time()?

Remember that CUDA kernels (i.e., C/C++ code) must be compiled before they can be used from Python.
Therefore, you’re going to see a performance hit on the first call. You should time the first pass separately, then execute and average a few more passes to get a true benchmark. Check out my presentation “GPU Acceleration in Python” for best practices.
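A minimal sketch of that timing approach, in plain Python so it runs without a GPU: the first call is reported on its own because it may include compilation, and the later calls are averaged. The benchmark helper and its parameters (warmup, repeats, sync) are hypothetical names, not part of PyCUDA or CuPy.

```python
import time


def benchmark(fn, *args, warmup=3, repeats=10, sync=None):
    """Time fn(*args); return (first_call_seconds, mean_later_seconds).

    warmup  -- calls made before measuring; the first one is reported
               separately because it may include kernel compilation.
    repeats -- measured calls that are averaged.
    sync    -- optional zero-argument callable that blocks until the GPU
               has finished, since kernel launches are asynchronous.
    """
    if warmup < 1 or repeats < 1:
        raise ValueError("warmup and repeats must both be at least 1")

    def timed_call():
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        return time.perf_counter() - start

    warmup_times = [timed_call() for _ in range(warmup)]
    measured = [timed_call() for _ in range(repeats)]
    return warmup_times[0], sum(measured) / repeats


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs anywhere.
    first, mean = benchmark(sum, range(100_000))
    print(f"first call: {first:.6f} s, mean of later calls: {mean:.6f} s")
```

With the kernels from this thread, sync could be cp.cuda.Device().synchronize for CuPy or pycuda.autoinit.context.synchronize for PyCUDA. The device-to-host copies in the timing code above (pycuda_data.get() and cp.asnumpy) already block until the kernel has finished, so they act as a synchronization point as well.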

pecarlat · July 22, 2021, 6:21pm

Hm, actually that makes sense. I thought the compilation was done when instantiating the RawKernel class. I averaged over a couple of passes (skipping the first ones) and indeed, it gives similar performance.
Hmm. I might update all my code to make it work with CuPy then, thanks! Objectively, is there any drawback to using CuPy over PyCUDA?

mnicely · July 22, 2021, 7:02pm

I’ve never run into any drawbacks with CuPy and find their support to be exceptional!

Also, if you’re more of a Python developer you can check out CuPy’s JIT kernels.

pecarlat · July 22, 2021, 7:26pm

Thanks! I have some work to do then… :D

mnicely · July 22, 2021, 9:58pm

You can also check out some of the things we did with cuSignal (Example: cusignal/_spectral_cuda.py at branch-21.08 · rapidsai/cusignal · GitHub)
