Hi all,
I’m trying to do some operations with PyCUDA and CuPy. Separately, both work fine, but when I try to use PyCUDA after CuPy, I get the following error:
Thanks for your reply!
I am using PyCUDA SourceModule instead of CuPy RawKernel because it is much faster for me… In the simple code below, I get 8 ms with PyCUDA vs 1 s with CuPy… Maybe I’m doing something wrong though…?
However, the cusignal library is really useful for me because it already implements FIR filters, which I need on the GPU alongside my PyCUDA algorithms…
Speed tests:
import pycuda.autoinit
import pycuda.gpuarray as gpu_array
import numpy as np
import cupy as cp
import time
from pycuda.compiler import SourceModule
pycuda_mod = SourceModule("""
__global__ void test(float *data, const int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < nb_points) {
        data[i] *= 2;
    }
}
""")
cp_test = cp.RawKernel(r'''
extern "C" __global__ void test(float *data, const int nb_points)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < nb_points) {
        data[i] *= 2;
    }
}''', 'test')
SIZE = 524288
NB_THREADS_PER_BLOCKS = 1024
# --- Numpy variables
data = np.random.randn(SIZE)
b_dim = (NB_THREADS_PER_BLOCKS, 1, 1)
dx, mx = divmod(SIZE, NB_THREADS_PER_BLOCKS)
g_dim = ((dx + (mx > 0)), 1)
# --- As GPU supported types
pycuda_data = gpu_array.to_gpu(data.astype(np.float32))
cupy_data = cp.asarray(data, dtype=cp.float32)
# --- Pycuda
pycuda_test = pycuda_mod.get_function("test")
start = time.time()
pycuda_test(pycuda_data, np.int32(SIZE),
block=b_dim, grid=g_dim)
pycuda_output = pycuda_data.get()
print(time.time() - start)
# --- Cupy
start = time.time()
cp_test(g_dim, b_dim, (cupy_data, np.int32(SIZE)))  # RawKernel takes (grid, block, args)
cp_output = cp.asnumpy(cupy_data)
print(time.time() - start)
assert ((pycuda_output - cp_output) == 0).all()
(and same here: I get the same error if I swap the CuPy / PyCUDA sections)
For your PyCUDA timing, can you move pycuda_test = pycuda_mod.get_function("test") to after start = time.time()?
Remember that CUDA kernels (i.e., C/C++ code) must be compiled before they can be used from Python.
Therefore, you’re going to see a performance impact on the first call. You shouldn’t time only the first pass; execute and average a couple of passes to get a true benchmark. Check out my presentation “GPU Acceleration in Python” for best practices.
Hm, actually that makes sense; I thought the compilation was done when the RawKernel class was instantiated. I did the average over a couple of passes (skipping the first ones), and indeed it gives similar performance.
Hmm. I might update all my code to make it work with CuPy then, thanks! Objectively, are there any drawbacks to using CuPy over PyCUDA?