Passing a scalar to kernel functions (cupy & pycuda) - scalar multiplication of a vector

Hi,

I just started learning CUDA and I use cupy/pycuda to interface with it.

I am working on a simple example: scalar multiplication of a vector.

.cu code:

#define _I ( threadIdx.x + blockIdx.x * blockDim.x )
extern "C" __global__
        void scalar_multiply_kernel(float *vec, float scalar)
        {
                int i = _I;
                vec[i] = scalar * vec[i];
        }       

In pycuda this works fine with:

scalar_multiply_gpu(testvec_gpu, np.float32(2), block=(1024, 1, 1), grid=(int(N/1024)+1, 1, 1))

Where N is the length of the vector.

In cupy this approach does not seem to work, and I would like to understand why.

Somehow the scalar doesn't get passed in cupy. I have to convert the scalar to a length-1 array and use a pointer in the CUDA code:

in python:

grid = (int(N/1024)+1, 1, 1)
block = (1024, 1, 1)
args = (testvec_gpu, cp.asarray([2.0]).astype(cp.float32))
scalar_multiply_gpu(grid, block, args=args)

in cu file:

#define _I ( threadIdx.x + blockIdx.x * blockDim.x )
extern "C" __global__
        void scalar_multiply_kernel(float *vec, float *scalar)
        {
                int i = _I;
                vec[i] = scalar[0] * vec[i];
        }

I’m sorry if this question seems trivial, but if somebody has a good explanation, it will help me a lot.

Thanks

You haven’t provided a complete example, so I don’t know all the errors you may be making. However, your kernel is certainly broken for the way you are sizing your grid: it can make out-of-bounds accesses. (That is true for both the pycuda and cupy versions.) Furthermore, there isn’t any reason in cupy that you can’t pass a scalar kernel argument.
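To make that concrete, here is a quick back-of-the-envelope illustration (my own numbers, using the N = 2048 from the example below): with 1024-thread blocks, int(N/1024)+1 launches more threads than there are vector elements, and every extra thread indexes past the end of vec unless the kernel guards against it.

N = 2048
blocks = int(N / 1024) + 1        # 3 blocks
threads_launched = blocks * 1024  # 3072 threads
print(threads_launched - N)       # 1024 threads would index past the end of vec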

Following the guide here, the following works for me:

$ cat t67.py
import numpy as np
import cupy as cp

scalar_multiply_gpu = cp.RawKernel(r'''
extern "C" __global__
void scalar_multiply_kernel(float *vec, float scalar, int N)
    {
      int i = threadIdx.x+blockDim.x*blockIdx.x;
      if (i < N)
        vec[i] = scalar * vec[i];
    }
''', 'scalar_multiply_kernel')

N = np.int32(2048)
testvec_gpu = cp.ones((N,),dtype=cp.float32)
grid = (int(N/1024)+1, 1, 1)
block = (1024, 1, 1)
scalar_multiply_gpu(grid, block, (testvec_gpu, np.float32(2.0), N))
print(cp.asnumpy(testvec_gpu))
$ python t67.py
[2. 2. 2. ... 2. 2. 2.]
$
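As a side note (my own adjustment, not something the question asked about): once the kernel has the bounds check, you can size the grid with ceiling division, so that an N which is an exact multiple of the block size doesn’t launch a whole extra block:

N = 2048
block = (1024, 1, 1)
grid = ((N + block[0] - 1) // block[0], 1, 1)  # ceil(N / 1024) -> 2 blocks instead of 3
print(grid)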

Many thanks, it works for me now as well.

Would it be better to define the block/grid like this?

grid = (N, 1, 1)
block = (1, 1, 1)

No, not from a performance perspective (although you can still get the correct answer that way). Understanding why requires some coherent exposure to CUDA programming; in short, the GPU executes threads in groups (warps) of 32, so one-thread blocks leave most of each warp idle. Many questions like this have already been asked. If you take the first 3 or 4 sessions here, you will understand the reasons better.
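If you want to see the difference on your own GPU, here is a rough timing sketch (my addition, not from the course material; it re-declares the RawKernel from the example above so the snippet is self-contained, and times each launch with CUDA events; the exact numbers will depend on your hardware):

import numpy as np
import cupy as cp

scalar_multiply_gpu = cp.RawKernel(r'''
extern "C" __global__
void scalar_multiply_kernel(float *vec, float scalar, int N)
    {
      int i = threadIdx.x+blockDim.x*blockIdx.x;
      if (i < N)
        vec[i] = scalar * vec[i];
    }
''', 'scalar_multiply_kernel')

def time_launch(grid, block, vec, N):
    # time one kernel launch with CUDA events; returns milliseconds
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record()
    scalar_multiply_gpu(grid, block, (vec, np.float32(2.0), N))
    end.record()
    end.synchronize()
    return cp.cuda.get_elapsed_time(start, end)

N = np.int32(1 << 20)
vec = cp.ones((int(N),), dtype=cp.float32)
# warm-up launch (the first call includes JIT compilation)
time_launch((int(N / 1024) + 1, 1, 1), (1024, 1, 1), vec, N)
t_wide = time_launch((int(N / 1024) + 1, 1, 1), (1024, 1, 1), vec, N)  # 1024 threads per block
t_thin = time_launch((int(N), 1, 1), (1, 1, 1), vec, N)                # 1 thread per block
print(t_wide, t_thin)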

Thank you very very much! This is exactly what I need to study!