Trying to access my NVIDIA GeForce GTX 1650 SUPER GPU via python

I found some simple code on the internet, but I keep getting warnings that seem to be about the sizes of the arrays. I’ve tried a number of different sizes, but I always get the warning. I’m not sure what I’m doing wrong. I’ve attached the output from the Nsight Compute screens for my graphics card, but I can’t figure out what to put for the array sizes. Can someone help me with this or point me to some training? I’m very new to CUDA.

from __future__ import division
from numba import cuda
import numpy
import math


# CUDA kernel
@cuda.jit
def my_kernel(io_array):
    pos = cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2   # do the computation


# Host code
data = numpy.ones(2048)
threadsperblock = 16
blockspergrid = math.ceil(data.shape[0] / threadsperblock)  # enough blocks to cover every element
my_kernel[blockspergrid, threadsperblock](data)
print(data)

The output is this:

[2. 2. 2. … 2. 2. 2.]
…anaconda3\envs\Nvidia\lib\site-packages\numba\cuda\cudadrv\devicearray.py:885: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
warn(NumbaPerformanceWarning(msg))

Process finished with exit code 0

I assume you are referring to this warning:

That isn’t really predicated on the size of the array in question.

You’re not doing anything wrong, in my opinion. Your definition here:

data = numpy.ones(2048)

is a host-based numpy array. As you have demonstrated, you can certainly use that as “input” to CUDA device code in numba CUDA. I always considered that a “feature” (i.e., a convenience that is nice to have) of numba CUDA.

In CUDA C++, traditionally, such a thing doesn’t work. It’s illegal. You must explicitly copy the data to and from the GPU. Naturally, this copying takes time; it’s not “free”.

What numba CUDA does under the hood for you is copy that numpy array to a device memory array that is accessible to CUDA device code (kernel code). On the other end (after the kernel is finished), numba also copies that data back to the original numpy array on the host. This makes all your host code work the way a Python programmer would typically expect, even one generally unaware of the asynchronous nature of GPU computing.

However, it does incur that copy cost. People who don’t understand any of this might assume that copying data is “free”, and then wonder why a trivial kernel takes so long to execute (in numba CUDA). It takes a long time to execute because, in numba CUDA, when using host numpy arrays for input/output, the kernel launch is burdened with the additional cost of data copying.

That is the reason for the “performance warning”: to make it really obvious to people who are unaware of the mechanism that what they are doing is not “free”.
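If you want to see that copy cost directly, here is a rough timing sketch (just an illustration I put together, not part of your original code; the 2**24 array size and 256 threads per block are arbitrary values chosen to make the copies measurable). The launch that is handed the host array should take noticeably longer than the launch that is handed an explicit device array:

from numba import cuda
import numpy
import math
import time

@cuda.jit
def my_kernel(io_array):
    pos = cuda.grid(1)
    if pos < io_array.size:
        io_array[pos] *= 2

data = numpy.ones(2**24)   # a larger array, so the copy time is visible
threadsperblock = 256
blockspergrid = math.ceil(data.shape[0] / threadsperblock)

# warm-up launch so JIT compilation time isn't counted in the timings below
my_kernel[blockspergrid, threadsperblock](cuda.to_device(data))
cuda.synchronize()

# launch with the host array: numba copies it to the device and back each call
t0 = time.perf_counter()
my_kernel[blockspergrid, threadsperblock](data)
cuda.synchronize()
print("host array launch:  ", time.perf_counter() - t0)

# launch with an explicit device array: no implicit copies are involved
d_data = cuda.to_device(data)
t0 = time.perf_counter()
my_kernel[blockspergrid, threadsperblock](d_data)
cuda.synchronize()
print("device array launch:", time.perf_counter() - t0)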

Numba didn’t always warn this way. Somewhere along the way, perhaps somebody got tired of people posting questions asking why their kernels were taking so long to execute. As a result of trying to address that, we now have questions asking what the warning means. So you can’t win, really. But maybe it’s better to be really obvious and verbose about everything.

I’m sure one method to eliminate this warning is to manually/explicitly do the copying yourself.

The general necessity to copy data to and from a CUDA GPU is covered in the first few sections of this online training series, although that series has CUDA C++ in view. You can find numba CUDA online tutorials with a bit of searching. Here is one.

Also, when posting code here on this forum, it’s preferred that you format it correctly. One simple method to do that is, when you are editing your post (click the pencil icon below your post), select the code, then click the </> button at the top of the edit window, then save your changes. I’ve done it for you this time.

Thanks. That helped a lot. One small change fixed that issue. I now coded this:

my_kernel[blockspergrid, threadsperblock](cuda.to_device(data))

While that change will indeed eliminate the warning, that change by itself will result in a program that prints:

[1. 1. 1. … 1. 1. 1.]

not

[2. 2. 2. … 2. 2. 2.]

You have effectively lost the calculation result of the kernel. So that wouldn’t be my recommendation, but to each their own.

One alternative, if you wanted to see the printout consisting of 2’s, would be as follows:

d_data = cuda.to_device(data)
my_kernel[blockspergrid, threadsperblock](d_data)
print(d_data.copy_to_host())

To closely mimic the original program, if you wanted to have the results end up in data, you could do:

d_data = cuda.to_device(data)
my_kernel[blockspergrid, threadsperblock](d_data)
d_data.copy_to_host(data)
print(data)

And now you have a pretty good definition of what numba is doing “under the hood” in the original case.

Yep. I messed up. Thanks for pointing that out to me. Obviously I’m printing the host version, not the device version, which was modified. Thanks again. I have a lot to learn. I’ll check out that training. And thanks for the code fix. I appreciate that.