CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE in Python

I get this error when I run the code below in Python using CUDA. I’m following this tutorial, but I’m trying it on a Windows 7 x64 machine.

I ran check_cuda() and all tests passed. Can anyone help me identify the exact issue here?

My Code:

import numpy as np
from timeit import default_timer as timer
from numbapro import vectorize, cuda

@vectorize(['float64(float64, float64)'], target='gpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float64)
    B = np.ones(N, dtype=np.float64)
    C = np.zeros(N, dtype=np.float64)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))

    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()

Error Message:

---------------------------------------------------------------------------
CudaAPIError                              Traceback (most recent call last)
<ipython-input-18-2436fc2ab63a> in <module>()
      1 if __name__ == '__main__':
----> 2     main()

<ipython-input-17-64de53fdbe77> in main()
      7 
      8     start = timer()
----> 9     C = VectorAdd(A, B)
     10     vectoradd_time = timer() - start
     11 

C:\Anaconda2\lib\site-packages\numba\cuda\dispatcher.pyc in __call__(self, *args, **kws)
     93                       the input arguments.
     94         """
---> 95         return CUDAUFuncMechanism.call(self.functions, args, kws)
     96 
     97     def reduce(self, arg, stream=0):

C:\Anaconda2\lib\site-packages\numba\npyufunc\deviceufunc.pyc in call(cls, typemap, args, kws)
    297 
    298             devarys.extend([devout])
--> 299             cr.launch(func, shape[0], stream, devarys)
    300 
    301             if any_device:

C:\Anaconda2\lib\site-packages\numba\cuda\dispatcher.pyc in launch(self, func, count, stream, args)
    202 
    203     def launch(self, func, count, stream, args):
--> 204         func.forall(count, stream=stream)(*args)
    205 
    206     def is_device_array(self, obj):

C:\Anaconda2\lib\site-packages\numba\cuda\compiler.pyc in __call__(self, *args)
    193 
    194         return kernel.configure(blkct, tpb, stream=self.stream,
--> 195                                 sharedmem=self.sharedmem)(*args)
    196 
    197 class CUDAKernelBase(object):

C:\Anaconda2\lib\site-packages\numba\cuda\compiler.pyc in __call__(self, *args, **kwargs)
    357                           blockdim=self.blockdim,
    358                           stream=self.stream,
--> 359                           sharedmem=self.sharedmem)
    360 
    361     def bind(self):

C:\Anaconda2\lib\site-packages\numba\cuda\compiler.pyc in _kernel_call(self, args, griddim, blockdim, stream, sharedmem)
    431                                    sharedmem=sharedmem)
    432         # Invoke kernel
--> 433         cu_func(*kernelargs)
    434 
    435         if self.debug:

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in __call__(self, *args)
   1114 
   1115         launch_kernel(self.handle, self.griddim, self.blockdim,
-> 1116                       self.sharedmem, streamhandle, args)
   1117 
   1118     @property

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in launch_kernel(cufunc_handle, griddim, blockdim, sharedmem, hstream, args)
   1158                           hstream,
   1159                           params,
-> 1160                           None)
   1161 
   1162 

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in safe_cuda_api_call(*args)
    220         def safe_cuda_api_call(*args):
    221             retcode = libfn(*args)
--> 222             self._check_error(fname, retcode)
    223 
    224         setattr(self, fname, safe_cuda_api_call)

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in _check_error(self, fname, retcode)
    250             errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
    251             msg = "Call to %s results in %s" % (fname, errname)
--> 252             raise CudaAPIError(retcode, msg)
    253 
    254     def get_device(self, devnum=0):

CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

It may be that you’re not compiling for the correct GPU type.

If the code starts working when you change N to 32000:

N = 32000

it would tend to confirm this theory.

What GPU do you have? (check_cuda should report this.)

Thanks a lot txbob! It works!!!

But I have a follow-up question: when I run the program, the vectorize implementation takes more execution time than the CPU version. Is that because my machine takes more time to bring the GPU-compiled binaries to RAM and then to the CPU?

FYI, I have a Quadro 4000 CUDA device.

Not every possible thing you could imagine will run faster on the GPU than on the CPU. This topic is covered extensively on other questions across a great many forums, such as this one and stackoverflow, so I’ll be brief. A necessary condition for GPU acceleration to be advantageous is that the benefit of doing work on the GPU exceeds the cost of getting the data there.

Unpacking this statement for a specific case involves various considerations. In this particular case, vector add is a commonly used example for learning purposes, but by itself does not usually satisfy the previously stated necessary condition. The work intensity being done on the GPU is not sufficient to make the GPU run dramatically faster than the CPU, and the GPU benefit in this case (if any) is not large enough to outweigh the cost of transferring the various vectors to/from the device.

If what you intend to do on the GPU is a single vector add and nothing else, you should almost certainly not waste your time with GPU programming.
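To put rough numbers on that condition for the vector add in this thread, here is a small sketch that counts the bytes crossing the bus against the arithmetic done. It assumes both inputs are copied to the device and the result is copied back, which is what should happen when host NumPy arrays are passed to the ufunc. The helper name vector_add_traffic is made up for illustration.

```python
import numpy as np

def vector_add_traffic(n, dtype=np.float64):
    """Bytes moved and additions performed for C = A + B on n elements.

    Assumption: A and B go host -> device and C comes back device -> host.
    """
    itemsize = np.dtype(dtype).itemsize
    bytes_to_device = 2 * n * itemsize   # A and B
    bytes_to_host = n * itemsize         # C
    additions = n                        # one add per element
    return bytes_to_device + bytes_to_host, additions

total_bytes, additions = vector_add_traffic(32000000)
print(total_bytes)               # 768000000
print(total_bytes // additions)  # 24 bytes transferred per addition
```

With 24 bytes moved for every single addition, the transfer is likely to dominate however fast the GPU performs the adds.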

I understand your explanation, and this is my first time in this forum, by the way. The purpose of trying this is to see how well this technology fits training machine learning algorithms, in particular neural nets in my case, not just adding vectors of course. Anyway, thanks for your help!

“It may be that you’re not compiling for the correct GPU type.”

I am facing the same issue: for lower values of N my code runs, but as I increase N the program crashes. I use an Nvidia Quadro 4000 GPU. Can you please elaborate on “compiling for correct GPU type”? Do I have to specify some command while compiling?
I use Python and run my code with

nvprof python myCode.py

Quadro 4000 is a compute capability 2.0 GPU, so the maximum number of blocks in one grid dimension for such an operation would be 65535.

If you are using a recent version of CUDA such as CUDA 7, 7.5 or CUDA 8, the default compilation is for cc 2.0 so there is nothing you need to change.

Instead you will need to modify your code to take this limit into account. For example, in your case, on a cc 2.0 device where you want to handle a vector larger than 32000 or 65535, you could rewrite your vectorize routine using guvectorize:

http://numba.pydata.org/numba-doc/0.23.1/cuda/ufunc.html

Note that numba is not an NVIDIA product; it is published by continuum.io

In my case, at N=60000000 the program runs fine, but for N=70000000 it returns the following error:
(I am using 1024 threads per block)

raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

As per your answer, if I rewrite my routine using guvectorize then this will be resolved. Am I right?

I also read that pyCuda can be used to run algorithms on the GPU. Which one should I use, Numba or pyCuda, to make my algorithm parallel? Is there any major difference between the two? Which one provides more functionality on the GPU?
My understanding is that we can pass only numpy arrays to the GPU when using Numba. If we use pyCuda, can it take other inputs as well, like lists?

Which one should I opt for?

Yes, 60000000/1024 (about 58594 blocks) is less than the 65535 limit, but 70000000/1024 (about 68360 blocks) is greater than the 65535 limit.
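The arithmetic behind that comparison can be checked with a few lines of plain Python. This is only a sketch: it assumes a one-dimensional launch with one element per thread, and the names MAX_BLOCKS_X and blocks_needed are illustrative, not part of any library.

```python
MAX_BLOCKS_X = 65535  # grid limit quoted in this thread for compute capability 2.0

def blocks_needed(n, threads_per_block):
    # Ceiling division: enough blocks so that every element gets a thread
    return (n + threads_per_block - 1) // threads_per_block

for n in (60000000, 70000000):
    blocks = blocks_needed(n, 1024)
    print(n, blocks, blocks <= MAX_BLOCKS_X)
# 60000000 58594 True
# 70000000 68360 False

# Largest vector that still fits under the limit with 1024 threads per block
print(MAX_BLOCKS_X * 1024)  # 67107840
```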

I think you should be able to use guvectorize to work around this limit in this case, by having each thread perform the vector add for multiple elements of your vector. You could also just write a Numba CUDA kernel using Numba's cuda.jit capability.
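The idea of one thread handling several elements can be sketched on the CPU before touching the GPU. The code below is an emulation in NumPy, not a real kernel: each simulated thread starts at its own index and steps forward by the total number of threads. In a cuda.jit kernel the starting index and the step would presumably come from cuda.grid(1) and cuda.gridsize(1); the function names here are hypothetical.

```python
import numpy as np

def strided_add_thread(a, b, out, thread_id, total_threads):
    # Work done by ONE thread: elements thread_id, thread_id + total_threads, ...
    for i in range(thread_id, out.size, total_threads):
        out[i] = a[i] + b[i]

def emulate_launch(a, b, blocks, threads_per_block):
    # CPU stand-in for a kernel launch: run every "thread" one after another
    out = np.zeros_like(a)
    total_threads = blocks * threads_per_block
    for thread_id in range(total_threads):
        strided_add_thread(a, b, out, thread_id, total_threads)
    return out

a = np.ones(1000)
b = np.ones(1000)
# 4 blocks x 32 threads = 128 threads covering 1000 elements
c = emulate_launch(a, b, blocks=4, threads_per_block=32)
print(c.sum())  # 2000.0
```

Because the loop covers the whole vector regardless of the thread count, the number of blocks can be chosen to stay under the grid limit instead of being dictated by N.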

Numba allows you to write the kernel code in Python.

PyCUDA will generally require you to write the kernel code in CUDA C++.

PyCUDA gives more kernel code capability/flexibility than Numba.

There are a lot of other differences between Numba and PyCUDA; I wouldn’t be able to summarize all the differences here.

Thanks a lot for your help. Appreciate it.

Hi, I think I am having a similar problem. I have tried to implement the second example from the Numba CUDA examples section (1D Heat Equation). Running it with an array size of 20,001 fails, while an array size of 10,001 works fine. I have tried to set the number of blocks manually by setting threadsperblock to 1024 and solving for the number of blocks, which gives a grid size of 20. Below is the whole code:

import numpy as np
from numba import cuda
from matplotlib import pyplot as plt
import time

start = time.time()

#Setting Domain
size = 20001
data = np.zeros(size)


#Setting initial conditions
data[500] = 100

#Sending data to device (GPU)
buf_0 = cuda.to_device(data)
buf_1 = cuda.device_array_like(buf_0)

niter = 1000

#Solving the 1D Heat equation
@cuda.jit
def oneDHeat(buf_0, buf_1, timesteps, k):

    i = cuda.grid(1) #Create 1D grid in GPU

    if i >= len(buf_0):
        return
    
    #Prepare to do a grid-wide sync
    grid = cuda.cg.this_grid()

    for step in range(timesteps):
        # Select the buffer from the previous timestep
        if (step % 2) == 0:
            data = buf_0
            next_data = buf_1
        else:
            data = buf_1
            next_data = buf_0

        # Get the current temperature associated with this point
        curr_temp = data[i]

        # Applying 1st order central difference method
        if i == 0:
            # Left wall is held at T = 0
            next_temp = curr_temp + k * (data[i + 1] - (2 * curr_temp))
        elif i == len(data) - 1:
            # Right wall is held at T = 0
            next_temp = curr_temp + k * (data[i - 1] - (2 * curr_temp))
        else:
            # Interior points are a weighted average of their neighbors
            next_temp = curr_temp + k * (
                data[i - 1] - (2 * curr_temp) + data[i + 1]
            )

        #Store the value in array n+1 for use in next iteration
        next_data[i] = next_temp

        #Wait for the whole grid to finish so that no thread reads stale values from a slower neighbor
        grid.sync()

threadsperblock = 1024
blockspergrid = (size + (threadsperblock-1)) // threadsperblock

#Calling the kernel
oneDHeat[blockspergrid, threadsperblock](buf_0,buf_1, niter, 0.25)

final_res = buf_0.copy_to_host()

x_pos = np.arange(size)

end = time.time()

print(end-start)

Thanks for the help!

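Maybe a grid size of 20 blocks is too large for a cooperative grid launch. A kernel that uses grid.sync() needs every block of the grid to be resident on the GPU at the same time, so the grid size is capped by what the device can hold at once.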
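A quick check of the block counts for the two array sizes reported above shows what that hypothesis would imply. This is only arithmetic on the numbers in the post, and blocks_for is an illustrative helper, not a Numba function.

```python
def blocks_for(size, threads_per_block=1024):
    # Same ceiling division as blockspergrid in the posted code
    return (size + threads_per_block - 1) // threads_per_block

works = blocks_for(10001)  # launch that succeeded
fails = blocks_for(20001)  # launch that failed
print(works, fails)  # 10 20

# If the failure really is a cooperative-launch cap, that cap must be
# at least `works` and below `fails` at 1024 threads per block.
possible_caps = list(range(works, fails))
print(possible_caps[0], possible_caps[-1])  # 10 19
```

Numba's documentation describes a max_cooperative_grid_blocks(blockdim) method on a compiled kernel overload for querying the real figure on a given device. It is not called here because it needs a GPU, so treat the range above as an inference rather than a measurement.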