CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE in Python

I get this error when running the code below in Python using CUDA. I’m following this tutorial, but I’m trying it on a Windows 7 x64 machine.

I ran check_cuda() and all tests passed. Can anyone help me figure out what the exact issue is?

My Code:

import numpy as np
from timeit import default_timer as timer
from numbapro import vectorize, cuda

# Element-wise add, compiled for the GPU
@vectorize(['float64(float64, float64)'], target='gpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float64)
    B = np.ones(N, dtype=np.float64)
    C = np.zeros(N, dtype=np.float64)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))

    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()

Error Message:

CudaAPIError                              Traceback (most recent call last)
<ipython-input-18-2436fc2ab63a> in <module>()
      1 if __name__ == '__main__':
----> 2     main()

<ipython-input-17-64de53fdbe77> in main()
      8     start = timer()
----> 9     C = VectorAdd(A, B)
     10     vectoradd_time = timer() - start

C:\Anaconda2\lib\site-packages\numba\cuda\dispatcher.pyc in __call__(self, *args, **kws)
     93                       the input arguments.
     94         """
---> 95         return, args, kws)
     97     def reduce(self, arg, stream=0):

C:\Anaconda2\lib\site-packages\numba\npyufunc\deviceufunc.pyc in call(cls, typemap, args, kws)
    298             devarys.extend([devout])
--> 299             cr.launch(func, shape[0], stream, devarys)
    301             if any_device:

C:\Anaconda2\lib\site-packages\numba\cuda\dispatcher.pyc in launch(self, func, count, stream, args)
    203     def launch(self, func, count, stream, args):
--> 204         func.forall(count, stream=stream)(*args)
    206     def is_device_array(self, obj):

C:\Anaconda2\lib\site-packages\numba\cuda\compiler.pyc in __call__(self, *args)
    194         return kernel.configure(blkct, tpb,,
--> 195                                 sharedmem=self.sharedmem)(*args)
    197 class CUDAKernelBase(object):

C:\Anaconda2\lib\site-packages\numba\cuda\compiler.pyc in __call__(self, *args, **kwargs)
    357                           blockdim=self.blockdim,
    358                 ,
--> 359                           sharedmem=self.sharedmem)
    361     def bind(self):

C:\Anaconda2\lib\site-packages\numba\cuda\compiler.pyc in _kernel_call(self, args, griddim, blockdim, stream, sharedmem)
    431                                    sharedmem=sharedmem)
    432         # Invoke kernel
--> 433         cu_func(*kernelargs)
    435         if self.debug:

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in __call__(self, *args)
   1115         launch_kernel(self.handle, self.griddim, self.blockdim,
-> 1116                       self.sharedmem, streamhandle, args)
   1118     @property

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in launch_kernel(cufunc_handle, griddim, blockdim, sharedmem, hstream, args)
   1158                           hstream,
   1159                           params,
-> 1160                           None)

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in safe_cuda_api_call(*args)
    220         def safe_cuda_api_call(*args):
    221             retcode = libfn(*args)
--> 222             self._check_error(fname, retcode)
    224         setattr(self, fname, safe_cuda_api_call)

C:\Anaconda2\lib\site-packages\numba\cuda\cudadrv\driver.pyc in _check_error(self, fname, retcode)
    250             errname = ERROR_MAP.get(retcode, "UNKNOWN_CUDA_ERROR")
    251             msg = "Call to %s results in %s" % (fname, errname)
--> 252             raise CudaAPIError(retcode, msg)
    254     def get_device(self, devnum=0):

CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

It may be that you’re not compiling for the correct GPU type.

If the code starts working when you change N to 32000:

N = 32000

it would tend to confirm this theory.

What GPU do you have? (check_cuda should report this.)

Thanks a lot txbob! It works!!!

But I have a question: when I run the program, the vectorized GPU implementation takes longer than the CPU version. Is that because my machine spends extra time moving the GPU-compiled binaries to RAM and then to the CPU?

FYI, I have a Quadro 4000 CUDA device.

Not every possible thing you could imagine will run faster on the GPU than on the CPU. This topic is covered extensively on other questions across a great many forums, such as this one and stackoverflow, so I’ll be brief. A necessary condition for GPU acceleration to be advantageous is that the benefit of doing work on the GPU exceeds the cost of getting the data there.

Unpacking this statement for a specific case involves various considerations. In this particular case, vector add is a commonly used example for learning purposes, but by itself does not usually satisfy the previously stated necessary condition. The work intensity being done on the GPU is not sufficient to make the GPU run dramatically faster than the CPU, and the GPU benefit in this case (if any) is not large enough to outweigh the cost of transferring the various vectors to/from the device.
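To put rough numbers on that for the vector add in this thread, here is a back-of-the-envelope sketch. It assumes, as a simplification, that A and B are each copied to the device once and C is copied back once; the function name vector_add_traffic is made up for illustration.

```python
def vector_add_traffic(n, itemsize=8):
    """Bytes moved across the bus and additions performed for C = A + B.

    Assumption: A and B go host -> device, C comes device -> host,
    and nothing is reused between calls. itemsize=8 is float64.
    """
    bytes_moved = 3 * n * itemsize   # two input arrays in, one output array out
    additions = n                    # one addition per element
    return bytes_moved, additions

bytes_moved, additions = vector_add_traffic(32000000)
print("bytes transferred: %d" % bytes_moved)
print("bytes transferred per addition: %.1f" % (bytes_moved / float(additions)))
```

Every addition drags 24 bytes across the bus under these assumptions, which is why a lone vector add tends to be dominated by transfer time rather than by arithmetic.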

If what you intend to do on the GPU is a single vector add and nothing else, you should almost certainly not waste your time with GPU programming.

I understand your explanation. This is my first time on this forum, by the way. The purpose of trying this is to see how well this technology fits training machine learning algorithms, in particular neural nets in my case, not just adding vectors of course. Anyway, thanks for your help!

“It may be that you’re not compiling for the correct GPU type.”

I am also facing the same issue, i.e. my code runs for lower values, but as I increase the value of N the program crashes. I use an Nvidia Quadro 4000 GPU. Can you please elaborate on “compiling for correct GPU type”? Do I have to specify some command while compiling?
I use Python and run my code with

nvprof python

Quadro 4000 is a compute capability 2.0 GPU, so the maximum number of blocks in the x-dimension of a grid (which is what a 1D launch like this one uses) is 65535.

If you are using a recent version of CUDA such as CUDA 7, 7.5 or CUDA 8, the default compilation is for cc 2.0 so there is nothing you need to change.

Instead you will need to modify your code to take this limit into account. For example, in your case, on a cc 2.0 device where you want to handle a vector too large for a single launch, you could rewrite your vectorize routine using guvectorize so that each thread handles more than one element.

Note that numba is not an NVIDIA product; it is published by

In my case, at N=60000000 the program runs fine, but for N=70000000 it returns the following error:
(I am using 1024 threads per block)

raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

As per your answer, if I rewrite my routine using guvectorize then this will be resolved. Am I right?

I also read that pyCuda can be used to run algorithms on the GPU. Which of Numba and pyCuda should I use to make my algorithm parallel? Is there any major difference between the two? Which one provides more functionality on the GPU?
My understanding is that we can only pass numpy arrays to the GPU when using Numba. If we use pyCuda, can it take other inputs as well, like lists?

Which one should I opt for?

Yes, 60000000/1024 is less than the 65535 limit, but 70000000/1024 is greater than the 65535 limit.
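The same arithmetic as a small check you could run before launching. This is only a sketch: it assumes a 1D launch with one element per thread, and the helper names blocks_needed and fits_in_one_launch are made up for illustration.

```python
# Grid x-dimension limit on compute capability 2.x devices
MAX_GRID_DIM_X_CC20 = 65535

def blocks_needed(n, threads_per_block):
    """Blocks required when each thread handles one element (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block

def fits_in_one_launch(n, threads_per_block, max_blocks=MAX_GRID_DIM_X_CC20):
    """True if a single 1D launch can cover all n elements."""
    return blocks_needed(n, threads_per_block) <= max_blocks

for n in (60000000, 70000000):
    print("N=%d needs %d blocks, fits: %s"
          % (n, blocks_needed(n, 1024), fits_in_one_launch(n, 1024)))
```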

I think you should be able to use guvectorize to work around this limit in this case, by having each thread perform the vector add for multiple elements of your vector. You could also just write a numba CUDA kernel using numba’s cuda.jit decorator.
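The "multiple elements per thread" idea is often written as a grid-stride loop. Below is a plain-Python simulation of the indexing only. It is not GPU code, and the function names are made up for illustration. In an actual numba kernel, I believe the starting index and the stride would come from cuda.grid(1) and cuda.gridsize(1) rather than being passed in as arguments.

```python
def grid_stride_add(a, b, out, thread_id, total_threads):
    """Work done by ONE simulated thread.

    The thread handles elements thread_id, thread_id + total_threads,
    thread_id + 2 * total_threads, ... until it runs off the end.
    """
    i = thread_id
    while i < len(out):
        out[i] = a[i] + b[i]
        i += total_threads

def simulated_launch(a, b, blocks, threads_per_block):
    """Run every simulated thread in turn (sequentially, on the CPU)."""
    out = [0.0] * len(a)
    total_threads = blocks * threads_per_block
    for tid in range(total_threads):
        grid_stride_add(a, b, out, tid, total_threads)
    return out

n = 1000
a = [1.0] * n
b = [2.0] * n
# Only 4 * 32 = 128 simulated threads for 1000 elements
c = simulated_launch(a, b, blocks=4, threads_per_block=32)
print("all elements computed: %s" % (c == [3.0] * n))
```

Because the number of blocks no longer has to grow with N, the launch can stay under the 65535-block limit no matter how long the vector is.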

numba allows you to write the kernel code in python.

pycuda will generally require you to write the kernel code in CUDA C++

pycuda gives more kernel code capability/flexibility than numba

there are a lot of other differences between numba and pycuda, I wouldn’t be able to summarize all the differences here.

Thanks a lot for your help. Appreciate it.