CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE with Numba

I’m getting the error: “numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE” when trying to run my code.

I’m new to working with CUDA, but as far as I can see the code stays within the limits for threads per block (at most 1024) and blocks per grid (at most 65535).
I’ve tried running this on 4 different GPUs (GTX 1070, GTX 1080, GTX 1080 Ti, TITAN X (Pascal)), all with the same error.
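
For reference, the relevant limits can also be queried directly in Numba; a quick sketch (assuming these attribute names are the ones Numba exposes on the device object):

from numba import cuda

dev = cuda.get_current_device()
# hardware limits for the currently selected device
print("max threads per block:", dev.MAX_THREADS_PER_BLOCK)
print("max block dims:", dev.MAX_BLOCK_DIM_X, dev.MAX_BLOCK_DIM_Y, dev.MAX_BLOCK_DIM_Z)
print("max grid dims:", dev.MAX_GRID_DIM_X, dev.MAX_GRID_DIM_Y, dev.MAX_GRID_DIM_Z)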

The error only occurs when the parameters ‘dec’ and ‘fsize’ are larger than 32 and 40 respectively.

Could anyone help me figure out why exactly this error occurs for these parameter values? Or is there something I am doing completely wrong here?

Many thanks.

My Code:

import numpy as np
import time
from numba import cuda


@cuda.jit("void(float32[:,:], float32[:,:,:])")
def ANAM(f, A):
    # 3D thread index: x -> row of f, tau -> lag (offset by 1), i -> sample position
    x, tau, i = cuda.grid(3)
    tau += 1
    n = f.shape[1]
    if x < A.shape[0] and tau < A.shape[1]+1 and i < n-tau and i >= tau:
        out = float(0)
        denom = (tau+1)**2/(n-2*tau)
        # accumulate absolute differences over a (tau+1) x (tau+1) window around sample i
        for j in range(0, tau+1):
            for l in range(0, tau+1):
                out += abs(f[x,i+j]-f[x,i-l])
        A[x, tau-1, i] = out/denom


cuda.select_device(3)  # id 3 = GTX 1080 Ti (see numba.cuda.detect() output below)
gpu = cuda.get_current_device()
print(gpu.name)

flength = 832 # fixed value

dec = 32 # larger value than this causes error
fsize = 40 # value larger than 36 causes error

# example data for testing
fs = np.tile(np.sin(np.linspace(-10,10,flength), dtype=np.float32), (fsize,1))


threadsperblock = (4, 4, 64)
blockspergrid = ((fsize + (threadsperblock[0] - 1)) // threadsperblock[0], 
                  (dec + (threadsperblock[1] - 1)) // threadsperblock[1], 
                  (fs.shape[1] + (threadsperblock[2] - 1)) // threadsperblock[2])

outbuf = np.zeros((fsize, dec-1, fs.shape[1]), dtype=np.float32)

'''
fs = cuda.to_device(fs)
outbuf = cuda.to_device(outbuf)
'''

t1 = time.perf_counter()
ANAM[threadsperblock, blockspergrid](fs, outbuf)
cuda.synchronize()

print(time.perf_counter() - t1)


'''
outbuf = outbuf.copy_to_host()
'''


outbuf = outbuf.sum(axis=2)
log_out = np.log(outbuf)
log_taus = np.log(np.arange(1, dec))
lin_regress_denom = ((log_taus**2).mean() - (log_taus.mean())**2)

print(2-((log_taus*log_out).mean(axis=1)-log_taus.mean()*log_out.mean(axis=1))/lin_regress_denom)

Error message:

Traceback (most recent call last):
  File "numbatest.py", line 54, in <module>
    ANAM[threadsperblock, blockspergrid](fs, outbuf)
  File "/home/robin.vanderlaag/pythonEnvironments/lib64/python3.6/site-packages/numba/cuda/compiler.py", line 822, in __call__
    self.stream, self.sharedmem)
  File "/home/robin.vanderlaag/pythonEnvironments/lib64/python3.6/site-packages/numba/cuda/compiler.py", line 966, in call
    kernel.launch(args, griddim, blockdim, stream, sharedmem)
  File "/home/robin.vanderlaag/pythonEnvironments/lib64/python3.6/site-packages/numba/cuda/compiler.py", line 699, in launch
    cooperative=self.cooperative)
  File "/home/robin.vanderlaag/pythonEnvironments/lib64/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 2100, in launch_kernel
    None)
  File "/home/robin.vanderlaag/pythonEnvironments/lib64/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 300, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/robin.vanderlaag/pythonEnvironments/lib64/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 335, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

Output of numba.cuda.detect():

Found 4 CUDA devices
id 0    b'NVIDIA GeForce GTX 1070'                              [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 2
id 1    b'NVIDIA GeForce GTX 1080'                              [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 3
id 2    b'NVIDIA TITAN X (Pascal)'                              [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 129
id 3    b'NVIDIA GeForce GTX 1080 Ti'                              [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 130
Summary:
        4/4 devices are supported

Output of nvidia-smi:

Sun Jan 16 11:52:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 22%   28C    P0    33W / 151W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 22%   28C    P0    40W / 180W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN X ...  Off  | 00000000:81:00.0 Off |                  N/A |
| 18%   32C    P0    56W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:82:00.0 Off |                  N/A |
| 18%   26C    P0    55W / 250W |      0MiB / 11178MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The launch configuration is reversed. Numba expects the grid dimensions first and the block dimensions second:

ANAM[blockspergrid, threadsperblock](fs, outbuf)

The code instead calls ANAM[threadsperblock, blockspergrid], so blockspergrid is used as the block shape. With dec = 32 and fsize = 40, blockspergrid comes out to (10, 8, 13), i.e. 1040 threads per block, which exceeds the 1024-thread limit, so cuLaunchKernel rejects the launch with CUDA_ERROR_INVALID_VALUE. Smaller values of dec and fsize happen to keep that product at or below 1024, which is why the error only appears for the larger parameter values. See the kernel invocation section of the Numba CUDA documentation.
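
A minimal sketch of the corrected launch, assuming the rest of the script stays as posted (the assert is only an extra sanity check, not something Numba requires):

from numba import cuda

threadsperblock = (4, 4, 64)
blockspergrid = ((fsize + threadsperblock[0] - 1) // threadsperblock[0],
                 (dec + threadsperblock[1] - 1) // threadsperblock[1],
                 (fs.shape[1] + threadsperblock[2] - 1) // threadsperblock[2])

# a block may contain at most 1024 threads in total
assert threadsperblock[0] * threadsperblock[1] * threadsperblock[2] <= 1024

# grid dimensions first, block dimensions second
ANAM[blockspergrid, threadsperblock](fs, outbuf)
cuda.synchronize()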