UNKNOWN_CUDA_ERROR after/during kernel execution using Numba

Hello,

I am quite new to CUDA and I am stuck on an error that occurs when I run the code below with a larger quantities array, or with higher grid or block dimensions. The code behaves as expected with smaller arrays.

I run this test on Windows 10 on a GTX 980. The size of the array that triggers the error may depend on your GPU; it seems I am exceeding a limit somewhere. The error looks like:

CudaAPIError                              Traceback (most recent call last)
<ipython-input-9-29f1940fb1d0> in <module>()
     33 
     34 #Get the output of the calculations from the GPU to the host
---> 35 combs = out_device.copy_to_host()
     36 
     37 print(combs)

G:\Applications\Anaconda\lib\site-packages\numba\cuda\cudadrv\devices.py in _require_cuda_context(*args, **kws)
    210     def _require_cuda_context(*args, **kws):
    211         get_context()
--> 212         return fn(*args, **kws)
    213 
    214     return _require_cuda_context

G:\Applications\Anaconda\lib\site-packages\numba\cuda\cudadrv\devicearray.py in copy_to_host(self, ary, stream)
    250         assert self.alloc_size >= 0, "Negative memory size"
    251         if self.alloc_size != 0:
--> 252             _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
    253 
    254         if ary is None:

G:\Applications\Anaconda\lib\site-packages\numba\cuda\cudadrv\driver.py in device_to_host(dst, src, size, stream)
   1774         fn = driver.cuMemcpyDtoH
   1775 
-> 1776     fn(host_pointer(dst), device_pointer(src), size, *varargs)
   1777 
   1778 

G:\Applications\Anaconda\lib\site-packages\numba\cuda\cudadrv\driver.py in safe_cuda_api_call(*args)
    286             _logger.debug('call driver api: %s', libfn.__name__)
    287             retcode = libfn(*args)
--> 288             self._check_error(fname, retcode)
    289         return safe_cuda_api_call
    290 

G:\Applications\Anaconda\lib\site-packages\numba\cuda\cudadrv\driver.py in _check_error(self, fname, retcode)
    321                     _logger.critical(msg, _getpid(), self.pid)
    322                     raise CudaDriverError("CUDA initialized before forking")
--> 323             raise CudaAPIError(retcode, msg)
    324 
    325     def get_device(self, devnum=0):

CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR

The goal of the kernel is to parallelize the generation of combinations (Cartesian-product tuples, not the "mathematical" n-choose-k combinations) and to perform a computation on each of them. In real use, the quantities array would be significantly larger.

import numpy as np
import math
from numba import njit, jit, cuda, vectorize, guvectorize
import numba

### FUNCTIONS TO BE MERGED
def power_added(quantities):
    pow_add = [1]
    result=1
    cpy = quantities.copy()
    while cpy:
        result *=cpy.pop(0)+1
        pow_add.append(result)
        
    pow_add.pop(-1)
    
    return pow_add
	
def power(quantities):   
    result=1
    cpy = quantities.copy()
    
    while cpy:
        result *=cpy.pop(0)+1
        
    return result

# Definition of variables to be processed
# quantities = [3,4,5,6,7] #returns an error
quantities = [3,4,5,6]
pow_add = power_added(quantities)
pow = power(quantities)

length_q = len(quantities)

# Transformation into tuple to pass to the kernel
q_tup = tuple(quantities)
pa_tup = tuple(pow_add)

#Number of combinations per thread (must evenly divide the total number of combinations):
cpt = 20

#Copy the data to the device once to avoid useless transfers
q_device = cuda.to_device(quantities)
pa_device = cuda.to_device(pow_add)

out_device = cuda.device_array(shape=(pow//cpt,cpt,length_q), dtype=np.int32)
#print(out_device.copy_to_host())

@cuda.jit
def kernel(an_array, q, pa):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    # The above is equivalent to pos = cuda.grid(1)
    if pos < an_array.size:  # Check array boundaries
        """an_array[pos] = compute(x,y)""" 
        compute(pos, q, pa, an_array[pos])

@cuda.jit(device = True)
def compute(index,q,pa,comb):
    
    #Get the size of 'quantities' to know the nb of numbers per combination
    length_q = len(q)
    
    #If we are to compute multiple combinations in a single thread
    for i in range(cpt):
        #For each number in the combination, compute it
        for L in range(length_q):
            comb[i][L] = (cpt*index+i)//pa[L]%(q[L]+1)
        
    # To improve performance :
    # Compute yield of the combination in the thread and only return the best
    # one and its yield (to compare with other threads): local optimum

#Caution when choosing the dimensions, use the device info & size of output
n_blocks = 30 # must be <= MAX_GRID_DIM_X
n_threads_per_block = 128 # must be <= MAX_THREADS_PER_BLOCK (if multi dim, the product must be <=)

#Calls the function with the data already stored on the device
kernel[n_blocks, n_threads_per_block](out_device, q_device, pa_device)

#Get the output of the calculations from the GPU to the host
combs = out_device.copy_to_host()

print(combs)

and my GPU returns:

Global memory occupancy:80.765915% free

===Attributes for device 0
ASYNC_ENGINE_COUNT:2
CAN_MAP_HOST_MEMORY:1
CLOCK_RATE:1291000
COMPUTE_CAPABILITY_MAJOR:5
COMPUTE_CAPABILITY_MINOR:2
COMPUTE_MODE:DEFAULT
CONCURRENT_KERNELS:1
ECC_ENABLED:0
GLOBAL_L1_CACHE_SUPPORTED:1
GLOBAL_MEMORY_BUS_WIDTH:256
GPU_OVERLAP:1
INTEGRATED:0
KERNEL_EXEC_TIMEOUT:1
L2_CACHE_SIZE:2097152
LOCAL_L1_CACHE_SUPPORTED:1
MANAGED_MEMORY:1
MAXIMUM_SURFACE1D_LAYERED_LAYERS:2048
MAXIMUM_SURFACE1D_LAYERED_WIDTH:16384
MAXIMUM_SURFACE1D_WIDTH:16384
MAXIMUM_SURFACE2D_HEIGHT:65536
MAXIMUM_SURFACE2D_LAYERED_HEIGHT:16384
MAXIMUM_SURFACE2D_LAYERED_LAYERS:2048
MAXIMUM_SURFACE2D_LAYERED_WIDTH:16384
MAXIMUM_SURFACE2D_WIDTH:65536
MAXIMUM_SURFACE3D_DEPTH:4096
MAXIMUM_SURFACE3D_HEIGHT:4096
MAXIMUM_SURFACE3D_WIDTH:4096
MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS:2046
MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH:16384
MAXIMUM_SURFACECUBEMAP_WIDTH:16384
MAXIMUM_TEXTURE1D_LAYERED_LAYERS:2048
MAXIMUM_TEXTURE1D_LAYERED_WIDTH:16384
MAXIMUM_TEXTURE1D_LINEAR_WIDTH:134217728
MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH:16384
MAXIMUM_TEXTURE1D_WIDTH:65536
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT:16384
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES:2048
MAXIMUM_TEXTURE2D_ARRAY_WIDTH:16384
MAXIMUM_TEXTURE2D_GATHER_HEIGHT:16384
MAXIMUM_TEXTURE2D_GATHER_WIDTH:16384
MAXIMUM_TEXTURE2D_HEIGHT:65536
MAXIMUM_TEXTURE2D_LINEAR_HEIGHT:65536
MAXIMUM_TEXTURE2D_LINEAR_PITCH:1048544
MAXIMUM_TEXTURE2D_LINEAR_WIDTH:65536
MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT:16384
MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH:16384
MAXIMUM_TEXTURE2D_WIDTH:65536
MAXIMUM_TEXTURE3D_DEPTH:4096
MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE:16384
MAXIMUM_TEXTURE3D_HEIGHT:4096
MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE:2048
MAXIMUM_TEXTURE3D_WIDTH:4096
MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE:2048
MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS:2046
MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH:16384
MAXIMUM_TEXTURECUBEMAP_WIDTH:16384
MAX_BLOCK_DIM_X:1024
MAX_BLOCK_DIM_Y:1024
MAX_BLOCK_DIM_Z:64
MAX_GRID_DIM_X:2147483647
MAX_GRID_DIM_Y:65535
MAX_GRID_DIM_Z:65535
MAX_PITCH:2147483647
MAX_REGISTERS_PER_BLOCK:65536
MAX_REGISTERS_PER_MULTIPROCESSOR:65536
MAX_SHARED_MEMORY_PER_BLOCK:49152
MAX_SHARED_MEMORY_PER_MULTIPROCESSOR:98304
MAX_THREADS_PER_BLOCK:1024
MAX_THREADS_PER_MULTIPROCESSOR:2048
MEMORY_CLOCK_RATE:3505000
MULTIPROCESSOR_COUNT:16
MULTI_GPU_BOARD:0
MULTI_GPU_BOARD_GROUP_ID:0
PCI_BUS_ID:1
PCI_DEVICE_ID:0
PCI_DOMAIN_ID:0
STREAM_PRIORITIES_SUPPORTED:1
SURFACE_ALIGNMENT:512
TCC_DRIVER:0
TEXTURE_ALIGNMENT:512
TEXTURE_PITCH_ALIGNMENT:32
TOTAL_CONSTANT_MEMORY:65536
UNIFIED_ADDRESSING:1
WARP_SIZE:32

Some posts indicated this error could come from the Windows WDDM TDR, so I disabled it and nothing changed.

How do I determine the max grid/block dimensions? I assumed MAX_THREADS_PER_BLOCK:1024 and MAX_GRID_DIM_X:2147483647 would be my limits, and that MULTIPROCESSOR_COUNT:16 indicates the number of blocks running at the same time.

What causes this error and how can I prevent it? It could be a memory issue, but that would be surprising, as the size of the array required in the example is only 336205 elements.

Thank you for your help!
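For reference, the indexing scheme in compute is a mixed-radix decomposition of the flat combination index; the same formula can be checked on the host with plain NumPy (a sketch, using the quantities = [3,4,5,6] case):

```python
import numpy as np

quantities = [3, 4, 5, 6]

# pa[L] is the product of (q + 1) over all positions before L
# (the mixed-radix place values, same as power_added above)
pa = [1]
for q in quantities[:-1]:
    pa.append(pa[-1] * (q + 1))

total = 1
for q in quantities:
    total *= q + 1  # 840 combinations for [3, 4, 5, 6]

# Decode every flat index n into its combination, exactly as the kernel does
combs = np.array([[n // pa[L] % (quantities[L] + 1) for L in range(len(quantities))]
                  for n in range(total)])

# Every possible tuple appears exactly once
assert len({tuple(row) for row in combs.tolist()}) == total
```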

[s]I ran your code with:

quantities = [3,4,5,6]

and:

quantities = [3,4,5,6,7]

and in both cases, no errors were thrown. I’m running on a “much smaller” GPU than yours, a Quadro K2000, which has two cc 3.0 SMs (MULTIPROCESSOR_COUNT = 2). However, I am running on Linux, where there is no kernel timeout.

My guess would be that something in your setup, rather than a hardware limit, is the problem.

Your assumptions about grid/block dimensions are correct: blocks are limited to 1024 threads, and the grid x dimension can be quite large. You are not hitting any limit there, so an undiscovered limit is not the source of your problem. Your statement about the multiprocessor count is not quite correct (it is the number of SMs, not a cap on how many blocks you can launch), but it is also not relevant to the issue you have.

I think it is likely a kernel timeout, and that you have not correctly disabled the WDDM timeout. Your code runs correctly for me on Linux, where there is no timeout.[/s]
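In general, the grid size should be derived from the problem size rather than picked as a fixed number; a minimal sketch of the usual ceiling-division pattern (the numbers here are illustrative, taken from the quantities = [3,4,5,6] case):

```python
import math

n_work_items = 42        # e.g. out_device.shape[0]: one thread per output row
threads_per_block = 128  # must not exceed MAX_THREADS_PER_BLOCK (1024 on the GTX 980)

# Ceiling division: just enough blocks to cover every work item
n_blocks = math.ceil(n_work_items / threads_per_block)

# The grid covers n_blocks * threads_per_block >= n_work_items threads,
# so the kernel still needs an in-bounds guard for the excess threads
assert n_blocks * threads_per_block >= n_work_items
```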

Thank you for the reply.

I used Nsight Monitor to disable the timeout as described here: https://tinyurl.com/yaz7ctw7.
I tried both setting the TDR to 10 s and disabling it entirely, rebooting the system between each attempt.

Moreover, I discovered cuda-memcheck, ran it on my file, and it reports many errors like this one:

========= Invalid __global__ write of size 4
=========     at 0x00000bd8 in cudapy::__main__::kernel$241(Array<int, int=3, A, mutable, aligned>,
 Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>)
=========     by thread (127,0,0) in block (15,0,0)
=========     Address 0x8006c8270 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuLaunchKernel + 0x1fd) [0x1c116d]
=========     Host Frame:G:\Applications\Anaconda\DLLs\_ctypes.pyd (DllCanUnloadNow + 0x7313) [0x11b23]
=========     Host Frame:G:\Applications\Anaconda\DLLs\_ctypes.pyd (DllCanUnloadNow + 0x59e3) [0x101f3]
=========     Host Frame:G:\Applications\Anaconda\DLLs\_ctypes.pyd (DllCanUnloadNow + 0xf63) [0xb773]
=========     Host Frame:G:\Applications\Anaconda\DLLs\_ctypes.pyd (DllCanUnloadNow + 0x1ab9) [0xc2c9]
=========     Host Frame:G:\Applications\Anaconda\DLLs\_ctypes.pyd [0x6f6a]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyNumber_Long + 0x6d8) [0x10544]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0x1142) [0x2fab2]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeWithName + 0x1a8) [0x1f188]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyMethodDef_RawFastCallKeywords + 0xc03) [0x2e813]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0x8ae) [0x2f21e]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyMethodDef_RawFastCallKeywords + 0xaea) [0x2e6fa]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0x4ab) [0x2ee1b]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeWithName + 0x1a8) [0x1f188]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyFunction_FastCallDict + 0x1ba) [0x1ee0a]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyObject_Call_Prepend + 0x6c) [0x5930c]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyType_GetDocFromInternalDoc + 0x239) [0x59279]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyNumber_Long + 0x6d8) [0x10544]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0x1142) [0x2fab2]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeWithName + 0x1a8) [0x1f188]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyMethodDef_RawFastCallKeywords + 0xc03) [0x2e813]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0xfeb) [0x2f95b]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeWithName + 0x1a8) [0x1f188]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyFunction_FastCallDict + 0x1ba) [0x1ee0a]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyObject_Call_Prepend + 0x6c) [0x5930c]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyType_GetDocFromInternalDoc + 0x239) [0x59279]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyNumber_Long + 0x6d8) [0x10544]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0x1142) [0x2fab2]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeWithName + 0x1a8) [0x1f188]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyFunction_FastCallDict + 0x1ba) [0x1ee0a]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyObject_Call_Prepend + 0x6c) [0x5930c]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyType_GetDocFromInternalDoc + 0x239) [0x59279]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyObject_FastCallKeywords + 0x17a) [0x2674a]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyMethodDef_RawFastCallKeywords + 0xc35) [0x2e845]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalFrameDefault + 0x4ab) [0x2ee1b]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeWithName + 0x1a8) [0x1f188]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCodeEx + 0x9b) [0x4963]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyEval_EvalCode + 0x2d) [0x48c1]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyArena_Free + 0xcb) [0x486b]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyRun_FileExFlags + 0xc5) [0x1b7541]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyRun_SimpleFileExFlags + 0x250) [0x1b7d70]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyRun_AnyFileExFlags + 0x63) [0x1b740f]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (Py_UnixMain + 0x653) [0xfd4d7]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (Py_UnixMain + 0x6fe) [0xfd582]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyErr_NoMemory + 0x31fba) [0x92776]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (PyInterpreterState_New + 0x3aa) [0x3166a]
=========     Host Frame:G:\Applications\Anaconda\python37.dll (Py_Main + 0x158) [0x312b4]
=========     Host Frame:G:\Applications\Anaconda\python.exe [0x1260]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17bd4]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6cee1]

It reports the same kind of error even on files that run properly. Any clue about the origin of this invalid write operation?

Yes, you are correct, the code has errors, and this is likely the source of the problem, not a kernel timeout as I previously stated. My bad for steering in the wrong direction.

You are indexing out of bounds for one or more of your arrays.

This occurs even when I set

quantities = [3,4]

although fewer errors are produced. They are all of the invalid-write type, which means you are indexing out of bounds on the writes to an_array.

Print out the shape of your out_device array, then see whether the indexing into it makes sense.

This doesn’t look right to me:

if pos < an_array.size:  # Check array boundaries

don’t you want something like:

if pos < an_array.shape[0]:  # Check array boundaries
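The difference between the two guards is easy to check on the host; a sketch using the numbers from the quantities = [3,4,5,6] run (output shape (42, 20, 4), launch of 30 × 128 threads):

```python
import numpy as np

out = np.zeros((42, 20, 4), dtype=np.int32)  # same shape as out_device

n_threads = 30 * 128  # 3840 threads launched

# Guard with .size: 3360 threads pass, but only indices 0..41 are valid rows
passing_with_size = sum(1 for pos in range(n_threads) if pos < out.size)
# Guard with .shape[0]: exactly one thread per row of the output
passing_with_shape = sum(1 for pos in range(n_threads) if pos < out.shape[0])

assert passing_with_size == 3360   # threads 42..3359 would write out of bounds
assert passing_with_shape == 42    # matches the 42 rows; 42 * 20 = 840 combinations
```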

I checked my array: it was previously a 1D array and is now a 3D array, so the old indexing is no longer right.

I made exactly the modification you suggested, and it works like a charm.

Thank you for your help. Issue solved.