nano + tensorflow + numba

Hi

I would like to use numba with Python to accelerate part of my code (a matrix computation that retrieves the distance of each detected object's bounding box from TensorFlow).
But every time I run my code, after a few seconds I get an error from numba, and I am not sure where it comes from.
I have tried decreasing the block size and the threads per block in my CUDA kernel launch, but the problem is still there.
Is it possible that this happens because the GPU is too small to be used by TensorFlow and numba at the same time?
As we can see, TensorFlow created its device with 266 MB and there is only 864 MB free.
Here is the error output I got:

totalMemory: 3.86GiB freeMemory: 864.35MiB
2019-07-01 17:01:30.047620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-01 17:01:36.958809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 17:01:36.958920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-01 17:01:36.958967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-01 17:01:36.959312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 266 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)

INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
...
...
ERROR:numba.cuda.cudadrv.driver:Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
Traceback (most recent call last):
  File "realSense_tf_trt.py", line 183, in <module>
    main()
  File "realSense_tf_trt.py", line 173, in main
    loop_and_detect(cam, tf_sess, DEFAULT_THRS, vis, od_type=od_type)
  File "realSense_tf_trt.py", line 110, in loop_and_detect
    img = vis.draw_bboxes(img, depth, box, conf, cls, intri)
  File "/home/enroutenano/thirdPartyLib/TFTRT/faceDetec_trt/enroute_utils/visualization.py", line 173, in draw_bboxes
    getDistance[blockspergrid, threadperblock](depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy, d_ptsX, d_ptsY, d_ptsZ)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 808, in __call__
    cfg(*args)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 538, in __call__
    sharedmem=self.sharedmem)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 612, in _kernel_call
    cu_func(*kernelargs)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 1517, in __call__
    self.sharedmem, streamhandle, args)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 1561, in launch_kernel
    None)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 293, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 328, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

And here is my CUDA kernel declaration:

@cuda.jit
def getDistance(depthImg, ppx, ppy, fx, fy, pts3D_x, pts3D_y, pts3D_z):
	#x, y = cuda.grid(2)
	tx = cuda.threadIdx.x
	ty = cuda.threadIdx.y
	bx = cuda.blockIdx.x
	by = cuda.blockIdx.y
	bw = cuda.blockDim.x
	posX = tx + bx * bw
	posY = ty + by * bw
	
	if posX < depthImg.shape[0] and posY < depthImg.shape[1]:
		pts3D_x[posX,posY] = (posX - ppx) * depthImg[posX,posY] / fx
		pts3D_y[posX,posY] = (posY - ppy) * depthImg[posY, posY] / fy
		pts3D_z[posX,posY] = depthImg[posX, posY]

And here is the function where I call the CUDA kernel; the kernel is launched once for each object detected by TensorFlow.

def draw_bboxes(self, img, depth, box, conf, cls, intri):
		"""Draw detected bounding boxes on the original image."""
		for bb, cf, cl in zip(box, conf, cls):
			cl = int(cl)
			y_min, x_min, y_max, x_max = bb[0], bb[1], bb[2], bb[3]
			depthObj = depth[x_min:x_max,y_min:y_max].astype(np.float32)
			depthObj = depthObj * 0.001
			y, x = depthObj.shape

	
			pts3D_X = np.empty_like(depthObj)
			pts3D_Y = np.empty_like(depthObj)
			pts3D_Z = np.empty_like(depthObj)
			#cuda memory
			d_ptsX = cuda.to_device(pts3D_X)
			d_ptsY = cuda.to_device(pts3D_Y)
			d_ptsZ = cuda.to_device(pts3D_Z)
			
			threadperblock = (2, 1)
			blockspergrid_x = int(math.ceil(y / 2))
			blockspergrid_y = int(math.ceil(x / 1))
			blockspergrid = (blockspergrid_x, blockspergrid_y)
			getDistance[blockspergrid, threadperblock](depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy, d_ptsX, d_ptsY, d_ptsZ)
			#
			pts3D_X = d_ptsX.copy_to_host()
			pts3D_Y = d_ptsY.copy_to_host()
			pts3D_Z = d_ptsZ.copy_to_host()
			
			pts3D_X = np.reshape(pts3D_X, x*y)
			pts3D_Y = np.reshape(pts3D_Y, x*y)
			pts3D_Z = np.reshape(pts3D_Z, x*y)
			if len(pts3D_X) > 0:
				x = np.median(pts3D_X)
				y = np.median(pts3D_Y)
				z = np.median(pts3D_Z)
				distance = str(math.sqrt(x*x + y*y + z*z))
			else:
				distance = "??"
			color = self.colors[cl-1]
			cv2.rectangle(img, (x_min, y_min), (x_max, y_max), color, 2)
			txt_loc = (max(x_min+2, 0), max(y_min+2, 0))
			cls_name = self.cls_dict.get(cl-1)
			txt = '{} {:.2f}  {} meters'.format(cls_name, cf, distance)
			img = draw_boxed_text(img, txt, txt_loc, color)
		return img

Without CUDA I can run the code, but because of the distance computation from the depth image, the frame rate drops from 15 fps with no detected object to 5 fps when an object is detected.
That is why I want to compute the distance on the GPU.

Does anyone have an idea what causes this numba error?
Thanks in advance.

Hi,

CUDA_ERROR_INVALID_VALUE indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.

Here are two suggestions for you:

1. Please check if all the parameters in your app are valid.
You can get the hardware information with this sample:

$ /usr/local/cuda-10.0/bin/cuda-install-samples-10.0.sh .
$ cd NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

2. Please check whether your numba library is compiled for the correct GPU architecture. The Nano is sm_53 (compute capability 5.3).
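
As an illustration of the first suggestion, here is a minimal sketch of a launch-configuration check done on the host before the kernel is called. `launch_config` is a hypothetical helper written for this example, not part of numba, and it assumes a 2D region such as the cropped depth image. The traceback shows Python 2.7, where `y / 2` on integers is floor division, so `int(math.ceil(y / 2))` gives 0 for a one-row crop. A zero-sized grid dimension is one plausible way for cuLaunchKernel to report CUDA_ERROR_INVALID_VALUE, and an empty crop from a degenerate bounding box would have the same effect. Device limits such as the maximum threads per block are not checked here; deviceQuery reports them.

```python
def launch_config(shape, threads_per_block):
    """Return (blocks_per_grid, threads_per_block) for a 2D array shape.

    Hypothetical helper: raises ValueError on the host instead of letting
    a zero-sized grid or block dimension reach the CUDA driver.
    """
    rows, cols = shape
    tpb_x, tpb_y = threads_per_block
    if rows <= 0 or cols <= 0:
        raise ValueError("empty region, nothing to launch: shape=%r" % (shape,))
    if tpb_x <= 0 or tpb_y <= 0:
        raise ValueError("invalid block size: %r" % (threads_per_block,))
    # Integer ceiling division: same result in Python 2 and Python 3,
    # unlike math.ceil(rows / tpb_x), which truncates first in Python 2.
    blocks_x = (rows + tpb_x - 1) // tpb_x
    blocks_y = (cols + tpb_y - 1) // tpb_y
    return (blocks_x, blocks_y), (tpb_x, tpb_y)


if __name__ == "__main__":
    # A one-row crop still gets one block along the first axis.
    print(launch_config((1, 4), (2, 1)))
```

In the posted `draw_bboxes`, the returned pair could be used as `getDistance[blocks, threads](...)`, with the ValueError caught to skip empty boxes.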

Thanks.

Dear AastaLLL,

Thank you for your answer.
Indeed, I checked the indices I was using for the CUDA grid, and as you said, one value was out of range, which led to the error.
Once again, thank you for your answer.
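
The thread does not say which value was out of range. For readers comparing against the kernel as posted, one detail of the index arithmetic is worth checking separately: `posY` is computed with `cuda.blockDim.x` rather than `cuda.blockDim.y`, which only works for square blocks. The pure-Python sketch below simulates that arithmetic on the CPU, so it needs no GPU. `covered_cells` is a hypothetical helper written for illustration, and the 4x4 shape is an arbitrary example.

```python
def covered_cells(shape, threads_per_block, use_block_dim_y):
    """Simulate which (posX, posY) cells a 2D launch writes to.

    use_block_dim_y=False mimics the posted kernel (posY = ty + by * blockDim.x);
    use_block_dim_y=True uses blockDim.y for the second axis.
    """
    rows, cols = shape
    tpb_x, tpb_y = threads_per_block
    blocks_x = (rows + tpb_x - 1) // tpb_x
    blocks_y = (cols + tpb_y - 1) // tpb_y
    stride_y = tpb_y if use_block_dim_y else tpb_x
    cells = set()
    for bx in range(blocks_x):
        for by in range(blocks_y):
            for tx in range(tpb_x):
                for ty in range(tpb_y):
                    pos_x = tx + bx * tpb_x
                    pos_y = ty + by * stride_y
                    # Same bounds guard as the posted kernel.
                    if pos_x < rows and pos_y < cols:
                        cells.add((pos_x, pos_y))
    return cells


if __name__ == "__main__":
    posted = covered_cells((4, 4), (2, 1), use_block_dim_y=False)
    fixed = covered_cells((4, 4), (2, 1), use_block_dim_y=True)
    print(len(posted), len(fixed))
```

With (2, 1) blocks, the posted arithmetic reaches only the even indices along the second axis, so the other entries of the `np.empty_like` buffers are never written. Using `cuda.grid(2)`, which is commented out in the posted kernel, avoids computing the positions by hand. Separately, `depthImg[posY, posY]` in the `pts3D_y` line looks like a typo for `depthImg[posX, posY]`.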

Best