nano + tensorflow + numba

schoninger · July 1, 2019, 8:17am

Hi

I would like to use numba with Python to accelerate part of my code (a matrix computation that retrieves the distance of each detected object's bounding box from TensorFlow).
But every time I run my code, after a few seconds I get an error from numba, and I am not sure where it comes from.
I have tried decreasing the block size and the threads per block in my CUDA kernel launch, but the problem is still there.
Is it possible that this happens because the GPU is too small to be used with TensorFlow and numba at the same time?
As we can see, TensorFlow created its device with 266 MB, and there is only 864 MB free.
Here is the error output I got:

totalMemory: 3.86GiB freeMemory: 864.35MiB
2019-07-01 17:01:30.047620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-01 17:01:36.958809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 17:01:36.958920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-01 17:01:36.958967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-01 17:01:36.959312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 266 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)

INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
...
...
ERROR:numba.cuda.cudadrv.driver:Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
Traceback (most recent call last):
  File "realSense_tf_trt.py", line 183, in <module>
    main()
  File "realSense_tf_trt.py", line 173, in main
    loop_and_detect(cam, tf_sess, DEFAULT_THRS, vis, od_type=od_type)
  File "realSense_tf_trt.py", line 110, in loop_and_detect
    img = vis.draw_bboxes(img, depth, box, conf, cls, intri)
  File "/home/enroutenano/thirdPartyLib/TFTRT/faceDetec_trt/enroute_utils/visualization.py", line 173, in draw_bboxes
    getDistance[blockspergrid, threadperblock](depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy, d_ptsX, d_ptsY, d_ptsZ)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 808, in __call__
    cfg(*args)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 538, in __call__
    sharedmem=self.sharedmem)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 612, in _kernel_call
    cu_func(*kernelargs)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 1517, in __call__
    self.sharedmem, streamhandle, args)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 1561, in launch_kernel
    None)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 293, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 328, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE

And here is my CUDA kernel declaration:

@cuda.jit
def getDistance(depthImg, ppx, ppy, fx, fy, pts3D_x, pts3D_y, pts3D_z):
	# Global 2D thread position; equivalent to: posX, posY = cuda.grid(2)
	tx = cuda.threadIdx.x
	ty = cuda.threadIdx.y
	bx = cuda.blockIdx.x
	by = cuda.blockIdx.y
	bw = cuda.blockDim.x
	bh = cuda.blockDim.y
	posX = tx + bx * bw
	# The y offset has to be scaled by the block height (blockDim.y)
	posY = ty + by * bh
	
	# Skip threads that fall outside the depth crop
	if posX < depthImg.shape[0] and posY < depthImg.shape[1]:
		pts3D_x[posX,posY] = (posX - ppx) * depthImg[posX,posY] / fx
		pts3D_y[posX,posY] = (posY - ppy) * depthImg[posX,posY] / fy
		pts3D_z[posX,posY] = depthImg[posX,posY]

And here is the function where I call the CUDA kernel; it runs once for each object detected by TensorFlow.

def draw_bboxes(self, img, depth, box, conf, cls, intri):
		"""Draw detected bounding boxes on the original image."""
		for bb, cf, cl in zip(box, conf, cls):
			cl = int(cl)
			y_min, x_min, y_max, x_max = bb[0], bb[1], bb[2], bb[3]
			depthObj = depth[x_min:x_max,y_min:y_max].astype(np.float32)
			depthObj = depthObj * 0.001
			y, x = depthObj.shape

	
			pts3D_X = np.empty_like(depthObj)
			pts3D_Y = np.empty_like(depthObj)
			pts3D_Z = np.empty_like(depthObj)
			#cuda memory
			d_ptsX = cuda.to_device(pts3D_X)
			d_ptsY = cuda.to_device(pts3D_Y)
			d_ptsZ = cuda.to_device(pts3D_Z)
			
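			threadperblock = (2, 1)
			# Integer ceiling division, so a partial last block is still launched
			# (in Python 2.7, y / 2 already floors before math.ceil sees it)
			blockspergrid_x = (y + threadperblock[0] - 1) // threadperblock[0]
			blockspergrid_y = (x + threadperblock[1] - 1) // threadperblock[1]
			blockspergrid = (blockspergrid_x, blockspergrid_y)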
			getDistance[blockspergrid, threadperblock](depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy, d_ptsX, d_ptsY, d_ptsZ)
			#
			pts3D_X = d_ptsX.copy_to_host()
			pts3D_Y = d_ptsY.copy_to_host()
			pts3D_Z = d_ptsZ.copy_to_host()
			
			pts3D_X = np.reshape(pts3D_X, x*y)
			pts3D_Y = np.reshape(pts3D_Y, x*y)
			pts3D_Z = np.reshape(pts3D_Z, x*y)
			if len(pts3D_X) > 0:
				x = np.median(pts3D_X)
				y = np.median(pts3D_Y)
				z = np.median(pts3D_Z)
				distance = str(math.sqrt(x*x + y*y + z*z))
			else:
				distance = "??"
			color = self.colors[cl-1]
			cv2.rectangle(img, (x_min, y_min), (x_max, y_max), color, 2)
			txt_loc = (max(x_min+2, 0), max(y_min+2, 0))
			cls_name = self.cls_dict.get(cl-1)
			txt = '{} {:.2f}  {} meters'.format(cls_name, cf, distance)
			img = draw_boxed_text(img, txt, txt_loc, color)
		return img

Without CUDA I can run the code, but because of the distance computation from the depth image, the frame rate drops from 15 fps with no detected object to 5 fps when an object is detected.
That is why I want to compute the distance on the GPU.
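
For comparison, the same per-pixel arithmetic can be written as whole-array NumPy operations on the CPU. This is only a sketch: the function name `median_distance` is hypothetical, the crop is assumed to be already converted to meters, and it mirrors the kernel's index convention (first array axis for X, second for Y, using crop-local indices). Whether it is fast enough on the Nano would have to be measured.

```python
import numpy as np


def median_distance(depth_m, ppx, ppy, fx, fy):
    """Back-project a depth crop (meters) and return the distance to the
    per-axis median 3D point, or None when the crop is empty."""
    if depth_m.size == 0:
        return None
    # Index grids with the same shape as the crop: rows follow axis 0,
    # cols follow axis 1.
    rows, cols = np.indices(depth_m.shape, dtype=np.float32)
    # Same formulas as the kernel, applied to every pixel at once.
    pts_x = (rows - ppx) * depth_m / fx
    pts_y = (cols - ppy) * depth_m / fy
    mx = np.median(pts_x)
    my = np.median(pts_y)
    mz = np.median(depth_m)
    return float(np.sqrt(mx * mx + my * my + mz * mz))
```

Inside `draw_bboxes` it would be called as `median_distance(depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy)`, with a `None` result mapped to the `"??"` label.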

Does anyone have an idea what causes this numba error?
Thanks in advance.

AastaLLL · July 9, 2019, 6:39am

Hi,

CUDA_ERROR_INVALID_VALUE indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.

Here are two suggestions for you:

1. Please check if all the parameters in your app are valid.
You can get the hardware information with this sample:

$ /usr/local/cuda-10.0/bin/cuda-install-samples-10.0.sh .
$ cd NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

2. Please check whether your numba library is compiled for the correct GPU architecture. The Nano is sm_53 (compute capability 5.3).
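
As a rough illustration of the first suggestion, the launch parameters can be checked on the host before the kernel is called. The helper below is hypothetical and not part of numba. The limits are the usual ones for compute capability 5.3 and are an assumption here, so please confirm them against the deviceQuery output on your board.

```python
# Hypothetical host-side check of a CUDA launch configuration.
# Assumed limits for compute capability 5.3; verify with deviceQuery.
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)


def launch_config_problems(blockspergrid, threadsperblock):
    """Return a list of problems found; an empty list means the
    configuration is within the assumed limits."""
    problems = []
    checks = (
        ("grid", blockspergrid, MAX_GRID_DIM),
        ("block", threadsperblock, MAX_BLOCK_DIM),
    )
    for name, dims, limits in checks:
        for axis, (value, limit) in enumerate(zip(dims, limits)):
            if value < 1:
                # A zero or negative dimension is rejected by the driver.
                problems.append(
                    "%s dim %d is %d, must be >= 1" % (name, axis, value))
            elif value > limit:
                problems.append(
                    "%s dim %d is %d, limit is %d" % (name, axis, value, limit))
    # The product of the block dimensions is limited separately.
    threads = 1
    for value in threadsperblock:
        threads *= value
    if threads > MAX_THREADS_PER_BLOCK:
        problems.append(
            "%d threads per block, limit is %d" % (threads, MAX_THREADS_PER_BLOCK))
    return problems
```

Logging the result of `launch_config_problems(blockspergrid, threadperblock)` right before the `getDistance[...]` call should show whether a grid or block dimension is the invalid value.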

Thanks.

schoninger · July 18, 2019, 8:48am

Dear AastaLLL,

Thank you for your answer.
Indeed, I checked the indices I was using with respect to the CUDA grid, and as you said, one value was out of range, which led to the error.
Once again, thank you for your answer.

best
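
For anyone hitting the same error: the thread does not say which value was out of range, so the following is only one plausible case. If a bounding box produces an empty depth crop, the computed grid has a dimension of 0, and a kernel launch with a zero grid dimension is rejected with CUDA_ERROR_INVALID_VALUE. A minimal sketch of a guard, with hypothetical helper names `ceil_div` and `grid_for`:

```python
def ceil_div(n, d):
    """Integer ceiling division. Unlike int(math.ceil(n / d)), this does
    not depend on whether '/' floors (Python 2) or not (Python 3)."""
    return (n + d - 1) // d


def grid_for(shape, threadsperblock):
    """Return the blocks-per-grid tuple that covers `shape`, or None when
    the region is empty and the kernel launch should be skipped."""
    if any(extent <= 0 for extent in shape):
        return None
    return tuple(ceil_div(extent, threads)
                 for extent, threads in zip(shape, threadsperblock))
```

In `draw_bboxes` this would become `blockspergrid = grid_for(depthObj.shape, threadperblock)`, and the launch is skipped (falling back to the `"??"` label) when the result is `None`.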
