Hi
I would like to use Numba with Python to accelerate part of my code (a matrix computation that retrieves the distance of each detected object's bounding box, where the detections come from TensorFlow).
But every time I run my code, after a few seconds I get an error from Numba, and I am not sure where it comes from.
I have tried decreasing the block size and the number of threads per block in my CUDA kernel launch, but the problem is still there.
Is it possible that this happens because the GPU is too small to be used by TensorFlow and Numba at the same time?
As the log shows, TensorFlow created its device with 266 MB, and only 864 MB was free.
Here is the error output I get:
totalMemory: 3.86GiB freeMemory: 864.35MiB
2019-07-01 17:01:30.047620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-01 17:01:36.958809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-01 17:01:36.958920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-01 17:01:36.958967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-01 17:01:36.959312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 266 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:add pending dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 772636 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 169824 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
INFO:numba.cuda.cudadrv.driver:dealloc: cuMemFree_v2 189288 bytes
...
...
ERROR:numba.cuda.cudadrv.driver:Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
Traceback (most recent call last):
File "realSense_tf_trt.py", line 183, in <module>
main()
File "realSense_tf_trt.py", line 173, in main
loop_and_detect(cam, tf_sess, DEFAULT_THRS, vis, od_type=od_type)
File "realSense_tf_trt.py", line 110, in loop_and_detect
img = vis.draw_bboxes(img, depth, box, conf, cls, intri)
File "/home/enroutenano/thirdPartyLib/TFTRT/faceDetec_trt/enroute_utils/visualization.py", line 173, in draw_bboxes
getDistance[blockspergrid, threadperblock](depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy, d_ptsX, d_ptsY, d_ptsZ)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 808, in __call__
cfg(*args)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 538, in __call__
sharedmem=self.sharedmem)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/compiler.py", line 612, in _kernel_call
cu_func(*kernelargs)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 1517, in __call__
self.sharedmem, streamhandle, args)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 1561, in launch_kernel
None)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 293, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/home/enroutenano/.virtualenvs/p2_deepLearning/local/lib/python2.7/site-packages/numba/cuda/cudadrv/driver.py", line 328, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
And here is my CUDA kernel declaration:
@cuda.jit
def getDistance(depthImg, ppx, ppy, fx, fy, pts3D_x, pts3D_y, pts3D_z):
    #x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    posX = tx + bx * bw
    posY = ty + by * bw
    if posX < depthImg.shape[0] and posY < depthImg.shape[1]:
        pts3D_x[posX,posY] = (posX - ppx) * depthImg[posX,posY] / fx
        pts3D_y[posX,posY] = (posY - ppy) * depthImg[posY, posY] / fy
        pts3D_z[posX,posY] = depthImg[posX, posY]
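For reference, the per-pixel arithmetic the kernel is meant to perform can be sketched in plain Python, which may help when checking the GPU results against a CPU version. This is only a sketch under assumptions: `deproject_pixel` and `deproject_image` are hypothetical helper names, and the sketch assumes all three outputs are meant to read the same element `depthImg[posX, posY]` (the kernel as posted reads `depthImg[posY, posY]` for the Y coordinate).

```python
def deproject_pixel(row, col, depth_m, ppx, ppy, fx, fy):
    """Back-project one depth pixel to a 3D point (pinhole model).

    Mirrors the kernel's pairing: the first index goes with ppx/fx,
    the second index goes with ppy/fy.
    """
    x = (row - ppx) * depth_m / fx
    y = (col - ppy) * depth_m / fy
    z = depth_m
    return x, y, z


def deproject_image(depth, ppx, ppy, fx, fy):
    """Apply deproject_pixel to every element of a 2D list of depths."""
    return [
        [deproject_pixel(r, c, d, ppx, ppy, fx, fy) for c, d in enumerate(row)]
        for r, row in enumerate(depth)
    ]
```

Running this on a small depth patch and comparing against the arrays copied back from the device would show whether the kernel's indexing does what is intended.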
And here is the function where I call the CUDA kernel; it runs once for each object detected by TensorFlow.
def draw_bboxes(self, img, depth, box, conf, cls, intri):
    """Draw detected bounding boxes on the original image."""
    for bb, cf, cl in zip(box, conf, cls):
        cl = int(cl)
        y_min, x_min, y_max, x_max = bb[0], bb[1], bb[2], bb[3]
        depthObj = depth[x_min:x_max,y_min:y_max].astype(np.float32)
        depthObj = depthObj * 0.001
        y, x = depthObj.shape
        pts3D_X = np.empty_like(depthObj)
        pts3D_Y = np.empty_like(depthObj)
        pts3D_Z = np.empty_like(depthObj)
        #cuda memory
        d_ptsX = cuda.to_device(pts3D_X)
        d_ptsY = cuda.to_device(pts3D_Y)
        d_ptsZ = cuda.to_device(pts3D_Z)
        threadperblock = (2, 1)
        blockspergrid_x = int(math.ceil(y / 2))
        blockspergrid_y = int(math.ceil(x / 1))
        blockspergrid = (blockspergrid_x, blockspergrid_y)
        getDistance[blockspergrid, threadperblock](depthObj, intri.ppx, intri.ppy, intri.fx, intri.fy, d_ptsX, d_ptsY, d_ptsZ)
        #
        pts3D_X = d_ptsX.copy_to_host()
        pts3D_Y = d_ptsY.copy_to_host()
        pts3D_Z = d_ptsZ.copy_to_host()
        pts3D_X = np.reshape(pts3D_X, x*y)
        pts3D_Y = np.reshape(pts3D_Y, x*y)
        pts3D_Z = np.reshape(pts3D_Z, x*y)
        if len(pts3D_X) > 0:
            x = np.median(pts3D_X)
            y = np.median(pts3D_Y)
            z = np.median(pts3D_Z)
            distance = str(math.sqrt(x*x + y*y + z*z))
        else:
            distance = "??"
        color = self.colors[cl-1]
        cv2.rectangle(img, (x_min, y_min), (x_max, y_max), color, 2)
        txt_loc = (max(x_min+2, 0), max(y_min+2, 0))
        cls_name = self.cls_dict.get(cl-1)
        txt = '{} {:.2f} {} meters'.format(cls_name, cf, distance)
        img = draw_boxed_text(img, txt, txt_loc, color)
    return img
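One thing that may be worth checking in the launch configuration above is whether a grid dimension can end up as 0. Under Python 2.7, `y / 2` on integers is already floor division, so `math.ceil` has no effect there, and a crop with a single row, or an empty crop, gives `blockspergrid_x == 0`. As far as I understand, `cuLaunchKernel` rejects a zero grid dimension with `CUDA_ERROR_INVALID_VALUE`, but I have not confirmed that this is the cause here. A minimal sketch of a guard, where `grid_for` is a hypothetical helper name and not part of Numba:

```python
def grid_for(shape, threads_per_block):
    """Return blocks-per-grid covering `shape`, or None if there is nothing to launch.

    Uses integer ceiling division, which behaves the same on Python 2 and 3.
    """
    if any(n <= 0 for n in shape):
        return None  # empty crop: the caller should skip the kernel launch
    return tuple((n + t - 1) // t for n, t in zip(shape, threads_per_block))
```

In `draw_bboxes` this could replace the three `blockspergrid` lines with `blockspergrid = grid_for(depthObj.shape, threadperblock)`, skipping the kernel call and setting `distance = "??"` when the result is `None`.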
Without CUDA I can run the code, but because of the distance computation on the depth image, the frame rate drops from 15 fps with no detected object to 5 fps when an object is detected.
That is why I want to compute the distance on the GPU.
Does anyone have an idea what causes this Numba error?
Thanks in advance.