Python OpenCV - multiprocessing doesn't work with CUDA


I am trying to run CUDA ORB key-point detection with multiple GPUs. The idea is to split the list of video frames between the available GPU devices (loading them into GPU memory). However, when I run it with multiple threads via threading, I observe that each GPU slows down - I suppose this is caused by communication between the multiple GPUs and the single process in which all the threads run. Because of that, I tried the same thing using multiprocessing instead of threading, to exploit multiple CPU cores (assigning a different core to each GPU), but it gave me an error. Below I attach my test code:

import cv2
from threading import Thread
from multiprocessing import Process
from tqdm import tqdm

def cuda_test(gpu_id, idx_start, idx_end, frames):
    cuda_orb = cv2.cuda.ORB_create()
    for i in tqdm(range(idx_start,idx_end)):
        gray_frame = cv2.cuda.cvtColor(frames[i],cv2.COLOR_BGR2GRAY)
        kp,ds = cuda_orb.detectAndComputeAsync(gray_frame,None)

if __name__ == '__main__':
    img = cv2.imread('1.png')
    frames = [cv2.cuda_GpuMat(img) for x in range(1500)]
    print('\nMultithreading part: ')
    t1 = Thread(target=cuda_test,args=(0,0,len(frames),frames))
    t1.start()
    t1.join()
    print('\nMultiprocessing part: ')
    p1 = Process(target=cuda_test,args=(0,0,len(frames),frames))
    p1.start()
    p1.join()

The sample code above runs only a single thread/process, because I currently don’t have access to a machine with multiple GPUs. While the threading part runs successfully, the multiprocessing part doesn’t work and gives me this error:

Multithreading part: 
100%|██████████████████████████████████████| 1500/1500 [00:08<00:00, 166.72it/s]

Multiprocessing part: 
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/", line 315, in _bootstrap
  File "/usr/lib/python3.8/multiprocessing/", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "", line 419, in cuda_test
cv2.error: OpenCV(4.5.1) /home/kaczor/opencv/modules/core/src/cuda_info.cpp:73: error: (-217:Gpu API call) initialization error in function 'setDevice'

I see that there is a problem with OpenCV CUDA support for multiple processes created by multiprocessing. However, I am not able to find the reason why exactly it happens, or how to fix it… Does anyone have an idea how to efficiently split a task like this between multiple GPUs, avoiding the situation with multiple threads from threading where each GPU slows down?
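For reference, the split I have in mind is just an even chunking of the frame list, one index range per GPU (the helper name here is illustrative, not part of my actual code):

```python
# Illustrative helper: divide n_frames evenly into one (start, end)
# index range per GPU; the last GPU also takes any remainder.
def split_ranges(n_frames, n_gpus):
    chunk = n_frames // n_gpus
    ranges = []
    for g in range(n_gpus):
        start = g * chunk
        end = n_frames if g == n_gpus - 1 else start + chunk
        ranges.append((start, end))
    return ranges

print(split_ranges(1500, 4))  # [(0, 375), (375, 750), (750, 1125), (1125, 1500)]
```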

The reason why exactly it happens is that it is not possible to use CUDA in a child process created by fork() if CUDA has already been initialized in the parent process.

So the first step in fixing the problem might be to move any usage of cv2.cuda out of main, before creating the child processes. However, that might not be sufficient if importing cv2.cuda (by itself) initializes CUDA. It is possible to use CUDA in Python multiprocessing, but I don’t happen to know whether it is possible with cv2.cuda (the previous link suggests to me it is possible with a non-CUDA build of OpenCV).

Note that β€œgetting rid of CUDA initialization in main” probably also includes the removal of your multithreading test, prior to the multiprocessing test. Here’s a simplistic example:

$ cat
from threading import Thread
from multiprocessing import Process
from numba import cuda

def cuda_test(gpu_id, idx_start, idx_end, frames):
    cuda.select_device(gpu_id)   # any call that touches the CUDA driver

if __name__ == '__main__':

#    print('\nMultithreading part: ')
#    t1 = Thread(target=cuda_test,args=(0,0,1,0))
#    t1.start()
#    t1.join()

    print('\nMultiprocessing part: ')
    p1 = Process(target=cuda_test,args=(0,0,1,0))
    p1.start()
    p1.join()
$ python

Multiprocessing part:

My version of numba is nicely explicit. If I uncomment the multithreading part of the test case above, I get an error message like this:

numba.cuda.cudadrv.error.CudaDriverError: CUDA initialized before forking

Thank you for the explanation.