CUDA context over multiprocessing or threading with 3 cameras

Hi,

I would like to know if anyone has managed to get a fast producer/consumer pattern working for camera capture plus a CUDA context for AI image recognition.

My problem is that I have 3 cameras that should be analysed at 60 FPS. Disregard whether that is even feasible given the AI processing time; the point is that I can't have latency or lag while recording to disk.
So I use a GStreamer pipeline to pick up each camera and perform the AI. The result is put into a list with three slots (0, 1, 2), each containing an image and its index.
These are then picked up by the visualization thread, which shows them on screen. Most of the time it works, but now and then I get FPS latency. Even after turning off the AI analysis and only recording images to disk and showing them on screen, it doesn't work. My fear is that the GIL is interfering with the image transfer through the shared memory I use. So either I am handling the memory the wrong way, or I am lucky sometimes and the synchronization between the threads happens to work.

My idea was then to split camera → AI → screen into three separate processes instead of threads. Is it doable to pass the CUDA context into the three processes and perform the AI analysis there?

In my current program I pass the cuda.context() as a parameter to the camera threads that run the AI. It warns that I can't share it across multiple… (I don't remember the error in full). It seems to almost work.

Also… I have tried to figure out the thread-locking mechanism for the list, but I can't tell whether I need an explicit lock/unlock around it, or whether it is somehow safe on its own like queue.Queue.
I have a class containing the list; in my main I declare the list, then pass it as a parameter when I start the producer threads and the consumer thread. The visualization reads from each list position depending on the camera (0, 1, or 2) and shows it on screen.
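(For what it's worth, a plain Python list has no built-in locking the way queue.Queue does, so every read and write of a slot needs to be guarded explicitly. A minimal sketch of a lock-guarded three-slot buffer, with all names hypothetical:)

```python
import threading

class FrameSlots:
    """Latest-frame buffer with one slot per camera, guarded by a lock.

    Unlike queue.Queue, a plain list does not synchronize itself, so
    the lock must be held around every access.
    """
    def __init__(self, n_cameras=3):
        self._lock = threading.Lock()
        self._slots = [None] * n_cameras

    def put(self, cam_index, frame):
        # Overwrite the slot so the consumer always sees the newest frame.
        with self._lock:
            self._slots[cam_index] = (cam_index, frame)

    def get(self, cam_index):
        with self._lock:
            return self._slots[cam_index]

slots = FrameSlots()
slots.put(1, "frame-data")
print(slots.get(1))  # -> (1, 'frame-data')
print(slots.get(0))  # -> None (camera 0 has not produced yet)
```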

I am at a crossroads and can't figure out where to go. From the reference below, I understand that it is impossible to share the same CUDA context across multiple processes in Python. Should I instead create three separate contexts on the same GPU? Is that even possible?

I hope I make sense in this… :D
Continuing the discussion from Camera.capture getting stuck when using multiprocessing:

Hi @magnus.gabell, I can’t really speak to the intricacies of using multiprocessing with CUDA and camera acquisition in Python, but perhaps some other community members may be able to share their experiences and knowledge.

My personal take is that DeepStream is optimized for multi-camera capture and has Python APIs, or for high-performance multithreaded applications I would typically use C++.

Thanks,

I would also like to transition to C++, but I have to stick with Python :-|

I have decided to try the threaded version after all. I hope that I can get the CUDA context into all three threads, run inference there, and leave out all the shared memory and queues and whatnot. If that works, I think I will know that it is the GIL that is messing with me. I often get strange behavior.
Question still, though: if I don't close the CUDA context properly, can I get memory leaks or resource issues from that, like slow performance etc., or will it be reset the next time I start my Python application?

Do you have a threaded version working that doesn’t use CUDA as a proof-of-concept? i.e. passing the images around as Numpy arrays or whatnot. That could help you iron out the queueing/locking issues and isolate it from the CUDA stuff.
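As a sketch of that kind of CUDA-free proof of concept (all names are placeholders, and the camera reads are simulated with NumPy arrays): three producer threads push frames into a bounded queue.Queue, which handles all the locking internally, and one consumer drains it:

```python
import queue
import threading
import numpy as np

def producer(cam_index, q, n_frames=5):
    # Simulate a camera: in the real app this would be a capture call.
    for i in range(n_frames):
        frame = np.zeros((4, 4), dtype=np.uint8) + i
        q.put((cam_index, i, frame))
    q.put(None)  # sentinel: this camera is done

def consumer(q, n_producers, results):
    done = 0
    while done < n_producers:
        item = q.get()
        if item is None:
            done += 1
            continue
        cam_index, i, frame = item
        # Stand-in for AI/visualization work on the frame.
        results.append((cam_index, i, int(frame.mean())))

q = queue.Queue(maxsize=8)  # bounded, so producers back off if consumer lags
results = []
producers = [threading.Thread(target=producer, args=(c, q)) for c in range(3)]
cons = threading.Thread(target=consumer, args=(q, 3, results))
for t in producers:
    t.start()
cons.start()
for t in producers:
    t.join()
cons.join()
print(len(results))  # -> 15 (3 cameras x 5 frames each)
```

If this runs smoothly at the target rate with real NumPy frames, the queueing logic is sound and any remaining stalls point at the CUDA side.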

It should automatically be closed after the process has exited; the kernel drivers track the resources and will release them.

Thanks. I tried to remove all the AI parts.
This is where I was a bit unscientific, unfortunately. I created a base program with a queue and a producer/consumer pattern. That works fine with one camera, but on my first try I got too many images in the queue, so movement obviously appeared on screen some time after it happened. Then I started building a new program from scratch and committed each step: first start the threads and read an image, then add cv2.imshow, then finally add the AI. Camera → AI → screen all happens in one thread, so I removed the memory sharing. This seems to work fine now. If I can, I will create a queue and separate the camera/AI from the visualization. BUT now I get

(python3:13176): GLib-GObject-CRITICAL **: 08:56:33.817: g_object_ref: assertion ‘G_IS_OBJECT (object)’ failed

I have seen this mentioned in other threads, but not in relation to the Jetson NX and TensorRT/CUDA. It happens after maybe 10-15 seconds. I will keep you posted.
Perhaps I will close this thread and do as you say: build a proof of concept for the threading, then open another thread for the error.

OK, gotcha. If you are using cv2.imshow(), in my experience that is not very high-performance (but it is easy to use for a quick visualization). I take it that you still have each camera running in its own thread. Are you using cv2.VideoCapture() or GStreamer directly? I'm not exactly sure what would cause that GStreamer error; does that camera fail to capture after it happens?

I still have three threads, one per camera.
So in each thread it goes like this:
CAM → AI → Visualization, all sequential.

I use cv2.VideoCapture(gst_string, cv2.CAP_GSTREAMER)
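For reference, a pipeline string for a Jetson CSI camera often looks something like the sketch below. The element names (nvarguscamerasrc, nvvidconv) and caps are assumptions for an Argus CSI sensor, so adjust them to your actual camera, resolution, and JetPack version:

```python
def gst_pipeline(sensor_id=0, width=1600, height=1300, fps=60):
    # nvarguscamerasrc / nvvidconv are Jetson-specific elements (assumed here);
    # BGR output is what cv2.VideoCapture expects from an appsink.
    return (
        f"nvarguscamerasrc sensor-id={sensor_id} ! "
        f"video/x-raw(memory:NVMM), width={width}, height={height}, "
        f"framerate={fps}/1 ! "
        "nvvidconv ! video/x-raw, format=BGRx ! "
        "videoconvert ! video/x-raw, format=BGR ! "
        "appsink drop=true max-buffers=1"
    )

# Usage (one capture per camera thread):
#   cap = cv2.VideoCapture(gst_pipeline(sensor_id=0), cv2.CAP_GSTREAMER)
print(gst_pipeline())
```

Setting drop=true with max-buffers=1 on the appsink makes GStreamer discard frames the application has not read yet, which trades dropped frames for low latency.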

When the error occurred, everything crashed back to the terminal.

Hi, do you have a faster way than imshow? I have tried to get three cameras working and
I get every issue in the world. Either it is lagging, or I get multi-thread exceptions because I didn't use a thread lock. Then I get XInitThreads errors, so I tried

import ctypes
# Tell Xlib up front that multiple threads will be making X11 calls
ctypes.CDLL('libX11.so.6').XInitThreads()

But this just slows everything down.
I tried queue.Queue and that also creates lag.
I just don't understand how to get three cameras at 60 FPS, 1600x1300, in real time in Python.
It seems impossible.
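One common way to tame queue-induced lag, as a sketch with hypothetical names: bound the queue at a single frame and overwrite the stale frame instead of letting latency build up. The visualization then always shows the newest frame, at the cost of dropping frames when it falls behind:

```python
import queue

def put_latest(q, frame):
    """Put a frame, discarding the stale one if the consumer is behind.

    Safe as long as each queue has a single producer thread.
    """
    try:
        q.put_nowait(frame)
    except queue.Full:
        try:
            q.get_nowait()   # drop the old frame...
        except queue.Empty:
            pass
        q.put_nowait(frame)  # ...and keep only the newest

q = queue.Queue(maxsize=1)
put_latest(q, "frame-1")
put_latest(q, "frame-2")  # frame-1 is silently dropped
print(q.get())  # -> frame-2
```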

@magnus.gabell I would recommend looking into using DeepStream for this as it is optimized for multi-camera high-throughput applications and has Python APIs. You can find the DeepStream Python samples here:

I will look into it. BUT the problem is that I run TensorFlow 1.15 with a TensorRT model of a ResNet-50, so I can't update my NX to anything higher; that would totally mess up my setup. But I will check out DeepStream, since it seems to be the thing when reading about these topics.

thanks

Hi,
Trying to install DeepStream 6.0.0.1-1, but I get an error that libnvvpi1 is missing. After cleaning out the DeepStream install and trying to install the library, it still can't be found. I am trying to follow the installation procedure from NVIDIA, but no luck.

//Magnus

I would make sure that the version of DeepStream that you are using is compatible with and intended for the version of JetPack-L4T that you are running, or use the deepstream-l4t container. You can also post your issue on the DeepStream forums so they can help you out with that.

I think this is caused by threads colliding while reading from the same memory.
Proper thread management solves it.

Secondly, I found that reading keys from the terminal in the main program collides with the imshow window, and the synchronization messes up the speed and causes camera lag.

I will create a producer consumer example and add here.

Okay, certainly you are correct: if you weren't using locks/mutexes to guard the memory when multiple threads were accessing it, that would likely be an issue. These multithreading issues tend to present themselves non-deterministically, and their behavior can vary from run to run as the OS schedules the threads differently each time (i.e. depending on processor load, context switching, etc.).
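To make the need for the lock concrete, here is a minimal sketch (not from the original application): several threads doing a read-modify-write on shared state. The result is deterministic only because the lock serializes the increments; without it, interleaved updates can be silently lost.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        # Without the lock, this read-modify-write could interleave with
        # another thread's and lose increments nondeterministically.
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # -> 400000, every run, because the lock guards the update
```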

If you make progress on your application, let us know - in the meantime, best of luck!

Thank you! I will post my stripped-down solution when I have it done, and also on Stack Overflow.