Performance on PCI-e USB card

I happened to have a 4-channel USB 3.0 PCI-e x4 expansion card, SSU_TECH su_u3208.v1. I couldn't get it to run on a Linux PC (it supports Windows only), but surprisingly it runs quite well on the AGX Xavier.
I have a C++ application using OpenCV VideoCapture() which is well threaded to run and process 4 cameras simultaneously. I plugged in 4 USB 2.0 AR0144 (1MP) cams, but I get only 35 FPS @ power mode ID 7 and 32 FPS @ power mode ID 3, versus 60 FPS @ power mode ID 0. Obviously, the CPU frequency is essential, which raises the question of the maximum performance that can be obtained with 4 cams in this setup. For example, can the AGX Xavier run 4 AR0234 2MP cams @ 120 FPS via a PCI-e card (assuming the PCI-e x8 Gen4 bandwidth is not the bottleneck)? Does MIPI improve the performance significantly (keeping in mind that CPU power is required), and can it achieve 4 cams at 2MP @ 120 FPS? Finally, OpenCV VideoCapture() obviously introduces some overhead. What is the cheapest way to capture in C++?

Jetson AGX offers the PCIe spec-defined bandwidth. You mentioned that the card is a PCIe x4 card, but you didn't give the maximum speed the card supports. So, based on your requirement and the speed supported by the card, the math should make it obvious whether your requirement can be satisfied or not.

Hi @vidyas

The card supports a 5 Gbps max speed. I intentionally didn't include this, because I know how to do the math, and I assume I can exchange it for a faster card.

Looking at this specific card and setup, the 4 AR0144 cams theoretically consume 2.7 Gbps, and I don't understand why I observe 35 FPS @ power mode ID 7 and 32 FPS @ power mode ID 3. Can you elaborate on this, please?
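(For reference, that figure assumes uncompressed 12-bit 1280×720 @ 60 fps: 1280 × 720 × 12 bit × 60 fps × 4 cams ≈ 2.65 Gbps, before USB protocol overhead.)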

Back to my previous questions - should I observe any difference between MIPI cams and PCI-e cams, assuming no bottleneck at the PCI-e slot?

What is the fastest way to capture frames in C++?

Finally, this is a technical question - I know from @JerryChang that

there are two threads for frame capture: one thread enqueues the sensor frame into the capture buffer, another thread dequeues it for user space.

and I am doing something similar in my code - I have a producer-consumer design pattern populating a circular queue with frames and processing each frame content using a thread pool:

// producer: dedicated capture thread for this camera
thread masterCameraThread(&Module_Camera::read, ref(masterCamera), ref(headPtr), ref(cvr1), ref(mr1), ref(rr1), ref(pr1), ref(quit));

// consumer: processing job dispatched to the thread pool
pool.AddJob([&, hptr](size_t taskIndex) mutable { masterCamera.processFrame(hptr); });
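
For context, here is a minimal sketch of the bounded circular-queue idea these two lines plug into. The names (FrameQueue, Frame) are illustrative, not from my actual code, which passes its own condition variables and mutexes (cvr1, mr1) as seen above:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

struct Frame { /* image data, timestamp, camera id, ... */ };

class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {}

    // producer side: blocks while the queue is full
    void push(Frame f) {
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push(std::move(f));
        notEmpty_.notify_one();
    }

    // consumer side: blocks while the queue is empty
    Frame pop() {
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        Frame f = std::move(q_.front());
        q_.pop();
        notFull_.notify_one();
        return f;
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::queue<Frame> q_;
    std::size_t capacity_;
};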

Now, the problem is that I have 4 cams, which use quite a lot of threads given the limit of 8 physical cores on the AGX Xavier. Obviously, that might cause performance issues, and there is room for optimization here. I am trying to figure out the best approach I can follow based on how the AGX Xavier works. Can you please advise?

I'm sorry, but I don't know much about these 'power mode ID 7' and 'power mode ID 3' settings.
I’ll loop in someone who knows about the camera stuff.

Thanks! I am talking about Power Mode 0 - 7 from the Welcome — Jetson Linux Developer Guide 34.1 documentation.

Regarding my last question - perhaps it would be a good idea to update vi5_fops.c like this:

static int tegra_channel_kthread_capture_dequeue(void *data)
{
        while (1) {
...
                        buf = dequeue_dequeue_buffer(chan);
                        if (!buf)
                                break;

                        /* proposed insertion: process the frame in this kernel
                         * dequeue thread (hptr is a stand-in from my user-space
                         * code, shown only to illustrate the idea) */
                        processFrame(hptr);

                        vi5_capture_dequeue(chan, buf);
                }

However, I see no synchronization between cams here - which I do on my side. Can you please advise on the best way to implement this synchronization - i.e., trigger all cams together? Thanks!

hello nouuata,

jumping into this thread, may I know what's the actual use case for having 4-cam synchronization?
it's suggested to use both hardware and software approaches to achieve the synchronization use case.
please refer to Keeping camera synchronization at software level - #5 by JerryChang as a see-also.
thanks

Hi JerryChang,

Thanks for your reply! The use case is high-FPS active stereo vision using multiple cameras and light sources.

I don't have a hardware background, but according to my research on the NVIDIA partner solutions, hardware synchronization reduces the FPS significantly. This is not acceptable for the problem I am solving, so I am looking at high-FPS software synchronization. The goal is for each cam to fill up its buffer of frames and then use software synchronization (or even frame approximation if needed) to recover stereo properties of the objects of interest. Thanks for pointing me to syncSensor and getSensorTimestamp(). I will try to understand how to sync this approach with the light sources and will raise questions if needed.
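
To make the frame-approximation idea concrete, here is a sketch of the timestamp-matching step I have in mind - types are illustrative only, and the timestamps would come from getSensorTimestamp() or the V4L2 buffer timestamp, depending on the capture path:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

struct Frame {
    int64_t timestampUs;   // capture timestamp in microseconds
    // ... image data ...
};

// Return the index of the frame in 'buffer' nearest in time to 'refTsUs'.
// Assumes 'buffer' is sorted by timestamp (i.e. capture order).
std::size_t nearestFrame(const std::vector<Frame>& buffer, int64_t refTsUs) {
    auto it = std::lower_bound(buffer.begin(), buffer.end(), refTsUs,
        [](const Frame& f, int64_t ts) { return f.timestampUs < ts; });
    if (it == buffer.begin()) return 0;
    if (it == buffer.end())   return buffer.size() - 1;
    auto prev = std::prev(it);
    return (refTsUs - prev->timestampUs <= it->timestampUs - refTsUs)
           ? static_cast<std::size_t>(prev - buffer.begin())
           : static_cast<std::size_t>(it - buffer.begin());
}

For each frame of a reference camera, this picks the closest frame from every other camera's buffer; the residual time offset can then feed the approximation step.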

The project requires cameras on 1-meter cables, so I am trying to stay away from MIPI cameras at the moment unless there is a significant benefit to using MIPI (SerDes) instead of USB. I am a bit concerned about whether I can make 4 cams @ 120 FPS @ 1MP work over PCI-e, due to the observation I mentioned above, which I don't understand:

I get only 35 FPS @ power mode (nvpmodel) 7, 32 FPS @ power mode 3 and 60 FPS @ power mode 0 (MAXN)

Also, should I prefer the Argus API over OpenCV? I am looking for the least overhead.

hello nouuata,

you cannot use the Argus API to access USB cameras;
please check the Camera Architecture Stack. libargus supports only MIPI (Bayer) sensors.

may I know which JetPack release you’re using?
please share the command line, and also, what's the frame-rate capability reported by gstreamer?
furthermore, please try to exclude multiple-camera access; for example, do you get a better frame-rate result in a single-camera use case?
thanks

Hi JerryChang,

may I know which JetPack release you’re using?

JetPack 4.4.1

please share the command line, and also, what's the frame-rate capability reported by gstreamer?

v4l2-ctl --set-fmt-video=width=1280,height=720,pixelformat=RG12 --stream-mmap -d /dev/video0 --stream-count=600 --stream-to=v4l2.rggb
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 60.00 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 60.00 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.80 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.85 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.88 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.83 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.85 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.87 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.82 fps
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 59.84 fps

furthermore, please try to exclude multiple-camera access; for example, do you get a better frame-rate result in a single-camera use case?

Yes, switching off cams improves the FPS. For example, two cams can do 60 FPS in any power mode I tried.

The PCIe card can do 5 Gbps per channel, so 20 Gbps total. This should be enough for 4x AR0234 @ 2.4MP @ 120 fps, unless there is a bottleneck at the CPU, which is what I am trying to figure out.
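
(Rough math, assuming uncompressed 12-bit 1920×1200 output for the AR0234: 1920 × 1200 × 12 bit × 120 fps ≈ 3.3 Gbps per cam, ~13.3 Gbps for four. That is within the 20 Gbps total, though per cam it is close to the ~4 Gbps usable payload of a 5 Gbps USB 3.0 link after 8b/10b encoding, so each channel is tight but within spec on paper.)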

This is what the code does:

  • Init 4 threads, each waiting for a notification to retrieve the frame from its camera
  • Init a thread pool waiting for notifications to process frames
  • Loop:
    ◦ the main thread grabs frames from all cams and notifies the retrieve threads
    ◦ the 4 retrieve threads fetch the frame from their cam and notify the main thread
    ◦ the main thread sends jobs to the pool to process the retrieved frames and continues

For the purpose of testing, processFrame() currently does nothing. I assume I can offload the processFrame jobs to the GPU, so that part is fine.
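
For reference, a minimal single-threaded sketch of the grab/retrieve split described above, using the OpenCV VideoCapture API - camera indices are illustrative, and my real code notifies the retrieve threads instead of calling retrieve() inline:

#include <opencv2/videoio.hpp>
#include <vector>

int main() {
    std::vector<cv::VideoCapture> cams;
    for (int i = 0; i < 4; ++i)
        cams.emplace_back(i, cv::CAP_V4L2);   // /dev/video0 .. /dev/video3

    std::vector<cv::Mat> frames(4);
    for (;;) {
        // grab() only latches a frame, so the four grabs land as close
        // together in time as possible
        for (auto& c : cams) c.grab();
        // retrieve() then does the heavier decode/copy per camera
        for (int i = 0; i < 4; ++i) cams[i].retrieve(frames[i]);
        // ... hand the frames off for processing ...
    }
}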

The measured performance is:

  • nvpmodel 3:

all fps: 35.8295 / 60
all fps: 35.8166 / 60
all fps: 35.6761 / 60

  • nvpmodel 7:

all fps: 57.4713 / 60
all fps: 58.548 / 60
all fps: 58.1058 / 60

  • nvpmodel 0:

all fps: 59.4884 / 60
all fps: 59.8802 / 60
all fps: 59.8802 / 60

4 cores @ 2188 MHz (nvpmodel -m 7) perform better than 8 cores @ 1200 MHz (nvpmodel -m 3), which raises my concerns.

I am trying to understand why this is happening in the first place, and to figure out whether I can obtain 4x 2.4MP @ 120 fps through PCIe USB or whether I should be looking at MIPI.

hello nouuata,

since you're using a 120-fps sensor,
could you please exclude the --stream-to option to check the sensor stream capability.
for example,
v4l2-ctl --set-fmt-video=width=1280,height=720,pixelformat=RG12 --stream-mmap -d /dev/video0 --stream-count=600

Hi JerryChang,

It turned out the limited frame rate is due to the video compression on the camera modules - these modules support MJPG only. I figured it out. Thanks!

One final question - AGX Xavier has a HEVC decoder. Can it be used with USB cams?

Hi,

There is no existing sample for this use case, but it should work by integrating 12_camera_v4l2_cuda + 00_video_decode. 12_camera_v4l2_cuda demonstrates capturing an MJPEG stream through V4L2 and doing JPEG decoding. 00_video_decode demonstrates video decoding.
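
Below is a minimal sketch of the capture half using the standard V4L2 MMAP streaming API - device path, resolution, and buffer count are illustrative, and error handling and cleanup are omitted; the dequeued buffer would be handed to the decoder as in the samples above:

#include <fcntl.h>
#include <linux/videodev2.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/video0", O_RDWR);

    v4l2_format fmt = {};
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 1280;
    fmt.fmt.pix.height = 720;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_MJPEG;
    ioctl(fd, VIDIOC_S_FMT, &fmt);

    v4l2_requestbuffers req = {};
    req.count = 4;
    req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    ioctl(fd, VIDIOC_REQBUFS, &req);

    void* bufs[4];
    for (unsigned i = 0; i < req.count; ++i) {
        v4l2_buffer b = {};
        b.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        b.memory = V4L2_MEMORY_MMAP;
        b.index = i;
        ioctl(fd, VIDIOC_QUERYBUF, &b);
        bufs[i] = mmap(nullptr, b.length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, b.m.offset);
        ioctl(fd, VIDIOC_QBUF, &b);               // queue buffer for capture
    }

    int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    ioctl(fd, VIDIOC_STREAMON, &type);

    for (int n = 0; n < 600; ++n) {
        v4l2_buffer b = {};
        b.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        b.memory = V4L2_MEMORY_MMAP;
        ioctl(fd, VIDIOC_DQBUF, &b);              // blocks until a frame arrives
        // bufs[b.index] now holds one complete JPEG of b.bytesused bytes;
        // hand it to the hardware decoder here (see 00_video_decode)
        ioctl(fd, VIDIOC_QBUF, &b);               // recycle the buffer
    }

    ioctl(fd, VIDIOC_STREAMOFF, &type);
    close(fd);
}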