OpenCV application uneven frame times

Hi,
I am developing an OpenCV-based application that reads frames from a connected camera, does some OpenCV processing, and displays the feed on the screen. As simple as that.

Now my problem is that the video feed is not always smooth: from time to time it lags behind and then jumps back into sync. I don’t really understand how an OpenCV app can lag behind - I thought cv::VideoCapture would always read the latest frame, so if anything it could get choppy if the device doesn’t have enough performance, but how can it lag half a second behind?

Anyway, most of the time it runs in sync, but there are these glitches every, say, 5 seconds or so.

Now I added some debug output. Most frames take about 4 ms to compute, but at times a frame takes about 15 ms, which is probably where it gets out of sync.

Could it be some background process causing the spikes? Not sure which, though, as this is a vanilla JetPack 4.6; I haven’t installed anything else.

Also I am using UMats everywhere I can to benefit from OpenCL.

I have also tried switching between the performance profiles. The 6-core 20W profile seems to give the most reliable performance, which surprised me, as I don’t think there’s anything in the app that would benefit from multiple cores. On the other hand, a higher clock - which is what the 2-core 20W profile offers - should be beneficial, but in fact it isn’t. Well, sort of: the frames get computed faster, but the out-of-sync issue seems to happen more often.

Thanks for any tips you can give me!
Jan

Hi,
For information, do you use the v4l2src or nvarguscamerasrc plugin for the camera source? I would like to know what the camera type is.

Hi, I am not sure. I open the camera simply as:

        cv::VideoCapture cap;
        int deviceID = 0;             // 0 = open default camera
        int apiID = cv::CAP_ANY;      // 0 = autodetect default API
        // open selected camera using selected API
        cap.open(deviceID, apiID);

The camera is a Boson FLIR thermal camera.

I can do a minimal example if that helps.

Hi,
With that code, OpenCV should use V4L2 capture. Probably the CPU load is sometimes too heavy and impacts overall performance. You can execute sudo tegrastats to get the system status.

Not sure if there is an improvement, but you can try putting a GStreamer pipeline in cv2.VideoCapture():
V4l2src using OpenCV Gstreamer is not working in Jetson Xavier NX - #3 by DaneLLL

Some things to note from personal experience:

  1. OpenCV default backends usually buffer the stream, so no, it doesn’t always return the newest frame. That is why you should run the capture in a separate thread; that way the input is always consumed even if the rest of the code slows down (see the sketch after this list). If you use the GStreamer backend (you can put a GStreamer pipeline in the cv::VideoCapture constructor), there are also options to drop frames there.
  2. Jetson doesn’t really support OpenCL, so UMat probably doesn’t do anything. If you want more performance you will need either VPI or the OpenCV CUDA backend (which requires recompiling OpenCV, as the default OpenCV shipped with JetPack isn’t compiled with CUDA, for no apparent reason).
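
A minimal sketch of the threaded-capture pattern (illustrative only, not from any particular project):

    // The reader thread keeps draining the camera so the backend buffer
    // cannot fill up; the main loop only ever touches the newest frame.
    #include <opencv2/opencv.hpp>
    #include <atomic>
    #include <mutex>
    #include <thread>

    int main()
    {
        cv::VideoCapture cap(0);
        if (!cap.isOpened())
            return 1;

        std::mutex mtx;
        cv::Mat latest;
        std::atomic<bool> running{true};

        std::thread reader([&] {
            cv::Mat frame;
            while (running && cap.read(frame)) {
                std::lock_guard<std::mutex> lock(mtx);
                frame.copyTo(latest);          // overwrite with the newest frame
            }
            running = false;                   // camera stopped delivering
        });

        cv::Mat work;
        while (running) {
            {
                std::lock_guard<std::mutex> lock(mtx);
                if (latest.empty())
                    continue;                  // nothing captured yet
                latest.copyTo(work);
            }
            // ... heavy per-frame processing on 'work' goes here ...
            cv::imshow("feed", work);
            if (cv::waitKey(1) == 27)          // ESC exits
                running = false;
        }
        reader.join();
        return 0;
    }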

Thanks to both. Will try the suggestions.

But wait a second, Dalus, are you sure there’s no OpenCL support? Some time back I had this thread regarding up-sampling performance:

Back then I came to the conclusion that OpenCL helped me get over my performance problems. But maybe what I was seeing was the effect of multiple CPU cores working on it in parallel. Could that be? Say I call cv::resize and assume there’s no OpenCL HW acceleration. It would still try to use multiple CPU cores, wouldn’t it?

But what about displaying the image in a larger-resolution window - would that up-sampling use just a single CPU core? That could explain a lot…

Sigh… Why can’t nVidia just bundle their OpenCV with CUDA support? People are complaining about it all the time.

Hi,
On Jetson platforms, we would suggest trying VPI. Please check the explanation in this topic:
Trying to get OpenCV (built with CUDA) working with FFMPEG - #6 by DaneLLL

For cv::resize, the equivalent operation is supported by the hardware converter (VIC). Please try the nvvidconv plugin in gstreamer, or NvBufferTransform() in jetson_multimedia_api.

Yeah, but VPI doesn’t support most of the OpenCV algorithms, and won’t in the near future. And I think NVIDIA wrote most of the CUDA backend for OpenCV anyway. Someone from NVIDIA previously said that they cannot ship OpenCV + CUDA with JetPack because of some compatibility reason, but as far as I know recompiling OpenCV with CUDA works fine out of the box.
Most people will do that anyway, because it is usually required for efficient DNN pre- and post-processing. So the only reason not to do it is if you don’t have the storage space for the extra few hundred MB. Jetsons have very little storage, and that is a constant pain.

But wait a second, Dalus, are you sure there’s no OpenCL support? Some time back I had this thread regarding up-sampling performance:

OpenCV has a lot of backends, so it is possible it switched to a multi-core one. But threading is usually applied to regular cv::Mat as well - I’m not sure why cv::UMat would make a difference there. I have also done per-pixel processing with OpenMP on Jetson and that seemed to work fine.
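
For example, something like this (a minimal sketch; the function and its operation are made up for illustration, and it needs -fopenmp):

    // Illustrative per-pixel operation parallelized across rows with OpenMP.
    #include <opencv2/opencv.hpp>

    void halveBrightness(cv::Mat &img)         // expects a CV_8UC1 image
    {
        #pragma omp parallel for
        for (int y = 0; y < img.rows; ++y) {   // one row per loop iteration
            uchar *row = img.ptr<uchar>(y);
            for (int x = 0; x < img.cols; ++x)
                row[x] = row[x] / 2;           // the actual per-pixel work
        }
    }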

I suggest recompiling OpenCV with CUDA and using that. And if possible then try VPI.

Thanks guys,
I just wish I had known all this when starting the project, and not now, when it is already done, I am out of budget, and I am just trying to cope with “minor performance issues”.

BTW, this is the first time I have heard about VPI. Scrolling through the forums here, it seems the vast majority of people are using OpenCV anyway. Well, it’s well established, so no wonder…

So after thinking thoroughly about what you’ve both written above, I think I’ll try (in this order):

  1. Use a GStreamer pipeline and see if I can get it to drop frames so it doesn’t lag behind.
  2. Put the readout in a separate thread and synchronize access with a mutex and shared pointers so I always use the latest frame.
  3. Try to build OpenCV with CUDA support (BTW, can you maybe provide some up-to-date instructions for 4.6 that work best for you? There are many threads, and I already got lost once several months ago).
  4. Try out VPI. It looks like the thing I need, but it is rather involved. I don’t do any advanced OpenCV stuff, but the one thing I am not sure I can do the same way with VPI is camera calibration. I saw there is some distortion-warping support in VPI, but my calibration relies on the OpenCV calibration matrix. I use cv::initUndistortRectifyMap and cv::remap to compensate for the massive barrel distortion of the thermal camera, roughly as sketched below.
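
Simplified from my code (variable names are illustrative): the maps are computed once from the calibration, and cv::remap then runs per frame.

    #include <opencv2/opencv.hpp>

    cv::Mat map1, map2;

    void initUndistortion(const cv::Mat &K, const cv::Mat &distCoeffs,
                          const cv::Size &size)
    {
        // Identity rectification; reuse K as the new camera matrix.
        cv::initUndistortRectifyMap(K, distCoeffs, cv::Mat(), K, size,
                                    CV_16SC2, map1, map2);
    }

    void undistortFrame(const cv::Mat &src, cv::Mat &dst)
    {
        cv::remap(src, dst, map1, map2, cv::INTER_LINEAR);
    }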

Thanks again

VPI has a good OpenCV compatibility layer that makes using it with OpenCV Mats quite painless. initUndistortRectifyMap should be equivalent to the “Polynomial Distortion Model” in VPI.

First you need to create a VPIPolynomialLensDistortionModel struct (VPI - Vision Programming Interface: Lens Distortion Correction) and fill in the distortion coefficients. OpenCV distortion coefficients are in the order k1, k2, p1, p2, k3, k4, k5, k6 if you have 8 of them (OpenCV: Geometric Image Transformations).

Then you need to call vpiWarpMapGenerateFromPolynomialLensDistortionModel with the VPIPolynomialLensDistortionModel struct as well as the camera intrinsics (VPI - Vision Programming Interface: Lens Distortion Correction), which can essentially be taken from the OpenCV camera matrix (you can see the matrix here: OpenCV: Geometric Image Transformations). The extrinsic matrix would be an identity matrix.
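
Putting it together, roughly like this (an untested sketch following the VPI 1.x docs; fx/fy/cx/cy, the d[] coefficients and width/height are placeholders for your calibration values):

    #include <vpi/LensDistortionModels.h>
    #include <vpi/WarpMap.h>

    // OpenCV: K = [fx 0 cx; 0 fy cy; 0 0 1], d = (k1, k2, p1, p2, k3, ...)
    VPIPolynomialLensDistortionModel dist = {};
    dist.k1 = d[0]; dist.k2 = d[1];
    dist.p1 = d[2]; dist.p2 = d[3];
    dist.k3 = d[4];                       // k4..k6 only with 8 coefficients

    // VPI intrinsics are the top two rows of the OpenCV camera matrix.
    VPICameraIntrinsic K = { { fx, 0, cx },
                             { 0, fy, cy } };

    // Extrinsics: identity rotation, zero translation.
    VPICameraExtrinsic X = { { 1, 0, 0, 0 },
                             { 0, 1, 0, 0 },
                             { 0, 0, 1, 0 } };

    // Dense warp map covering the whole image, one sample per pixel.
    VPIWarpMap map = {};
    map.grid.numHorizRegions  = 1;
    map.grid.numVertRegions   = 1;
    map.grid.regionWidth[0]   = width;
    map.grid.regionHeight[0]  = height;
    map.grid.horizInterval[0] = 1;
    map.grid.vertInterval[0]  = 1;
    vpiWarpMapAllocData(&map);

    vpiWarpMapGenerateFromPolynomialLensDistortionModel(K, K, X, &dist, &map);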

NVidia could provide some basic functions to convert OpenCV matrices to the ones they use, but I guess it’s simple enough for them not to bother. The only real place to mess up is mixing up row-major and column-major ordering, but even then it’s usually possible to get it right on the second try.

Hi, so I have been messing around with my code for a couple of days, and here are my findings so far.

@DaneLLL Here are the tegrastats while the app is running:

RAM 1661/7765MB (lfb 1165x4MB) SWAP 0/3883MB (cached 0MB) CPU [59%@1420,55%@1420,30%@1420,26%@1420,23%@1420,25%@1420] EMC_FREQ 12%@1600 GR3D_FREQ 33%@204 APE 150 MTS fg 0% bg 3% AO@42.5C GPU@42.5C PMIC@50C AUX@43C CPU@44.5C thermal@43.3C VDD_IN 5112/4423 VDD_CPU_GPU_CV 1554/1131 VDD_SOC 1265/1160
RAM 1662/7765MB (lfb 1165x4MB) SWAP 0/3883MB (cached 0MB) CPU [43%@1420,68%@1420,29%@1420,20%@1420,27%@1420,27%@1420] EMC_FREQ 12%@1600 GR3D_FREQ 33%@204 APE 150 MTS fg 0% bg 4% AO@42.5C GPU@42.5C PMIC@50C AUX@42.5C CPU@44.5C thermal@43.1C VDD_IN 5071/4438 VDD_CPU_GPU_CV 1554/1141 VDD_SOC 1265/1163
RAM 1661/7765MB (lfb 1165x4MB) SWAP 0/3883MB (cached 0MB) CPU [49%@1420,50%@1420,34%@1420,37%@1420,29%@1420,22%@1420] EMC_FREQ 12%@1600 GR3D_FREQ 33%@204 APE 150 MTS fg 0% bg 4% AO@42.5C GPU@42C PMIC@50C AUX@42.5C CPU@44C thermal@42.8C VDD_IN 5071/4452 VDD_CPU_GPU_CV 1554/1150 VDD_SOC 1265/1165
RAM 1661/7765MB (lfb 1165x4MB) SWAP 0/3883MB (cached 0MB) CPU [27%@1420,27%@1420,55%@1420,24%@1420,32%@1420,49%@1420] EMC_FREQ 12%@1600 GR3D_FREQ 24%@204 APE 150 MTS fg 0% bg 7% AO@42.5C GPU@42C PMIC@50C AUX@42.5C CPU@44C thermal@42.8C VDD_IN 5071/4465 VDD_CPU_GPU_CV 1513/1158 VDD_SOC 1265/1167
RAM 1662/7765MB (lfb 1165x4MB) SWAP 0/3883MB (cached 0MB) CPU [35%@1420,33%@1420,28%@1420,32%@1420,47%@1420,49%@1420] EMC_FREQ 12%@1600 GR3D_FREQ 31%@204 APE 150 MTS fg 0% bg 6% AO@42C GPU@42C PMIC@50C AUX@42.5C CPU@44C thermal@42.95C VDD_IN 5152/4480 VDD_CPU_GPU_CV 1594/1167 VDD_SOC 1265/1169
RAM 1661/7765MB (lfb 1165x4MB) SWAP 0/3883MB (cached 0MB) CPU [29%@1420,33%@1420,25%@1420,27%@1420,79%@1420,47%@1420] EMC_FREQ 13%@1600 GR3D_FREQ 40%@204 APE 150 MTS fg 0% bg 7% AO@42C GPU@42C PMIC@50C AUX@42.5C CPU@44C thermal@42.8C VDD_IN 5275/4496 VDD_CPU_GPU_CV 1676/1178 VDD_SOC 1306/1172

So it definitely looks like there is some multi-threading going on.

@Dalus Using a gstreamer pipeline actually helps (at least to some extent) to get rid of the occasional delay. This is the VideoCapture I use at this point:

cv::VideoCapture cap("v4l2src device=/dev/video0 ! video/x-raw, width=640, height=512 ! videoconvert ! video/x-raw,format=BGR ! appsink max-buffers=1 drop=True");

Where the important parts are max-buffers=1 and drop=True. I googled that out, but honestly I must say I don’t understand why those are listed after the sink and not earlier in the pipe… Can anyone explain?

I have also begun studying the VPI interface, and I have two questions so far:

  1. All of the examples seem to be doing a one-time conversion, i.e. load an image, convert, save, and that’s it. In my case this is a continuous video stream, so I am not sure which calls I need to make just once and which need to be made repeatedly (i.e. each frame). From the terminology - I mean, you nVidia guys call it a stream - it looks like that should be set up just once. So, are there any examples using a continuous video stream?

  2. During that investigation I realized I could probably also use the gstreamer pipeline to upsample my video. That may well be a good idea, because I’d expect it to be HW accelerated. I think I should be using nvvidconv, but I was not able to get the pipeline right; it always comes up with some error. I was trying to figure it out in the console, and this is the line I felt most confident about:

gst-launch-1.0 v4l2src device=/dev/video0 ! video/x-raw,width=640,height=512,format=I420 ! nvvidconv !
'video/x-raw(memory:NVMM), width=(int)1280, height=(int)1024, format=(string)I420' ! glimagesink

But no, it doesn’t work. It says:

WARNING: erroneous pipeline: could not link nvvconv0 to glimagesinkbin0, glimagesinkbin0 can't handle caps video/x-raw(memory:NVMM), width=(int)1280, height=(int)1024, format=(string)I420

Any idea?

Thanks guys!

Just to elaborate a bit on the question marked 1):

The examples do have initialization and processing phases marked by comments. What I don’t understand is how that goes together with reading frames one after another from cv::VideoCapture. I understand that I cannot free a cv::Mat while still having a VPI wrapper around it. But cv::VideoCapture::read would probably invalidate the underlying data anyway, right? I just wish there was an example…
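
For the record, this is roughly what I imagine the per-frame part should look like (untested; the Remap calls are my guess from the VPI 1.x docs, and 'map' would be the warp map from earlier in this thread - please correct me if the rewrapping is wrong):

    #include <opencv2/opencv.hpp>
    #include <vpi/OpenCVInterop.hpp>
    #include <vpi/Image.h>
    #include <vpi/Stream.h>
    #include <vpi/algo/Remap.h>

    cv::VideoCapture cap(0);
    cv::Mat frame;
    cap.read(frame);

    // One-time setup: stream, input wrapper, output image, remap payload.
    VPIStream stream;
    vpiStreamCreate(0, &stream);
    VPIImage input, output;
    vpiImageCreateOpenCVMatWrapper(frame, 0, &input);
    vpiImageCreate(frame.cols, frame.rows, VPI_IMAGE_FORMAT_BGR8, 0, &output);
    VPIPayload remap;
    vpiCreateRemap(VPI_BACKEND_CUDA, &map, &remap);  // 'map' = the warp map

    // Per-frame: rewrap the (possibly reallocated) cv::Mat and submit.
    while (cap.read(frame)) {
        vpiImageSetWrappedOpenCVMat(input, frame);
        vpiSubmitRemap(stream, VPI_BACKEND_CUDA, remap, input, output,
                       VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0);
        vpiStreamSync(stream);
        // ... lock 'output', display/copy it, unlock ...
    }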

It also seems to me that there is a bug in the Resampling example:

        {
             VPIImageData data;
             CHECK_STATUS(vpiImageLock(output, VPI_LOCK_READ, &data));
  
             // Lock output image to retrieve its data on cpu memory
             VPIImageData outData;
             CHECK_STATUS(vpiImageLock(output, VPI_LOCK_READ, &outData));
  
             cv::Mat cvOut(outData.planes[0].height, outData.planes[0].width, CV_8UC3, outData.planes[0].data,
                           outData.planes[0].pitchBytes);
             imwrite("scaled_" + strBackend + ".png", cvOut);
  
             // Done handling output image, don't forget to unlock it.
             CHECK_STATUS(vpiImageUnlock(output));
         }

Those first two lines in the block seem to be superfluous: the data variable is not used anywhere, and another problem is that the output is “double-locked”. That only works once on my side, and it gets stuck the second time I try to run the processing part…
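
For reference, the block works for me once the duplicated lock is removed:

    {
        // Lock output image to retrieve its data on cpu memory
        VPIImageData outData;
        CHECK_STATUS(vpiImageLock(output, VPI_LOCK_READ, &outData));

        cv::Mat cvOut(outData.planes[0].height, outData.planes[0].width, CV_8UC3,
                      outData.planes[0].data, outData.planes[0].pitchBytes);
        imwrite("scaled_" + strBackend + ".png", cvOut);

        // Done handling output image, don't forget to unlock it.
        CHECK_STATUS(vpiImageUnlock(output));
    }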

And one more question… This VPI framework is Linux-only, right? I cannot develop on Windows? I haven’t found any explicit mention of this, and I haven’t been able to Google anything on the topic, which probably means it is Linux-only. Not a big deal, I just like developing in Visual Studio, that’s all.

Hi, any more suggestions? I’m kind of stuck :-(

Hi,
Please refer to the steps in the Jetson Nano FAQ and check if you can launch the camera with a gstreamer command or with 12_camera_v4l2_cuda. On Jetson platforms, we would suggest using either of these software frameworks to achieve better performance. However, we are not sure if your camera supports general formats such as YUV422.

There are VPI samples listed in
https://docs.nvidia.com/vpi/samples.html
Please also take a look.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.