Deepstream_tao_apps has degrading performance with multiple detected objects

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Xavier AGX (also tested on Orin)
• DeepStream Version
6.0 (6.1 on the Orin used for comparison)
• JetPack Version (valid for Jetson only)
4.6.3 (5.0.1 on the Orin)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type (questions, new requirements, bugs)
Bug (performance)

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

I’ve modified the facial landmark app included in the deepstream_tao_apps repository to use an nvarguscamerasrc. The incoming stream is video/x-raw(memory:NVMM), 1080p, NV12 format, and 30fps.

I’ve set the nvstreammux batch-size to 1 and live-source to 1, and its resolution matches the source’s.
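For reference, the capture side of the modified pipeline looks roughly like this. This is a sketch based on my description above, not the exact app code; the batched-push-timeout value (one 30 fps frame period, in microseconds) is my assumption:

```shell
# Capture side of the modified pipeline (sketch only; property values
# assumed from the description above, not copied from the app source).
gst-launch-1.0 \
  nvstreammux name=mux batch-size=1 live-source=1 \
    width=1920 height=1080 batched-push-timeout=33333 ! fakesink \
  nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM), width=1920, height=1080, format=NV12, framerate=30/1' ! \
  mux.sink_0
```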

I haven’t tweaked anything else in the application; all the probe functions stay the same. I first tried rendering the result to the display, but that had extremely poor performance (0.3–1 fps), so I ultimately changed the sink to nvrtspoutsinkbin and render the stream on another desktop system, where I get a framerate that matches the console output.

What I notice is that I get ~24 fps when only 1 face is present in the frame, and the framerate degrades rapidly as the number of faces increases: 2 faces = ~12 fps, 4 faces = ~6 fps, etc. After some reading I saw that I could increase the batch-size of the secondary inference plugin to match the average number of detections per frame, but this didn’t change the performance. There is also a lot of latency between what the pipeline renders and what the console outputs. Example: I bring a picture of 3 faces into the frame and the console doesn’t update the Face Count print statement for ~2 seconds.
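For context, the batch-size change I tried lives in the secondary GIE (facial landmark) nvinfer config file. The snippet below is illustrative only, assuming the standard nvinfer config keys; it is not the exact file shipped with deepstream_tao_apps:

```ini
# Secondary (facial landmark) nvinfer config snippet -- illustrative,
# not the exact file from deepstream_tao_apps.
[property]
gpu-id=0
# process-mode=2: operate on objects from the primary detector,
# not on full frames
process-mode=2
# batch several detected faces into one inference call
batch-size=4
```

In principle, a secondary batch-size larger than the typical face count lets nvinfer run one inference per frame instead of one per face, which is why I expected it to help.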

I’m not sure why I’m getting such poor performance out of this pipeline. Looking online, I see NVIDIA has benchmarked the FaceDetect model at 537 fps and FPENet (Facial Landmark Estimator) at 1015 fps. That’s a lot of headroom!

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Have you monitored the GPU load and CPU load while the pipeline performance drops to 0.3–1 fps?

CPU usage for the pipeline stays consistent at 77.2%, while GPU utilization jumps around a lot. With 1 face it hovers between 45% and 91%, averaging about 70%; with 4 faces it jumps between 6% and 78%, averaging around 28%. Is there a better way of measuring GPU utilization? Ideally I’d like a stable GPU utilization figure.
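For reference, the utilization numbers above can be sampled on Jetson with tegrastats (the GR3D_FREQ field is the GPU load); jtop gives a nicer averaged view. Both tools are standard on JetPack, but the exact output format varies by release:

```shell
# Print Jetson CPU/GPU/memory stats every 500 ms; GR3D_FREQ is GPU load.
sudo tegrastats --interval 500

# Alternatively, jtop (from the jetson-stats package) shows averaged,
# per-engine utilization interactively:
#   sudo pip3 install jetson-stats
#   sudo jtop
```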

Please follow the “system configuration” here to enable the max power. Performance — DeepStream 6.3 Release documentation
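For completeness, enabling max performance on Xavier AGX comes down to these commands (mode IDs differ between Jetson modules, so verify with the query flag first):

```shell
# Query the current power model and list available modes
sudo nvpmodel -q

# Select the max-performance power model (mode 0 = MAXN on Xavier AGX)
sudo nvpmodel -m 0

# Pin CPU/GPU/EMC clocks to their maximum
sudo jetson_clocks
```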

I have been running with max performance. I followed the guide and still see the same performance.

I ran the gst-shark profiler to understand the processing time of each element and got the following results with just 1 face in the frame:

The graph isn’t super descriptive, but it shows each element taking about 100 ms of processing time. For instance, vflip is just nvvideoconvert flip-method=4, and its processing time ranges between 40 ms and 160 ms!
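For anyone reproducing this, gst-shark’s proctime tracer is enabled through environment variables along these lines (assuming gst-shark is built and on GST_PLUGIN_PATH; the trailing "..." stands in for the app’s usual arguments):

```shell
# Enable gst-shark's per-element processing-time tracer and
# write the trace data out for plotting.
export GST_DEBUG="GST_TRACER:7"
export GST_TRACERS="proctime"
export GST_SHARK_LOCATION=/tmp/gstshark-trace   # trace output directory
./deepstream-faciallandmark-app ...             # run the pipeline as usual
```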

If I run just a simple pipeline:
nvarguscamerasrc ! "video/x-raw(memory:NVMM), width=1920, height=1080, format=NV12, framerate=30/1" ! nvvideoconvert flip-method=4 ! nv3dsink
even this nvvideoconvert takes between 15 ms and 40 ms for what should be a simple operation.

Testing with Xavier AGX and DeepStream 6.3, the deepstream_tao_apps/apps/tao_others/deepstream-faciallandmark-app (master · NVIDIA-AI-IOT/deepstream_tao_apps) runs at 88.1 fps with a 4-face video (1280x720 @ 30 FPS).

I’m constrained to DeepStream 6.0 and JetPack 4.6.3. I ran the default project again and got 12fps for a video with 3 faces (1280x720 @ 30fps).

My end goal is to use the faciallandmark detector with nvarguscamerasrc but I don’t know why all the elements have such high processing time.

Here’s the latency measurement output of the faciallandmarks pipeline adapted to use nvarguscamerasrc as an input. The input stream is 1080p 30fps. I’ve removed nvtiler.

output.txt (199.4 KB)
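The latency numbers in the attachment come from DeepStream’s built-in latency measurement, which is switched on with environment variables before launching (assuming the app calls the nvds_measure_buffer_latency() API, as deepstream-app does; the trailing "..." stands in for the usual arguments):

```shell
# Enable DeepStream frame-level and per-component latency measurement
export NVDS_ENABLE_LATENCY_MEASUREMENT=1
export NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1
./deepstream-faciallandmark-app ...
```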

Have you tried local video with 4 faces?

A local video with 4 faces runs at 8.53 fps using the sample deepstream-faciallandmark-app. The source was a 1080p 30 fps .mkv file.

Do you have other AGX Xavier board?

Yes, I’ve done the same test on both Xavier AGX boards and get the same performance issue.

I just ran the same 4-face video through the deepstream-faciallandmark-app on a new Orin with JetPack 5.0.1 and DeepStream 6.1 and got an average of 25.10 fps. Obviously better than my Xaviers, but still lower than your Xavier performance.

There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks

Have you tested with DeepStream 6.3?

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.