Retrieving bbox coordinates outside of the buffer probe in deepstream_faciallandmark_app.cpp

Hello,

I want to feed the second detector (sgie) with a stream that is a concatenation (using concat) of the cropped (using videobox) and upscaled (using videoscale) bounding boxes, so the pipeline would look like the first image attached. I found the bbox coordinates in "obj_meta->rect_params" inside pgie_pad_buffer_probe. The problem is: how can I retrieve them outside of the probe to apply the videobox + videoscale transformations, and then put these videobox branches into the pipeline?
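For reference, this is roughly how I read the coordinates inside the probe (a simplified sketch of the standard NvDsBatchMeta / NvDsObjectMeta iteration, not the exact code from the app):

#include <gst/gst.h>
#include "gstnvdsmeta.h"

static GstPadProbeReturn
pgie_pad_buffer_probe (GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);

  if (!batch_meta)
    return GST_PAD_PROBE_OK;

  for (NvDsMetaList *l_frame = batch_meta->frame_meta_list; l_frame != NULL;
       l_frame = l_frame->next) {
    NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) l_frame->data;

    for (NvDsMetaList *l_obj = frame_meta->obj_meta_list; l_obj != NULL;
         l_obj = l_obj->next) {
      NvDsObjectMeta *obj_meta = (NvDsObjectMeta *) l_obj->data;

      /* Face bbox in muxer output coordinates. */
      gfloat left   = obj_meta->rect_params.left;
      gfloat top    = obj_meta->rect_params.top;
      gfloat width  = obj_meta->rect_params.width;
      gfloat height = obj_meta->rect_params.height;

      /* The open question: how to hand these values to videobox/videoscale
       * branches built outside of this probe? */
      (void) left; (void) top; (void) width; (void) height;
    }
  }
  return GST_PAD_PROBE_OK;
}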

• Hardware (T4/V100/Xavier/Nano/etc) : Jetson AGX Orin
• Network Type : FPEnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Thank you,

Matéo

Can you share why you want to do this?

  1. All the deepstream plugins work on GPU or VIC. So this means that if you want to use videobox/videoscale, you need to copy the memory from GPU to CPU.
  2. Why do you need to crop the objects from the video frame upstream of the sgie? The sgie can run inference on the detected objects directly (see the sketch below).
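For example, the secondary gst-nvinfer instance can be pointed at the objects produced by the primary detector (a sketch; the "sgie" variable and the config file name are placeholders, and the same settings can also be put in the sgie config file):

/* Illustrative: run the secondary nvinfer on the detected objects instead of
 * on the full frame. "sgie" and the config path are placeholders. */
g_object_set (G_OBJECT (sgie),
              "config-file-path", "sgie_config.txt",
              "process-mode", 2,        /* 2 = secondary: operate on objects */
              "infer-on-gie-id", 1,     /* only objects from the pgie (unique-id 1) */
              NULL);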

What kind of improvements do you want to make to this sample?

Hi,

We want a solution that detects whether the people filmed by the camera are speaking or not. When testing the facial landmarks app, we observed that when faces were at a distance of 2.50 meters or more, the facial landmarks around the mouth were not moving while people were talking. But when people were near the camera (less than 2.50 meters), the detection was totally fine.
So the idea is to add, after the bbox detection, a stream containing only the cropped and upscaled bounding boxes as the input for the landmark model, in order to better detect the points around the mouth when people are more than 2.50 meters away from the camera.

Extracting and scaling the object may not achieve what you need.

You can try to scale the whole video frame using nvvideoconvert after decoding (use the src-crop and dest-crop properties).

Improving the accuracy of the model may be a better way.

Okay, so it is not the cropped objects that must be scaled but the whole frame. Just to confirm, the nvvideoconvert element you are talking about, and that I should modify, is this one: GstElement *nvvidconv; ? And would scaling the whole frame give the same result as changing MUXER_OUTPUT_WIDTH/HEIGHT and tiler_rows/tiler_columns?

Thank you

By the way, in the case of dynamic cropping and upscaling of the bounding boxes before feeding them to the sgie, can I use nvvidconv or should I use nvdspreprocess to perform it successfully?

Thank you

There are many objects in the video frame, and it is not possible to scale a smaller object alone.
So, I think scaling the whole frame before inference might solve your problem.

You can do the scaling directly after decoding, adding nvvideoconvert before nvstreammux instead of the existing one in your code.
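If your app does not already have an nvvideoconvert at that point, the placement would look roughly like this (a sketch; "pipeline", "decoder_src_pad" and "streammux" stand for whatever your app already has, and error checking is omitted):

static void
add_pre_mux_convert (GstElement *pipeline, GstPad *decoder_src_pad,
                     GstElement *streammux)
{
  /* nvvideoconvert that crops/scales the decoded frame before nvstreammux. */
  GstElement *pre_conv =
      gst_element_factory_make ("nvvideoconvert", "pre-mux-convert");
  GstPad *conv_sink, *conv_src, *mux_sink;

  g_object_set (G_OBJECT (pre_conv),
                "src-crop",  "0:0:1920:1080",    /* region of the decoded frame (placeholder) */
                "dest-crop", "0:0:3840:2160",    /* where it is placed in the output (placeholder) */
                NULL);

  gst_bin_add (GST_BIN (pipeline), pre_conv);

  conv_sink = gst_element_get_static_pad (pre_conv, "sink");
  conv_src  = gst_element_get_static_pad (pre_conv, "src");
  mux_sink  = gst_element_get_request_pad (streammux, "sink_0");

  gst_pad_link (decoder_src_pad, conv_sink);  /* decoder -> nvvideoconvert */
  gst_pad_link (conv_src, mux_sink);          /* nvvideoconvert -> nvstreammux */

  gst_object_unref (conv_sink);
  gst_object_unref (conv_src);
  gst_object_unref (mux_sink);
}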

MUXER_OUTPUT_WIDTH and MUXER_OUTPUT_HEIGHT have nothing to do with the above; they are parameters of nvstreammux.

Using nvdspreprocess to scale may require you to modify the code of the nvdspreprocess plugin.

Okay, so I should scale right before inference. This nvvidconv element is just after decoding and just before streammux. Therefore I added these lines:
g_object_set(G_OBJECT(ds_source_struct->nvvidconv), "src-crop", "0:0:1920:1080", NULL);
g_object_set(G_OBJECT(ds_source_struct->nvvidconv), "dest-crop", "0:0:3840:2160", NULL);
When I change those parameters I see some changes on the output video, but I am not sure I understand how this would solve our problem. Could you give examples of parameter values that might help?

Thank you

If I understood correctly, with nvdspreprocess dynamic cropping is already supported, but I would have to implement the scaling part myself?

Thank you

Stretch the ROI to the entire image, and the bbox of the object will become larger.
As you described, if the bbox of the face is too small, the accuracy will decrease.

Sorry for the late reply. I added these lines:
g_object_set(G_OBJECT(ds_source_struct->nvvidconv), "src-crop", "0:0:3840:2160", NULL);
g_object_set(G_OBJECT(ds_source_struct->nvvidconv), "dest-crop", "0:0:3840:2160", NULL);
(the stream is in 4K) and the points around the mouth were still not moving within a distance of 4 meters. The problem persists. The crop is applied, because when I change the parameters the output changes as well.

But when testing the app with faces 3 to 7 meters away, we noticed that when we manually zoomed the camera onto a face, the landmarks were detected, and by watching the points move we could tell whether the person was talking or not. What do you recommend?

Thank you

I mean stretching the ROI to the entire frame. For example, the following code will stretch the ROI to 4k.

g_object_set(G_OBJECT(ds_source_struct->nvvidconv), "src-crop", "960:540:1920:1080", NULL);
g_object_set(G_OBJECT(ds_source_struct->nvvidconv), "dest-crop", "0:0:3840:2160", NULL);
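The crop string format is left:top:width:height, so "960:540:1920:1080" is the center 1920x1080 region of the 3840x2160 frame, and dest-crop then stretches it to the full 4K output.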

Without cropping, this method will not be able to produce the expected results.

This may focus more on one zone, but since nvvidconv is placed before the bbox detection (so the crop is fixed), wouldn't this method only work if we already know where the faces are located in the frame?

Thank you

If you care about the entire 4k area and not just the ROI, it may not be possible to zoom in further due to hardware limitations.

What value did you set for the width/height of nvstreammux?
nvstreammux scales the frames when forming a batch. You can set the nvstreammux width/height properties to 3840x2160;
this will also affect the bbox size.
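For example (illustrative; "streammux" is the nvstreammux element in your app):

/* Keep the batch at 4K so faces are not downscaled when the batch is formed. */
g_object_set (G_OBJECT (streammux),
              "width", 3840,
              "height", 2160,
              NULL);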

Yes, because in our case the faces may be anywhere in the image. I already set the nvstreammux parameters to 3840x2160.

This may require a bit of trickery: divide one frame into multiple images to form a batch,
just like the following pipeline.

                     | --> nvvideoconvert (src-crop/dest-crop) --> |
uridecodebin --> tee | --> nvvideoconvert (src-crop/dest-crop) --> | nvstreammux (form batch) --> ...
                     | --> nvvideoconvert (src-crop/dest-crop) --> |
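A rough, untested sketch of how the branches could be wired up (standard GStreamer request-pad handling; the crop regions are placeholders, and in a real pipeline a queue would normally be added after each tee branch):

/* Split one decoded 4K frame into several cropped copies and batch them
 * with nvstreammux. The crop regions below are placeholders (the four quadrants). */
static const gchar *crops[] = {
  "0:0:1920:1080",
  "1920:0:1920:1080",
  "0:1080:1920:1080",
  "1920:1080:1920:1080"
};

static void
link_tee_branches (GstBin *pipeline, GstElement *tee, GstElement *streammux)
{
  for (guint i = 0; i < G_N_ELEMENTS (crops); i++) {
    GstElement *conv = gst_element_factory_make ("nvvideoconvert", NULL);
    gchar *sink_name = g_strdup_printf ("sink_%u", i);
    GstPad *tee_src, *conv_sink, *conv_src, *mux_sink;

    /* Each branch stretches its own region of the frame to the full output. */
    g_object_set (G_OBJECT (conv),
                  "src-crop", crops[i],
                  "dest-crop", "0:0:3840:2160",
                  NULL);
    gst_bin_add (pipeline, conv);

    tee_src   = gst_element_get_request_pad (tee, "src_%u");
    conv_sink = gst_element_get_static_pad (conv, "sink");
    conv_src  = gst_element_get_static_pad (conv, "src");
    mux_sink  = gst_element_get_request_pad (streammux, sink_name);

    gst_pad_link (tee_src, conv_sink);   /* tee -> nvvideoconvert */
    gst_pad_link (conv_src, mux_sink);   /* nvvideoconvert -> nvstreammux */

    gst_object_unref (tee_src);
    gst_object_unref (conv_sink);
    gst_object_unref (conv_src);
    gst_object_unref (mux_sink);
    g_free (sink_name);
  }
}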

This is a similar topic.

Okay, thank you, I will look into this.

By the way, I cannot reach the page https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/blob/master/apps/tao_others/deepstream-faciallandmark-app/ anymore. It seems NVIDIA removed the facial landmarks and gaze apps from the repository?

DS-7.1 is based on TensorRT 10. Some TAO models lack support, so they have been removed from the DS-7.1 branch.

Please use the DS-7.0 branch.