Surface vs Frame

• Hardware Platform (Jetson / GPU) Jetson Xavier NX
• DeepStream Version 5.1
• Issue Type Question

Hey there,
I hope you are having a good day.

I am writing a custom DeepStream plugin that works as a face aligner. It reads a tensor (face landmarks) from the input buffer's meta and uses VPI to apply a transformation to the faces based on the landmark positions.
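For reference, this is roughly how I pick up the landmark tensor from the incoming meta (an untested sketch against the DeepStream 5.1 headers; it assumes the landmark SGIE runs with output-tensor-meta=1 and that the landmarks sit in output layer 0):

#include "gstnvdsmeta.h"
#include "gstnvdsinfer.h"

static void read_landmarks (GstBuffer *inbuf)
{
    NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (inbuf);

    for (GList *lf = batch_meta->frame_meta_list; lf; lf = lf->next) {
        NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) lf->data;

        for (GList *lo = frame_meta->obj_meta_list; lo; lo = lo->next) {
            NvDsObjectMeta *obj_meta = (NvDsObjectMeta *) lo->data;

            for (GList *lu = obj_meta->obj_user_meta_list; lu; lu = lu->next) {
                NvDsUserMeta *user_meta = (NvDsUserMeta *) lu->data;
                if (user_meta->base_meta.meta_type != NVDSINFER_TENSOR_OUTPUT_META)
                    continue;

                NvDsInferTensorMeta *tmeta =
                    (NvDsInferTensorMeta *) user_meta->user_meta_data;
                // Landmark coordinates in the host copy of output layer 0
                // (layer index 0 is an assumption for this sketch).
                float *landmarks = (float *) tmeta->out_buf_ptrs_host[0];
                // ... compute the per-face warp from `landmarks` with VPI ...
                (void) landmarks;
            }
        }
    }
}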

Here is my question:
Each frame may have multiple faces, so for each frame I have to create a new buffer that stacks the aligned faces. I have done this following this post. However, I am confused about how I should structure the output buffer.

Should I place each face in its own surface, or each face in its own frame?

After aligning the faces, I would like to feed them to an nvinfer plugin for feature extraction. How does nvinfer expect the faces to be structured: one frame with multiple surfaces, or multiple frames, each with one surface holding a face?

Here’s where my question applies in code:

While creating the output buffer in gst_batcher_prepare_output_buffer, there is a section of code that creates a number of frame metas, each with one surface. I am unsure whether this is fine for my case or whether I should create a single frame meta with multiple surfaces:

// Let's say we have 4 faces.
for (int i = 0; i < 4; i++) {
    frame_meta = nvds_acquire_frame_meta_from_pool (batch_meta);

    // Just some parameters, ignore them.
    frame_meta->pad_index = i;
    frame_meta->source_id = 0;
    frame_meta->buf_pts = 0;
    frame_meta->ntp_timestamp = 0;
    frame_meta->frame_num = 0;
    frame_meta->batch_id = i;
    frame_meta->source_frame_width = 100;
    frame_meta->source_frame_height = 100;

    // One surface per frame!
    frame_meta->num_surfaces_per_frame = 1;
    nvds_add_frame_meta_to_batch (batch_meta, frame_meta);
}

And for the other case (one frame with multiple surfaces), it would be like this:

// No more loops, just one instance.
frame_meta = nvds_acquire_frame_meta_from_pool (batch_meta);

// Just some parameters, ignore them.
frame_meta->pad_index = 0;
frame_meta->source_id = 0;
frame_meta->buf_pts = 0;
frame_meta->ntp_timestamp = 0;
frame_meta->frame_num = 0;
frame_meta->batch_id = 0;
frame_meta->source_frame_width = 100;
frame_meta->source_frame_height = 100;

// 4 surfaces per frame!
frame_meta->num_surfaces_per_frame = 4;
nvds_add_frame_meta_to_batch (batch_meta, frame_meta);

Thank you for spending time on this.
Best.

Nvinfer needs batched frames, which are batched by nvstreammux. Whether there are multiple faces in one frame or multiple face surfaces in each frame depends on the network.

I don't understand what your plugin is meant to do. Do you want to output batched face frames (cropped from the original frames) to the downstream nvinfer?

We already have some sample apps with facial landmark models. You may take a look at them to check whether they are useful for you: https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/tree/release/tlt3.0/apps/tlt_others

Thank you for your quick response.

Yes, exactly. Basically, the pipeline is as follows:

[input stream] -> PGIE(detects faces) -> SGIE(detects landmarks and outputs tensor)
-> My Aligner (Reads tensor and faces from meta, aligns each face, outputs faces stacked)
-> SGIE (extracts feature vector from faces)

So my plan is to stack the cropped faces in a way that the next nvinfer (feature extractor) runs inference on each aligned face.

My question is: how does the last nvinfer (the feature extractor) expect these faces?
One NvDsFrameMeta with multiple NvBufSurfaces, a cropped face in each?
or
Multiple NvDsFrameMetas, each with one NvBufSurface holding one cropped face?

My first guess is that I should use multiple NvBufSurfaces, so that if my plugin needs to work with multi-stream inputs, I can create one NvDsFrameMeta per stream.

You don't need to send face pictures to nvinfer; all it needs is the bbox. When nvinfer works as an SGIE, it can crop the faces from the frames using the bbox information inside the object meta.
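For example, cropping via the object meta normally just takes a few SGIE settings in the nvinfer config (the ids below are placeholders for your pipeline):

# Relevant SGIE settings (values are placeholders)
[property]
process-mode=2            # run as SGIE on detected objects, not full frames
operate-on-gie-id=1       # consume detections from the PGIE with unique-id=1
operate-on-class-ids=0    # class id of the face detector output (assumed 0)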


Yes, this would be applicable if I didn’t want to rotate the faces.

A face in a bounding box may not be upright. It may have rotation (the head might be tilted to the left or right). My aligner plugin rotates the faces to make them upright.

So when my plugin rotates each crop, I should stack these new rotated faces in the output buffer.
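For context, the per-face angle comes out of the landmarks roughly like this (an untested sketch; I'm assuming the landmark tensor gives eye centers in pixel coordinates):

// Untested sketch: per-face roll angle from the two eye-center landmarks.
#include <math.h>

typedef struct { float x, y; } Point2f;

static float face_roll_deg (Point2f left_eye, Point2f right_eye)
{
    // Angle of the inter-eye line w.r.t. the horizontal axis;
    // rotating the crop by the negative of this makes the face upright.
    float dy = right_eye.y - left_eye.y;
    float dx = right_eye.x - left_eye.x;
    return atan2f (dy, dx) * (180.0f / 3.14159265f);
}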

My question is how I should place each face in the output buffer.
Should I place each one in the surfaceList, so that NvBufSurface->surfaceList[0] would be the first crop, [1] the next, and so on…?

Will the downstream nvinfer in the pipeline process each of these surfaces?
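To make the layout I have in mind concrete, here is a minimal, untested sketch of allocating the batched output surface (DeepStream 5.1 NvBufSurface API; the format and sizes are placeholders):

#include "nvbufsurface.h"

// One batched NvBufSurface where surfaceList[i] holds aligned face i.
static NvBufSurface * alloc_face_batch (int num_faces, int width, int height)
{
    NvBufSurface *surf = NULL;
    NvBufSurfaceCreateParams params = { 0 };

    params.gpuId = 0;
    params.width = width;
    params.height = height;
    params.colorFormat = NVBUF_COLOR_FORMAT_RGBA;  // placeholder format
    params.layout = NVBUF_LAYOUT_PITCH;
    params.memType = NVBUF_MEM_DEFAULT;

    // batchSize == num_faces, so surf->surfaceList[i] is the i-th crop.
    if (NvBufSurfaceCreate (&surf, num_faces, &params) != 0)
        return NULL;
    return surf;
}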

You may need to add the rotation to the nvinfer preprocessing part instead of implementing it in a new plugin.

Yes, that would be a solution if the rotations were not unique to each face. The rotation angle is computed for each face individually based on the landmarks given by the upstream nvinfer (facial landmark recognition model).

So a custom inference plugin should be implemented; the Gst-nvinfer plugin cannot meet your requirement.

We have sample code for GazeNet in deepstream_tao_apps/apps/tlt_others/deepstream-gaze-app at release/tlt3.0 · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub. The GazeNet network needs face pictures and facial landmarks for inferencing, so the GazeNet inferencing, preprocessing, and postprocessing are implemented as a library (based on CUDA and TensorRT) and then integrated through the nvdsvideotemplate (Gst-nvdsvideotemplate — DeepStream 5.1 Release documentation) plugin.
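For reference, such a custom library is wired into the pipeline through the nvdsvideotemplate element roughly like this (the library name and the surrounding elements are placeholders):

gst-launch-1.0 ... ! nvdsvideotemplate customlib-name=./libcustom_impl.so ! ...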