Running multiple models for different purposes including pre/post-processing and feeding their outputs into each other on GPU

• Hardware Platform: GPU
• DeepStream Version: 5.0.0
• TensorRT Version: 7.0.0.11
• NVIDIA GPU Driver Version (valid for GPU only): 460.32.03

I would like to map a face recognition pipeline to DeepStream and make extensive use of the GPU. Right now the SDK lets me provide a model that produces bounding boxes, plus a function that parses them and forwards the boxes to another pipeline element in charge of drawing them on the frame. My first question: where does the output of the bbox parsing live before it is forwarded? Right now only the model itself appears to be executed on the device. I would like as much as possible to run on the GPU, with as few host/device context switches as possible.

What would be the best approach to get something like this mapped into DS:

Frame → Detection [PGIE] → post processing on each detected bbox → feed every post-processed finding into another model [SGIE?] → second post processing on model output and pre computed data → feed everything into another model [another SGIE?] → output (ideally here is the switch back to host)

Would something like this be possible? How would I start on this?

DeepStream already encapsulates these steps internally.

The DeepStream SDK is a higher-level API than the pipeline you describe.
The nvinfer plugin includes pre-processing, inference, and post-processing.

The correct deepstream pipeline is:

Input frames → nvstreammux(combine frames into batches) → nvinfer(PGIE for detection) → nvinfer(SGIE for recognition) → output
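As a rough illustration of the pipeline above, here is a minimal Python sketch that assembles a gst-launch-1.0 style description string. The element names (nvstreammux, nvinfer) and the properties batch-size, width, height and config-file-path come from DeepStream; the URI, the config file names, the use of uridecodebin as the source and fakesink as the output are assumptions made only for this example, and the resulting string has not been run against a DeepStream installation here.

```python
def build_pipeline_description(uri, pgie_config, sgie_config,
                               batch_size=1, width=1920, height=1080):
    """Return a gst-launch-1.0 style description of a PGIE -> SGIE pipeline."""
    # Source branch: decode the input and link it to the muxer's first sink pad.
    source = f"uridecodebin uri={uri} ! mux.sink_0"
    elements = [
        # nvstreammux combines frames from its sink pads into batches.
        f"nvstreammux name=mux batch-size={batch_size} width={width} height={height}",
        # First nvinfer instance: PGIE, detection on full frames.
        f"nvinfer config-file-path={pgie_config}",
        # Second nvinfer instance: SGIE, inference on the detected objects.
        f"nvinfer config-file-path={sgie_config}",
        # Placeholder output (assumption); a real app would render or export metadata.
        "fakesink",
    ]
    return source + " " + " ! ".join(elements)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_pipeline_description("file:///tmp/sample.mp4",
                                     "pgie_config.txt", "sgie_config.txt"))
```

The printed string could be passed to gst-launch-1.0 for a quick experiment; whether each nvinfer instance acts as PGIE or SGIE is decided by its own configuration file, not by its position in the string.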

Please learn some GStreamer basics (https://gstreamer.freedesktop.org/) before you start with DeepStream.

Please refer to the DeepStream SDK documentation (Welcome to the DeepStream Documentation, DeepStream Version: 5.0) for how to get started with DeepStream.

OK, say I’d like to map my pipeline to the one you sketched:

Input frames → nvstreammux(combine frames into batches) → nvinfer(PGIE for detection) → nvinfer(SGIE for recognition) → output

Is it possible to add a post-processing step to the PGIE? The model used in the SGIE expects its input in a certain format, which the PGIE does not generally provide.

You need to tell us what kind of format the SGIE needs.

The model expects a 112x112 image. However, it also expects the image to be cropped, scaled and aligned, which I would like to add to the post-processing of the PGIE. Alignment requires landmark detection, which is another inference step and does not really fit into the PGIE/SGIE scheme.

nvinfer can do crop, scaling and normalization before sending data to the model. You don’t need to do it with PGIE post-processing.

What does this mean?

I guess everything sent to the SGIE is the cropped detections from the PGIE, i.e. the cropped bounding boxes? How would you scale those cropped patches before sending them to the SGIE? Is there an example available?

What does this mean?

So what I would like to do is, take the cropped and scaled bounding boxes and perform a landmark detection on those (which would be another inference between detection and recognition). Then according to the landmarks found on each bbox do a transformation on those bboxes and finally feed those transformed bboxes into the recognition.

We have encapsulated all of these functions in the nvinfer plugin. Cropping, scaling, transformation, etc. are all done on the GPU. The nvinfer source code is available at /opt/nvidia/deepstream/deepstream-5.0/sources/gst-plugins/gst-nvinfer

What sample do you need? We already have a back-to-back sample which uses a PGIE + 3 SGIEs to detect cars and identify the car color, manufacturer, and type: deepstream_reference_apps/back-to-back-detectors (NVIDIA-AI-IOT/deepstream_reference_apps on GitHub)

As I have said, nvinfer already supports crop, scaling, …, so you don’t need to take care of it yourself. Is the “landmark detection” your PGIE model? How many models will you use in your application? Can you explain the relationship between these models? Every model needs its own nvinfer plugin instance in the pipeline.

The deepstream-test2 sample is also a useful reference for using multiple models.
/opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test2

Is the “landmark detection” your PGIE model? How many models will you use in your application? Can you explain the relationship between these models?

Thanks for your help and patience so far. I’m wondering if this is actually the right approach. I’ll try to sketch as best I can what I would actually like to have:

My PGIE model is a detector which detects objects. I would like to take those detections, crop them, scale them and feed them into another model: the landmark detector (a regressor), which finds features (a set of [x,y] coordinates for each detection). According to the landmarks found for each detection, I apply a transformation to that detection and feed the transformed detection into another model, the recognition model, which is another regressor. So I don’t really feed model 2’s output into model 3; I want to use its output to alter model 1’s output and feed that into model 3. So this is not really a cascade.

I guess PGIE/SGIE work as a cascade? I’m just wondering whether it’s possible to cascade SGIEs or whether, if you have several, they run in parallel. In other words: can you not feed one SGIE into another SGIE, so that all SGIEs receive the same input from their PGIE?

Hope this is somehow understandable.

Can you explain the input and output of the three models too?

What is the detector’s output? A detector, as we define it in DeepStream, finds objects and outputs their coordinates; it is not just any inference model.

Cropping and scaling the detections will be done in the pre-processing of the regressor SGIE. You don’t need to split it out as a separate step.

What kind of transformation? Can you describe it in more detail? Since the DeepStream elements handle HW buffers, we need to evaluate whether this can be done with DeepStream.

This is fine with DeepStream. deepstream-test2 works in a similar way: the PGIE detects cars and outputs their coordinates, and the nvinfer instance for the SGIE takes the coordinates, crops the cars out, scales each car image to the model input size and then feeds the images to the classifier model. All of this is done inside nvinfer.
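To make the "every nvinfer instance has its own configuration" idea concrete, here is a hedged Python sketch that writes a minimal secondary-mode nvinfer configuration. The key names (gie-unique-id, process-mode, operate-on-gie-id, model-engine-file, batch-size, network-type, output-tensor-meta) are given as I understand them from the Gst-nvinfer documentation for DeepStream 5.0 and should be checked there; the engine file name and the chosen values are hypothetical, and a real configuration needs further model-specific keys.

```python
import configparser
import io


def make_sgie_config(unique_id, operate_on_gie_id, engine_file, batch_size=16):
    """Return the text of a minimal secondary-mode nvinfer config (sketch)."""
    config = configparser.ConfigParser()
    config["property"] = {
        "gie-unique-id": str(unique_id),
        # 2 = secondary mode: run on objects found upstream, not on full frames.
        "process-mode": "2",
        # Only process objects produced by the GIE with this unique id.
        "operate-on-gie-id": str(operate_on_gie_id),
        # Hypothetical TensorRT engine for the second model.
        "model-engine-file": engine_file,
        "batch-size": str(batch_size),
        # Assumption: 100 marks a network that is neither detector nor classifier.
        "network-type": "100",
        # Assumption: attach raw output tensors as metadata for custom parsing.
        "output-tensor-meta": "1",
    }
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(make_sgie_config(unique_id=2, operate_on_gie_id=1,
                           engine_file="landmarks.engine"))
```

Note that nothing in this sketch specifies a crop or resize: as described above, nvinfer in secondary mode crops each object and scales it to the network input size itself.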

A pipeline that is not a strict cascade is fine with DeepStream.

Actually, SGIE means doing inference on just some parts of the video (the detected objects), while PGIE means doing inference on the whole video frame. DeepStream can currently only run inference on video/image data.

For each batch, the models run one after another. Across the whole video, however, the SGIE may be handling the 20th batch while the PGIE is already working on the 25th; in that sense the PGIE and SGIE run in parallel. SGIEs can receive different inputs depending on their configurations. Every nvinfer instance needs its own configuration file, so everything is configurable.
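The point about SGIEs receiving different inputs depending on configuration can be pictured with a small model. The sketch below is not the plugin's code; it is a simplified Python illustration, assuming that each detected object records the id of the GIE that produced it (DeepStream's object metadata has a unique_component_id field for this) and that an SGIE picks its inputs by matching that id and, optionally, a set of class ids. The class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    unique_component_id: int  # id of the GIE that produced this object
    class_id: int
    label: str


def select_inputs(objects, operate_on_gie_id, operate_on_class_ids=None):
    """Return the objects an SGIE with the given settings would process."""
    return [
        obj for obj in objects
        if obj.unique_component_id == operate_on_gie_id
        and (operate_on_class_ids is None or obj.class_id in operate_on_class_ids)
    ]


if __name__ == "__main__":
    # Objects attached by a PGIE with gie-unique-id 1 (illustrative data).
    frame_objects = [
        DetectedObject(1, 0, "face"),
        DetectedObject(1, 1, "person"),
    ]
    # Two SGIEs that both operate on GIE 1 see the same objects.
    print(select_inputs(frame_objects, operate_on_gie_id=1))
    # An SGIE restricted to class 0 only sees the face.
    print(select_inputs(frame_objects, operate_on_gie_id=1, operate_on_class_ids={0}))
    # An SGIE pointed at GIE 2 gets nothing if GIE 2 added no objects of its own.
    print(select_inputs(frame_objects, operate_on_gie_id=2))
```

Under this model, SGIEs linked one after another in the pipeline can still all work on the PGIE's detections; a true cascade only arises when an upstream GIE adds objects of its own.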

It would be better to read the DeepStream documentation first to learn the basic concepts of DeepStream.

What is the detector’s output? A detector, as we define it in DeepStream, finds objects and outputs their coordinates; it is not just any inference model.

Well, yeah, that’s what I mean. My PGIE also outputs coordinates of objects.

What kind of transformation? Can you describe it in more detail? Since the DeepStream elements handle HW buffers, we need to evaluate whether this can be done with DeepStream.

An affine transformation.
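For readers unfamiliar with landmark-based alignment, here is a minimal Python sketch of one common variant. It is an assumption on my part, not something specified in this thread: it computes a similarity transform (rotation, uniform scale and translation, a restricted form of affine transform) that moves two detected landmarks, for example the eyes, onto fixed target positions in the 112x112 input. The landmark and target coordinates are made up for illustration, and how such a warp would be applied inside a DeepStream pipeline is not covered here.

```python
def similarity_from_two_points(src_a, src_b, dst_a, dst_b):
    """Return the 2x3 matrix [[a, -b, tx], [b, a, ty]] mapping src onto dst."""
    # Treat points as complex numbers: the transform is z' = c * z + t.
    p1, p2 = complex(*src_a), complex(*src_b)
    q1, q2 = complex(*dst_a), complex(*dst_b)
    if p1 == p2:
        raise ValueError("source landmarks must be distinct")
    c = (q2 - q1) / (p2 - p1)  # rotation and uniform scale
    t = q1 - c * p1            # translation
    return [[c.real, -c.imag, t.real],
            [c.imag, c.real, t.imag]]


def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


if __name__ == "__main__":
    # Illustrative eye landmarks in the crop and target positions in 112x112.
    matrix = similarity_from_two_points((40, 60), (80, 60), (38, 52), (74, 52))
    print(matrix)
    print(apply_affine(matrix, (80, 60)))
```

With more than two landmarks, a least-squares fit would replace the closed-form solution used here.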