Understanding settings for secondary classifier

I’m still struggling with DeepStream using Python to try out different models. I have some questions that I can’t find the answers to in the DeepStream documentation.

• Hardware Platform (Jetson / GPU)
Jetson AGX Xavier
• DeepStream Version
• JetPack Version (valid for Jetson only)
• Issue Type

  • I’m trying out different models like “Facial Landmarks” (NVIDIA NGC).
    I’m not able to fill in the configuration file correctly: I need to define the input and output names of the network, but I can’t see them documented on the page for the model. How do I find the input/output names for any etlt model, since this is a common problem? I’ve seen your recommendation to look at other configuration files, but that does not actually answer the question. In this case, there is no configuration file.

  • The facial landmark model can produce different numbers of outputs: 68, 80 or 104. How do I specify which output I would like? Is this set in the configuration file somehow?

  • When looking in the DeepStream examples for e.g. secondary classifiers like car-make or car-color, I can’t find which function converts the tensor output from the model into the metadata structure that is passed to the pads/sinks. When is this done automatically, and when do I need to create my own converter? If I e.g. use my own bounding-box model, how can I re-use the functions you are using? Do I need to write my own converter function for the facial landmark model, or is it done magically as in the other examples? How do I scale the coordinates to the image?

  • The image is normalized, resized and converted to a tensor as input to the primary detector. But for the secondary classifier, is the image already normalized? Is that why the scaling factor is set to 1 in the examples, e.g. car-color?

  • I do not understand how to set the network-type (0: detector, 1: classifier, etc.).
    The facial landmark model is not a detector, since it does not produce bounding boxes, and it’s not a classifier either. Does 0 mean that the output is a regression problem, and does 1 convert one-hot encoded outputs to classes? What exactly is this switch doing?

  • For the primary classifier/detector it’s common to define the shape of the input tensor (infer-dims=3;160;160), but I can’t find this switch in the examples I’ve found for the secondary classifier. Does the network detect the input shape automatically, or when is this switch needed? Is the preprocessing able to both up- and down-size an image?

  • I would like to save the metadata to a file in JSON format and have been looking at the “gst-nvmsgconv” plugin (Gst-nvmsgconv — DeepStream 5.1 Release documentation). But I can’t understand how to use it to convert the metadata to JSON. I was expecting the “payload-type” property to let me choose a format such as JSON, since there seem to be different formats, but this property seems to control how much information is stored in the message. Another question that pops up: “PAYLOAD_DEEPSTREAM” is one setting, but it’s not defined which number it corresponds to. I’m guessing it’s 0 or 1; why is this not defined?

  • Regarding gst-nvmsgconv, where is the final result stored, so that I can access it using a pad and save it to a file?

Firstly, for model-related questions, please create a topic in the TLT forum: Latest Intelligent Video Analytics/Transfer Learning Toolkit topics - NVIDIA Developer Forums

The inference plugin gst-nvinfer (Gst-nvinfer — DeepStream 5.1 Release documentation) converts the model output to metadata. The source code is in /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvinfer/ and /opt/nvidia/deepstream/deepstream/sources/libs/nvdsinfer. Before you investigate the implementation of DeepStream, please make sure you are familiar with GStreamer (https://gstreamer.freedesktop.org/) coding. DeepStream by default supports some specific types of models, which we categorize as detector, classifier, segmentation and instance segmentation. Facial landmarks is not any one of them, so it may need a lot of customization to integrate the model with DeepStream. It is a must to read and understand the gst-nvinfer source code at least.
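On the coordinate-scaling part of the question: a landmark model typically emits keypoints in the coordinate space of its network input (e.g. an 80x80 crop of the detected face). A minimal plain-Python sketch of mapping such points back into the full frame, assuming the secondary model ran on a resized crop of the object’s bounding box (all names here are illustrative, not DeepStream API):

```python
def scale_landmarks(landmarks, net_w, net_h, box_left, box_top, box_w, box_h):
    """Map (x, y) keypoints from network-input coordinates back into frame
    coordinates, assuming the network saw a resized crop of the bbox.
    Illustrative helper, not part of the DeepStream API."""
    sx = box_w / net_w   # horizontal stretch from net input to bbox
    sy = box_h / net_h   # vertical stretch from net input to bbox
    return [(box_left + x * sx, box_top + y * sy) for (x, y) in landmarks]

# Example: an 80x80 network input mapped onto a 160x240 face box at (100, 50)
pts = scale_landmarks([(40, 40), (0, 0)], 80, 80, 100, 50, 160, 240)
# pts is [(180.0, 170.0), (100.0, 50.0)]
```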
What kind of converter do you want? If you want to use your own bbox model, you need to customize the pre-processing of the nvinfer plugin.
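For a model that is neither a detector nor a classifier, one common approach (sketched here from the Gst-nvinfer documentation; verify the exact property names against your DeepStream version) is to disable built-in parsing and attach the raw output tensors as metadata, then parse them yourself in a downstream pad probe:

```ini
[property]
# 100 = "other": nvinfer runs inference but applies no built-in parsing
network-type=100
# attach the raw output tensors (NvDsInferTensorMeta) to the buffer so a
# downstream pad probe can convert them into custom metadata
output-tensor-meta=1
```

This is the pattern used by custom-parser samples such as deepstream-ssd-parser in the Python bindings.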

Yes, gst-nvinfer will also do the resize, normalization and conversion to adapt the image to the model input. The factor value is decided by the model, not by DeepStream. For the sample car-color model, the normalization factor is 1.
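To make the scaling factor concrete: the Gst-nvinfer documentation gives the per-pixel preprocessing as roughly y = net-scale-factor * (x - offset). A small plain-Python illustration of that formula (not DeepStream code):

```python
def preprocess_pixel(x, net_scale_factor=1.0, offset=0.0):
    # nvinfer-style per-channel preprocessing: y = net-scale-factor * (x - offset)
    return net_scale_factor * (x - offset)

# With net-scale-factor = 1 (as in the car-color sample config) the pixel
# passes through unchanged; 1/255 would rescale 8-bit pixels into [0, 1].
unchanged = preprocess_pixel(128)
scaled = preprocess_pixel(255, net_scale_factor=1.0 / 255.0)
```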

Please refer to Object Detection — Transfer Learning Toolkit 3.0 documentation (nvidia.com)

infer-dims only works with UFF models for now.

gst-nvmsgconv is a sample and it is open source too: /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvmsgconv and /opt/nvidia/deepstream/deepstream/sources/libs/nvmsgconv. It also requires good GStreamer coding skills before you start with these implementations. E.g. “PAYLOAD_DEEPSTREAM” is a sample which defines some messages and a format which can be transferred to a server. If you go through the code, you will understand how it is defined and used to match the messages as defined.
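On the numbering question for “payload-type”: the values come from the NvDsPayloadType enum in the DeepStream headers (nvdsmeta_schema.h). The mapping below is a hedged summary from memory; please verify it against the headers in your own install:

```python
# Hedged mapping of gst-nvmsgconv payload-type values to NvDsPayloadType
# enum names (taken from nvdsmeta_schema.h; verify against your install).
PAYLOAD_TYPES = {
    0: "NVDS_PAYLOAD_DEEPSTREAM",          # full message schema
    1: "NVDS_PAYLOAD_DEEPSTREAM_MINIMAL",  # minimal message schema
    0x101: "NVDS_PAYLOAD_CUSTOM",          # user-provided converter library
}

# e.g. payload-type=0 on the nvmsgconv element selects the full schema
name = PAYLOAD_TYPES[0]
```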

Please refer to the code /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-test4.

DeepStream is only an SDK; you need to understand the interfaces and usage through the documentation and source code.

For question 1, see the reference in How to find the input/output layers names of tlt/etlt model - #8 by Morganh . You can find the command in that shell script.
For question 2, see Facial Landmarks Estimation — Transfer Learning Toolkit 3.0 documentation

Thanks Morganh, but I don’t think those answer any of my questions. I can’t find any method for finding the input/output of the etlt model. Do you mean that I should use the tlt-converter? But I still need to know the input names to be able to use that one, and the gaze-net has several inputs.

Regarding question 2, the documentation only tells me how to set up the model for training. I’m interested in using the pre-trained etlt model. What settings have been used?

Why isn’t this documented on the page for the pre-trained model?

Thanks for your answers,
I was thinking of your comment “Deepstream is only a SDK, you need to understand the interfaces and usages with document and source code.”

I almost agree with you, but in my view the documentation should be enough to be able to use existing components. For example, with “gst-nvmsgconv”, why should I read the source code just to change between different message formats? Why is there no link to the open-source page of that module?
Another example is the “infer-dims” property that only applies to UFF models. Should I find that out in the source code too? I was expecting this to be documented, to help me as a user know which parameters to use.

I guess we have different views of what should be documented… Sorry for my complaints. I appreciate the help you are giving, but I am frustrated that it’s so hard to deploy a model using DeepStream when it’s so easy in theory.

If the messages we have encapsulated meet your requirements, you can use the plugin directly.
For most users, the message is customized. Different people and different projects will need different messages. We provide the interface to contain these messages and transfer them, but the generation of the message should be implemented by the user. That is why we provide such examples. Or you can just refer to the interface of nvmsgconv (Gst-nvmsgconv — DeepStream 5.1 Release documentation) to implement your own message conversion, ignoring our code.
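If all you need is JSON on disk (rather than sending messages to a broker), one option is to skip nvmsgconv entirely and serialize the metadata yourself in a pad probe. Below is a minimal sketch of only the serialization step in plain Python; in a real pipeline the field values would be read from pyds.NvDsObjectMeta inside a buffer probe, and the function name and field layout here are illustrative, not a DeepStream schema:

```python
import json

def object_to_dict(obj_id, label, left, top, width, height, confidence):
    """Build a JSON-serializable record for one detected object.
    In a real DeepStream probe these values would come from
    pyds.NvDsObjectMeta; here they are plain arguments for illustration."""
    return {
        "id": obj_id,
        "label": label,
        "bbox": {"left": left, "top": top, "width": width, "height": height},
        "confidence": confidence,
    }

# One object serialized as a JSON line (append such lines to a .jsonl file
# from the probe, e.g. open("metadata.jsonl", "a").write(line + "\n"))
record = object_to_dict(1, "car", 10.0, 20.0, 100.0, 50.0, 0.93)
line = json.dumps(record)
```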

  1. For the input/output name, refer to How to find the input/output layers names of tlt/etlt model - #3 by Morganh

  2. In the doc link I shared with you, please see the table. num_keypoints supports 68, 80 or 104; you can change it in the training spec file:
    num_keypoints: number of facial keypoints (68, 80 or 104)

For further questions, please create topics in the TLT forum.