Q&A for the webinar “NVIDIA Pre-trained Models and Transfer Learning Toolkit 3.0 to Create Gesture-based Interactions with a Robot”

Thank you for attending the webinar, “NVIDIA Pre-trained Models and Transfer Learning Toolkit 3.0 to Create Gesture-based Interactions with a Robot”, presented by NVIDIA experts Ekaterina Sirazitdinova and Nyla Worker. We hope you found it informative.

Please note that the webinar is now available on demand here.

You can also access the GitHub repo here for the sample code used in this webinar.

We received a lot of great questions at the end and weren’t able to respond to all of them, so we have consolidated the follow-up questions and answers in this post.

  1. Can we also retrain the first layers of the network?

In the model config file, you can choose to freeze the weights of the batch-norm or convolutional blocks. A minimal spec sketch is shown below, and more information can be found here: https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/object_detection/detectnet_v2.html?highlight=freeze
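
As an illustration only, here is a minimal sketch of the relevant part of a DetectNet_v2 training spec file, assuming a ResNet-18 backbone; the block indices and paths are placeholders, and freeze_blocks/freeze_bn control which parts of the backbone keep their pre-trained weights:

```
model_config {
  pretrained_model_file: "/path/to/pretrained/resnet18.hdf5"
  arch: "resnet"
  num_layers: 18
  # Keep the earliest convolutional blocks fixed (example indices only).
  freeze_blocks: 0
  freeze_blocks: 1
  # Keep the pre-trained batch-norm parameters frozen as well.
  freeze_bn: true
}
```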

  1. Is the quantization error more on the activation function side or does it have a higher impact on the layer size?

QAT affects the values of the activation function at each node, and therefore it affects the layer and model size as well as the computational load; it does not otherwise affect the layer size. Read this developer blog post to learn more about quantization-aware training: https://developer.nvidia.com/blog/improving-int8-accuracy-using-quantization-aware-training-and-the-transfer-learning-toolkit/

  1. What are the requirements to compile a TRT Model on a host architecture?

For TLT models, you can use tlt-converter to convert from the .etlt format to a TensorRT engine file. Download the appropriate tlt-converter for your hardware and software stack from the TLT getting started page; an example invocation is sketched below.
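
As a rough, hypothetical example (the key, input dimensions, output node names, and paths below are placeholders and depend on your model), a tlt-converter invocation for a DetectNet_v2-based .etlt model looks roughly like this:

```
# Example only: substitute your own key, dimensions, node names, and paths.
./tlt-converter model.etlt \
    -k <your_model_key> \
    -d 3,544,960 \
    -o output_cov/Sigmoid,output_bbox/BiasAdd \
    -t fp16 \
    -m 16 \
    -e model.engine
```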

  1. What is the meaning of numbers in the classification output in the demo overlay?

They are the IDs returned by the tracker.

  1. Is this model file uploaded in the repository?

No, we provide the instructions in the GitHub repository. Models can be downloaded from NGC.

  1. Is the model able to detect gestures from a first-person perspective, as in EgoHands?

It is able to detect hands from a first-person perspective, but the results will not be ideal because the gesture recognition model was not trained on first-person data.

  1. Can the Jetsons support int8 acceleration?

Jetson Xavier NX and AGX Xavier support INT8.

  1. Are the model and repo available for the DeepStream Python bindings?

We only provide the C++ version of the DeepStream app for this example, but the same model can be used with a Python-based app as well.

  1. Can this method of working with Jetson be used for human pose detection?

Yes, check out our developer blog post on pose estimation using the DeepStream SDK: https://developer.nvidia.com/blog/creating-a-human-pose-estimation-application-with-deepstream-sdk/

  1. Is there an example to extract metadata?

Yes, a reference is provided in the DeepStream SDK documentation, in the section “MetaData in the DeepStream SDK”. A minimal sketch using the DeepStream Python bindings is shown below.
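
For reference, a minimal sketch of such a probe with the DeepStream Python bindings (pyds) could look like the following; it assumes pyds is installed and that the probe is attached to a pad downstream of the inference element in your own pipeline:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds


def osd_sink_pad_buffer_probe(pad, info, u_data):
    """Walk the DeepStream batch metadata and print the detected objects."""
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            # Each object carries its class label, confidence, and bounding box.
            print(frame_meta.frame_num, obj_meta.obj_label, obj_meta.confidence)
            try:
                l_obj = l_obj.next
            except StopIteration:
                break
        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK
```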

  1. Is there an option to use different pruning techniques in TLT?

No, currently there is only one proprietary pruning method included in TLT.

  1. Does the model detect the hand without a person in the frame?

Yes, the model can detect a hand without a person in the frame.

  1. Can I create a custom architecture in parallel for many models, for example for face and people, at the same time?

Yes, you can also train a single model with multiple labels. Moreover, the PeopleNet model is already capable of detecting persons, bags and faces.

  1. Can we train a model with live feeds?

Training on live feeds is not supported, but you can convert the feed to images and use those as inputs (see the sketch below).
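
For example, a simple way to do this is to grab frames from the live feed with OpenCV and save them to disk, then label them and use them as a regular image dataset for training; this is only a sketch, assuming a camera at index 0:

```python
import os
import cv2  # assumes OpenCV is installed

out_dir = "captured_frames"
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)   # 0 = default camera; an RTSP URL also works here
frame_id = 0
saved = 0

while saved < 500:          # stop after 500 images (arbitrary example limit)
    ok, frame = cap.read()
    if not ok:
        break
    if frame_id % 15 == 0:  # keep roughly every 15th frame to reduce near-duplicates
        cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
        saved += 1
    frame_id += 1

cap.release()
```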

  1. Which mAP calculation system is used for your results? COCO, Pascal VOC, another one?

Pascal VOC 2009 and 2012, though this might be model dependent. All the documentation on the metrics and how to configure them is here: https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/object_detection/yolo_v3.html

  1. Is there a way to train a multilabel model?

Yes, you can train multi-label detectors. For example, the DetectNet_v2 example detects cars, cyclists, and pedestrians: https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/object_detection/detectnet_v2.html

  1. Can you count the number of hand gestures and record them into a table or report? For example, you counted 10 OK signs and 5 Stops?

It is possible, but it requires some customization of the original app (see the sketch below).
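
As one possible (hypothetical) customization, the app's metadata handling could tally the classified gesture labels and write them out as a CSV report; the function names and label strings below are placeholders:

```python
import csv
from collections import Counter

# Counts accumulated while the pipeline runs, e.g. {"ok": 10, "stop": 5}.
gesture_counts = Counter()


def record_gesture(label: str) -> None:
    """Call this for every classified gesture, e.g. from a metadata probe."""
    gesture_counts[label] += 1


def write_report(path: str = "gesture_report.csv") -> None:
    """Dump the accumulated counts into a simple CSV table."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["gesture", "count"])
        for label, count in gesture_counts.most_common():
            writer.writerow([label, count])
```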

If we missed any questions, or you have more questions, please feel free to post them in the forum so that we can further assist you.

Here is a link to a video that walks through the Jupyter notebook (kitti_conversion.ipynb) from the webinar that is used to prepare the EgoHands dataset:
