I trained an ActionRecognitionNet model using the provided Colab sample code and exported it as a .etlt file.
I gave it to a colleague who has been working on DeepStream. The model ran correctly, but when he tried it on a video clip containing multiple people, the predicted action kept changing as the different people in the clip performed different actions.
I wonder if it’s possible to do the following in DeepStream:
- First, detect all persons appearing in the frame using a detector such as YOLOv4 or PeopleNet.
- Suppose N persons are detected and we have their positions. For each person, run action recognition separately, using that person’s cropped region of the frame as the input to the action recognition net. So if N persons are detected, the action recognition inference has to run N times per frame.
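In nvinfer terms, I imagine this as a primary detector (PGIE) followed by a secondary classifier (SGIE) that operates on the detected person objects, roughly like the sketch below. The file names and class IDs are placeholders, and I’m not sure how the temporal/sequence input that ActionRecognitionNet expects fits this pattern:

```
# Primary GIE config: person detector (e.g. PeopleNet) -- placeholder paths
[property]
gie-unique-id=1
tlt-encoded-model=peoplenet.etlt      # placeholder
process-mode=1                        # run on full frames

# Secondary GIE config: action recognition on each detected person
[property]
gie-unique-id=2
tlt-encoded-model=actionrecognition.etlt  # placeholder (my exported model)
process-mode=2                        # run on objects produced upstream
operate-on-gie-id=1                   # only on objects from the PGIE
operate-on-class-ids=0                # assuming class 0 = person
```

Is this secondary-inference setup the right way to get per-person action labels, or does the 3D/sequence nature of the model require a different mechanism (e.g. a preprocessing plugin that batches frame sequences per object)?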