TAO PoseClassification and training data

Hi, this post is mostly to clarify some questions I have about retraining the PoseClassification model to classify new poses. Right now I am working with three new poses: squatting, fighting, and lying down. My questions are about the input data parameters: “number of sequences”, “maximum sequence length in frames”, and “number of persons”.

  1. Regarding “number of sequences”: this variable represents the number of elements in my training data. I’m not sure whether these sequences must come from a video, e.g. 50 consecutive frames where the same person keeps changing position so the skeleton changes continuously, or whether they can also be unrelated images, so the skeletons in the training data have no connection to each other.

  2. Regarding “maximum sequence length in frames”: this is similar to the previous question. The documentation says “which is 300 (10 seconds for 30 FPS) in the NGC model”, and it also provides the training data, a NumPy array with a shape of (9441, 3, 300, 34, 1). So, do the 300 elements along the third dimension all correspond to the same person’s skeleton?

  3. Regarding “number of persons”: if I use two videos of 50 frames each and they contain different numbers of persons, how do I manage this? Should there be two files, i.e. (50, 3, 300, 34, 2) and (50, 3, 300, 34, 1)?

I am aware that some of my questions may seem trivial, but I need to be sure of the answers in order to proceed with the development. Any help is appreciated, and thank you in advance!


The input layout is NCTVM, where N is the batch size, C is the number of input channels, T is the sequence length, V is the number of keypoints, and M is the number of people (see the sketch after the list below).

  1. The sequences are consecutive video sequences, not unrelated images.
  2. See Pose Classification | NVIDIA NGC. If the shape is (9441, 3, 300, 34, 1), it means that the maximum sequence length is 300 and the number of persons is 1. Yes, the 300 elements correspond to the same person’s skeleton.
  3. Yes, one video can be set to (50, 3, 300, 34, 2) and the other to (50, 3, 300, 34, 1).
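
As a minimal sketch of what this layout looks like in practice (the array sizes and the helper function below are illustrative, not taken from the NGC dataset):

```python
import numpy as np

# Illustrative sizes: 500 training sequences, 3 channels per keypoint,
# up to 300 frames, 34 keypoints, 1 person.
N, C, T, V, M = 500, 3, 300, 34, 1

data = np.zeros((N, C, T, V, M), dtype=np.float32)

def add_sequence(data, index, keypoints):
    """Place one tracked skeleton sequence into the training array.

    `keypoints` has shape (num_frames, V, C): consecutive frames of the
    same person. Clips shorter than T stay zero-padded at the end.
    """
    num_frames = min(keypoints.shape[0], T)
    # (frames, keypoints, channels) -> (channels, frames, keypoints)
    data[index, :, :num_frames, :, 0] = keypoints[:num_frames].transpose(2, 0, 1)

# A 50-frame clip of one person (random values stand in for real
# pose-estimator output).
clip = np.random.rand(50, V, C).astype(np.float32)
add_sequence(data, 0, clip)
print(data.shape)  # (500, 3, 300, 34, 1)
```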

Thank you for your prompt response. I have one more question: if I want to use the same video with varying numbers of skeletons per frame, such as a video of a group of people doing a workout, do I need to generate different sets of training data with different numbers of skeletons per frame, and then train the model on each distinct dataset? For example, by processing the video three times, resulting in (50, 3, 300, 34, 1), (50, 3, 300, 34, 2), and (50, 3, 300, 34, 3), and then training the model on each one? Or is there a different way to approach this?

It is not needed; the model supports training with multiple people.
Also, let me clarify my earlier comments.
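
If different sequences contain different numbers of people, one option (an assumption on my side, not something specified above) is to fix M at an upper bound and zero-pad the unused person slots:

```python
import numpy as np

C, T, V = 3, 300, 34
M_MAX = 3  # assumed upper bound on people per sequence

def pack_people(skeletons):
    """Stack per-person (C, T, V) sequences into one (C, T, V, M_MAX)
    block, leaving absent person slots as zeros."""
    block = np.zeros((C, T, V, M_MAX), dtype=np.float32)
    for m, skeleton in enumerate(skeletons[:M_MAX]):
        block[..., m] = skeleton
    return block

# A window with two tracked people; the third slot stays all-zero.
two_people = [np.random.rand(C, T, V).astype(np.float32) for _ in range(2)]
sample = pack_people(two_people)
print(sample.shape)  # (3, 300, 34, 3)
```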

Each sequence has a maximum length (T). A sequence longer than T needs to be broken into multiple short sequences before being fed to the model, and the model returns a predicted action for each short sequence.

N is the maximum number of sequences that the GPU can process in parallel at a time.
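
As a sketch of that splitting step (the zero-padding of the last chunk is my assumption for illustration):

```python
import numpy as np

def split_sequence(seq, T=300):
    """Break one (C, num_frames, V, M) sequence into chunks of length T
    along the time axis, zero-padding the last chunk to a full T frames.
    Returns an array of shape (n_chunks, C, T, V, M)."""
    C, num_frames, V, M = seq.shape
    n_chunks = int(np.ceil(num_frames / T))
    chunks = np.zeros((n_chunks, C, T, V, M), dtype=seq.dtype)
    for i in range(n_chunks):
        piece = seq[:, i * T:(i + 1) * T]
        chunks[i, :, :piece.shape[1]] = piece
    return chunks

# A 750-frame single-person sequence becomes three 300-frame chunks;
# the last one is padded with 150 frames of zeros.
long_seq = np.random.rand(3, 750, 34, 1).astype(np.float32)
batch = split_sequence(long_seq)
print(batch.shape)  # (3, 3, 300, 34, 1)
```

The model then returns one predicted action per chunk, and up to N such chunks can be processed together in a single batch.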


Now everything is clear, thank you!
