Handling Multi-Person Pose Data for PoseClassificationNet Fine-Tuning

I am working on fine-tuning the PoseClassificationNet within the Pose Classification pipeline, and I need guidance on handling multi-person scenarios in video clips during dataset preparation.

Current Workflow:

For single-person action videos, my data processing steps are as follows (a rough sketch of steps 4 and 5 is included after the list):

  1. Extract clips from videos with diverse poses and viewpoints.
  2. Run the BodyPose3D model to generate JSON metadata.
  3. Convert 3D points to 2D keypoints.
  4. Convert JSON metadata to NumPy arrays (per video).
  5. Save .pkl files containing the video’s keypoints and corresponding action labels.
  6. Merge arrays and split them into Train, Validation, and Test sets.
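For reference, here is a rough sketch of how I currently handle steps 4 and 5 for a single-person clip; the array shapes, helper name, and file names are just illustrative, not taken from the official scripts:

```python
import pickle
import numpy as np

# Illustrative only: turn one clip's 2D keypoints into a fixed-length array.
# keypoints_per_frame is assumed to be a list of (34, 3) arrays (x, y, confidence)
# derived from the BodyPose3D JSON metadata after projection to 2D.
def clip_to_array(keypoints_per_frame, max_frames=300):
    data = np.zeros((3, max_frames, 34, 1), dtype=np.float32)  # (C, T, V, M)
    for t, kpts in enumerate(keypoints_per_frame[:max_frames]):
        data[:, t, :, 0] = np.asarray(kpts, dtype=np.float32).T  # (3, 34)
    return data

# Dummy keypoints standing in for the real BodyPose3D output.
clip_keypoints = [np.random.rand(34, 3) for _ in range(120)]

# Save the clip's array together with its action label.
sample = {"data": clip_to_array(clip_keypoints), "label": "walking"}
with open("clip_0001.pkl", "wb") as f:
    pickle.dump(sample, f)
```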

Concern:

For videos containing two or more persons performing the same action, I would like clarity on:

  • How should I handle the keypoints of each unique person in a video?
  • What should the NumPy array format look like to support multiple persons?
  • Should I create one combined .npy file per video containing all persons, or separate .npy files per person?
  • How do I assign the correct action label if there are multiple people in a single clip?
  • What is the best practice for splitting Train/Val/Test when multiple persons are present in one video?
  • Are there any NVIDIA-recommended guidelines for handling multi-person action clips in the PoseClassificationNet dataset pipeline?

Additional Context:

  • I am currently following the dataset preparation documentation designed for single-person videos but would like to scale this to handle multi-person cases while preserving the action context.
  • If there are any reference implementations, sample datasets, or scripts for multi-person handling in PoseClassificationNet, please do share.

The input data for training or inference is formatted as a five-dimensional NumPy array (N, C, T, V, M); a short sketch of this layout follows the list:

  1. N indicates the number of sequences.
  2. C stands for the number of input channels, which is set to 3 in this example.
  3. T represents the maximum sequence length in frames, which is 300 in our case (10 seconds at 30 FPS).
  4. V defines the number of joint points, set to 34 for the NVIDIA format.
  5. M means the number of persons. The pre-trained model assumes a single person, but the format can also support multiple people.
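To make the layout concrete, here is a minimal sketch of allocating and filling an array in this (N, C, T, V, M) format. This is my own illustration (the per-clip frame count and person count are assumed values), not code from the TAO documentation:

```python
import numpy as np

N, C, T, V, M = 10, 3, 300, 34, 2  # 10 sequences, up to 2 persons per clip

data = np.zeros((N, C, T, V, M), dtype=np.float32)

# Fill sequence 0: two persons, each with 120 valid frames (the rest stays zero-padded).
num_frames = 120
for person in range(M):
    # Dummy (num_frames, 34, 3) keypoints -> transpose to (3, num_frames, 34)
    kpts = np.random.rand(num_frames, V, C).astype(np.float32)
    data[0, :, :num_frames, :, person] = kpts.transpose(2, 0, 1)

print(data.shape)  # (10, 3, 300, 34, 2)
```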

You can create separate .npy files per person to maintain clarity and ease of processing. This approach allows for more flexible handling of different actions performed by different individuals in the same clip. However, if the actions are identical and you want to simplify the dataset, you could create one combined .npy file per video, ensuring that the keypoints for each person are correctly indexed in the M dimension.
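For the combined-file option, a minimal sketch (assuming you already have one (C, T, V) array per person from the same clip; the file names are hypothetical) is to stack the per-person arrays along a new last axis to form the M dimension:

```python
import numpy as np

C, T, V = 3, 300, 34

# Assumed per-person keypoint arrays extracted from the same clip.
person_a = np.random.rand(C, T, V).astype(np.float32)
person_b = np.random.rand(C, T, V).astype(np.float32)

# Combined file per video: stack along a new last axis -> (C, T, V, M) with M = 2.
clip_array = np.stack([person_a, person_b], axis=-1)
np.save("clip_0001.npy", clip_array)

# Separate-files option: one (C, T, V, 1) array per person.
np.save("clip_0001_person0.npy", person_a[..., np.newaxis])
np.save("clip_0001_person1.npy", person_b[..., np.newaxis])
```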

For multiple people performing the same action in a single clip, you can assign the same action label to all individuals. If different actions are performed, you may need to segment the video into separate clips for each action or use a more complex labeling scheme that accounts for multiple actions per clip.
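One way to keep labels aligned with this scheme is to store one label entry per sequence, so two people performing the same action in one clip simply repeat the clip's label. The (sample_names, labels) tuple layout below is borrowed from common skeleton-based pipelines and is an assumption here, not a confirmed TAO file format:

```python
import pickle

# One entry per sequence (one person in one clip); identical actions share a label.
sample_names = ["clip_0001_person0", "clip_0001_person1", "clip_0002_person0"]
labels       = [3, 3, 7]  # numeric class IDs, e.g. 3 = "waving", 7 = "sitting"

with open("train_label.pkl", "wb") as f:
    pickle.dump((sample_names, labels), f)
```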

When splitting the dataset into Train, Validation, and Test sets, ensure that:

  • Diversity is maintained: Include a variety of actions and scenarios in each set.
  • Consistency is preserved: If using separate files per person, keep all files from a single video in the same set so that near-identical sequences do not leak between Train, Validation, and Test (a sketch follows this list).
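A simple way to enforce this (a sketch, assuming the per-person file names encode the source video ID as in the examples above) is to split by video ID first and then assign every per-person file of that video to the chosen set:

```python
import random
from collections import defaultdict

# Per-person files, keyed by the video they came from (names are illustrative).
files = ["clip_0001_person0.npy", "clip_0001_person1.npy",
         "clip_0002_person0.npy", "clip_0003_person0.npy"]

by_video = defaultdict(list)
for name in files:
    video_id = name.split("_person")[0]   # e.g. "clip_0001"
    by_video[video_id].append(name)

video_ids = sorted(by_video)
random.seed(0)
random.shuffle(video_ids)

n_train = int(0.7 * len(video_ids))
n_val = int(0.15 * len(video_ids))
splits = {
    "train": video_ids[:n_train],
    "val":   video_ids[n_train:n_train + n_val],
    "test":  video_ids[n_train + n_val:],
}

# All per-person files from one video land in the same split.
split_files = {s: [f for v in vids for f in by_video[v]] for s, vids in splits.items()}
print(split_files)
```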

While specific NVIDIA guidelines for handling multi-person scenarios are not detailed, the PoseClassificationNet documentation indicates that multiple persons are supported by adjusting the M dimension of the input array.