Handling Multi-Person Pose Data for PoseClassificationNet Fine-Tuning

I am working on fine-tuning the PoseClassificationNet within the Pose Classification pipeline, and I need guidance on handling multi-person scenarios in video clips during dataset preparation.

Current Workflow:

For single-person action videos, my data processing steps are as follows (a rough sketch of steps 4 and 5 is included after the list):

  1. Extract clips from videos with diverse poses and viewpoints.
  2. Run the BodyPose3D model to generate JSON metadata.
  3. Convert 3D points to 2D keypoints.
  4. Convert JSON metadata to NumPy arrays (per video).
  5. Save .pkl files containing the video’s keypoints and corresponding action labels.
  6. Merge arrays and split them into Train, Validation, and Test sets.
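For reference, here is a rough sketch of how I currently handle steps 4 and 5 for a single-person clip; the array shapes, helper name, and file names are just illustrative, not taken from the official scripts:

```python
import pickle
import numpy as np

# Illustrative only: turn one clip's 2D keypoints into a fixed-length array.
# keypoints_per_frame is assumed to be a list of (34, 3) arrays (x, y, confidence)
# derived from the BodyPose3D JSON metadata after projection to 2D.
def clip_to_array(keypoints_per_frame, max_frames=300):
    data = np.zeros((3, max_frames, 34, 1), dtype=np.float32)  # (C, T, V, M)
    for t, kpts in enumerate(keypoints_per_frame[:max_frames]):
        data[:, t, :, 0] = np.asarray(kpts, dtype=np.float32).T  # (3, 34)
    return data

# Dummy keypoints standing in for the real BodyPose3D output.
clip_keypoints = [np.random.rand(34, 3) for _ in range(120)]

# Save the clip's array together with its action label.
sample = {"data": clip_to_array(clip_keypoints), "label": "walking"}
with open("clip_0001.pkl", "wb") as f:
    pickle.dump(sample, f)
```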

Concern:

For videos containing two or more persons performing the same action, I would like clarity on:

  • How should I handle the keypoints of each unique person in a video?
  • What should the NumPy array format look like to support multiple persons?
  • Should I create one combined .npy file per video containing all persons, or separate .npy files per person?
  • How do I assign the correct action label if there are multiple people in a single clip?
  • What is the best practice for splitting Train/Val/Test when multiple persons are present in one video?
  • Are there any NVIDIA-recommended guidelines for handling multi-person action clips in the PoseClassificationNet dataset pipeline?

Additional Context:

  • I am currently following the dataset preparation documentation designed for single-person videos but would like to scale this to handle multi-person cases while preserving the action context.
  • If there are any reference implementations, sample datasets, or scripts for multi-person handling in PoseClassificationNet, please do share.

The input data for training or inference is formatted as a five-dimensional NumPy array (N, C, T, V, M); a short sketch of this layout follows the list:

  1. N indicates the number of sequences.
  2. C stands for the number of input channels, which is set to 3 in this example.
  3. T represents the maximum sequence length in frames, which is 300 in our case (10 seconds at 30 FPS).
  4. V defines the number of joint points, set to 34 for the NVIDIA format.
  5. M means the number of persons. The pre-trained model assumes a single person, but the format can also support multiple people.
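To make the layout concrete, here is a minimal sketch of allocating and filling an array in this (N, C, T, V, M) format. This is my own illustration (the per-clip frame count and person count are assumed values), not code from the TAO documentation:

```python
import numpy as np

N, C, T, V, M = 10, 3, 300, 34, 2  # 10 sequences, up to 2 persons per clip

data = np.zeros((N, C, T, V, M), dtype=np.float32)

# Fill sequence 0: two persons, each with 120 valid frames (the rest stays zero-padded).
num_frames = 120
for person in range(M):
    # Dummy (num_frames, 34, 3) keypoints -> transpose to (3, num_frames, 34)
    kpts = np.random.rand(num_frames, V, C).astype(np.float32)
    data[0, :, :num_frames, :, person] = kpts.transpose(2, 0, 1)

print(data.shape)  # (10, 3, 300, 34, 2)
```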

You can create separate .npy files per person to maintain clarity and ease of processing. This approach allows for more flexible handling of different actions performed by different individuals in the same clip. However, if the actions are identical and you want to simplify the dataset, you could create one combined .npy file per video, ensuring that the keypoints for each person are correctly indexed in the M dimension.
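For the combined-file option, a minimal sketch (assuming you already have one (C, T, V) array per person from the same clip; the file names are hypothetical) is to stack the per-person arrays along a new last axis to form the M dimension:

```python
import numpy as np

C, T, V = 3, 300, 34

# Assumed per-person keypoint arrays extracted from the same clip.
person_a = np.random.rand(C, T, V).astype(np.float32)
person_b = np.random.rand(C, T, V).astype(np.float32)

# Combined file per video: stack along a new last axis -> (C, T, V, M) with M = 2.
clip_array = np.stack([person_a, person_b], axis=-1)
np.save("clip_0001.npy", clip_array)

# Separate-files option: one (C, T, V, 1) array per person.
np.save("clip_0001_person0.npy", person_a[..., np.newaxis])
np.save("clip_0001_person1.npy", person_b[..., np.newaxis])
```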

For multiple people performing the same action in a single clip, you can assign the same action label to all individuals. If different actions are performed, you may need to segment the video into separate clips for each action or use a more complex labeling scheme that accounts for multiple actions per clip.
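One way to keep labels aligned with this scheme is to store one label entry per sequence, so two people performing the same action in one clip simply repeat the clip's label. The (sample_names, labels) tuple layout below is borrowed from common skeleton-based pipelines and is an assumption here, not a confirmed TAO file format:

```python
import pickle

# One entry per sequence (one person in one clip); identical actions share a label.
sample_names = ["clip_0001_person0", "clip_0001_person1", "clip_0002_person0"]
labels       = [3, 3, 7]  # numeric class IDs, e.g. 3 = "waving", 7 = "sitting"

with open("train_label.pkl", "wb") as f:
    pickle.dump((sample_names, labels), f)
```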

When splitting the dataset into Train, Validation, and Test sets, ensure that:

  • Diversity is maintained: Include a variety of actions and scenarios in each set.
  • Consistency is preserved: If using separate files per person, keep all files from a single video in the same set so that near-identical sequences do not leak between Train, Validation, and Test (a sketch follows this list).
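A simple way to enforce this (a sketch, assuming the per-person file names encode the source video ID as in the examples above) is to split by video ID first and then assign every per-person file of that video to the chosen set:

```python
import random
from collections import defaultdict

# Per-person files, keyed by the video they came from (names are illustrative).
files = ["clip_0001_person0.npy", "clip_0001_person1.npy",
         "clip_0002_person0.npy", "clip_0003_person0.npy"]

by_video = defaultdict(list)
for name in files:
    video_id = name.split("_person")[0]   # e.g. "clip_0001"
    by_video[video_id].append(name)

video_ids = sorted(by_video)
random.seed(0)
random.shuffle(video_ids)

n_train = int(0.7 * len(video_ids))
n_val = int(0.15 * len(video_ids))
splits = {
    "train": video_ids[:n_train],
    "val":   video_ids[n_train:n_train + n_val],
    "test":  video_ids[n_train + n_val:],
}

# All per-person files from one video land in the same split.
split_files = {s: [f for v in vids for f in by_video[v]] for s, vids in splits.items()}
print(split_files)
```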

While specific NVIDIA guidelines for handling multi-person scenarios are not detailed, the PoseClassificationNet documentation indicates that multiple persons are supported by adjusting the M dimension of the input array.