How to deploy a skeleton-based action recognition model to DeepStream?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson Orin NX
• DeepStream Version: nvcr.io/nvidia/deepstream:7.0-triton-multiarch
• JetPack Version (valid for Jetson only): Jetpack 6.0
• TensorRT Version: 8.6.2.3
• NVIDIA GPU Driver Version (valid for GPU only)
• Issue Type( questions, new requirements, bugs): questions

Hi everyone,

I followed DeepStream-Yolo-Pose to run Yolov8-pose on the Jetson Orin and it worked. I also added the NvDCF tracker, so now I can get the person ID and pose after nvtracker. I’m using Python.

After that, I want to implement an ST-GCN action classifier as an SGIE, which takes an input sequence of these poses with the same tracker ID and outputs the action class. The model has two inputs with shapes input1: batch x 2 x 15 x 17 and input2: batch x 2 x 14 x 17, where 17 is the number of skeleton joints, 15 is the sequence length, and input2 is the frame-to-frame motion (subtraction) of the skeletons in input1, so its sequence length is 14.
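To make the second input concrete: input2 is just the difference of input1 along the time axis, which is why its sequence length is 15 - 1 = 14. As an illustration (my own notation, not DeepStream code):

#include <array>

// Shapes as in my model: input1 = 2x15x17 poses, input2 = 2x14x17 motion.
using Pose   = std::array<std::array<std::array<float, 17>, 15>, 2>;
using Motion = std::array<std::array<std::array<float, 17>, 14>, 2>;

// input2 is the frame-to-frame difference of input1 along the time axis.
Motion derive_motion(const Pose &input1) {
  Motion input2{};
  for (int c = 0; c < 2; ++c)        // channel: x or y coordinate
    for (int t = 0; t < 14; ++t)     // time step
      for (int j = 0; j < 17; ++j)   // joint index
        input2[c][t][j] = input1[c][t + 1][j] - input1[c][t][j];
  return input2;
}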

So, I have some questions:

  1. How can I obtain the 2x15x17 input of the ST-GCN model I mentioned above?
  2. How do I implement the ST-GCN model as an SGIE, especially since my model has two inputs? Is there any example similar/related to this task?
  3. Do you have any suggestions or recommendations about possible approaches to deploying skeleton-based action recognition on DeepStream?

I’m quite new to DeepStream, so hopefully you can help me. Thanks in advance.

Please consult the author of the Yolov8-pose model for how to make the model output the skeleton joint data as you want.

For how to implement a pose classifier model as an SGIE, we have the TAO pose classification model (Pose Classification | NVIDIA NGC) and the TAO body-pose model (BodyPose3DNet | NVIDIA NGC). The DeepStream sample is deepstream_tao_apps/apps/tao_others/deepstream-pose-classification at master · NVIDIA-AI-IOT/deepstream_tao_apps (github.com)

Thank you, I will check it later.

Actually, the output of the Yolov8-pose model is similar to a normal Yolov8 model’s, so I can access the skeletons by using:

# Cast the object meta from the GList node attached to the frame meta
obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
# DeepStream-Yolo-Pose stores the keypoints in the instance-mask params
data = obj_meta.mask_params.get_mask_array()

So, with the ID assigned by the tracker, how can I create the sequence of skeletons I mentioned (2x15x17) so I can use it for the SGIE? Do you have any suggestion or idea for how to do it? Thank you very much.

Are you asking for an algorithm to get the 17 skeleton joint coordinates from the yolov8-pose output mask data? That depends on the model you are using. Please consult the people who provide the model.

No, I can get the 17 skeleton joints from Yolov8-pose. But when I cast the data from NvDsObjectMeta, I only have the person IDs and their skeleton joints for the current frame, while the ST-GCN requires 15 consecutive skeletons as input.

Assume I have a person with ID 1 in the video stream: how do I stack this person’s skeleton joints into a sequence of 15 so I can put it through the ST-GCN model?
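To make the question concrete, the bookkeeping I have in mind looks roughly like this standalone sketch (my own names, nothing here is DeepStream API):

#include <array>
#include <cstdint>
#include <deque>
#include <unordered_map>

// One frame of one person: 17 joints, channel-major (17 x's, then 17 y's).
using Skeleton = std::array<float, 2 * 17>;
constexpr size_t kSeqLen = 15;

// Sliding window of the most recent skeletons, keyed by tracker ID.
std::unordered_map<uint64_t, std::deque<Skeleton>> history;

// Called once per tracked object per frame with joints parsed from obj_meta.
void push_skeleton(uint64_t track_id, const Skeleton &sk) {
  auto &q = history[track_id];
  q.push_back(sk);
  if (q.size() > kSeqLen) q.pop_front();  // keep only the last 15 frames
}

// A track is ready for the SGIE once its window holds 15 frames; the
// 2x14x17 motion input then follows by frame differencing, as above.
bool sequence_ready(uint64_t track_id) {
  auto it = history.find(track_id);
  return it != history.end() && it->second.size() == kSeqLen;
}

What I don’t know is where this buffering should live in a DeepStream pipeline.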

Please refer to the deepstream_tao_apps/apps/tao_others/deepstream-pose-classification at master · NVIDIA-AI-IOT/deepstream_tao_apps (github.com) sample. The TAO pose classification model (Pose Classification | NVIDIA NGC) needs 300 consecutive frames of 34 key-point coordinates; please refer to our sample.

Thank you, I will check it.

Hi @Fiona.Chen, I ran the deepstream-pose-classification sample. In this sample, the pipeline is pgie (PeopleNet detects persons) → tracker → sgie0 (extract skeletons) → nvdspreprocess1 (preprocess skeletons) → sgie1 (predict action), right?

Now I want to change the PGIE to Yolov8-pose, so the pipeline becomes pgie (Yolov8-pose) → tracker → sgie (ST-GCN predicts action). But your pretrained ST-GCN uses the “nvidia” graph_layout and requires 34 joints, while Yolov8-pose only provides 17 joints. So it seems your pretrained ST-GCN can’t be used together with Yolov8-pose, right? Please correct me if I’m wrong. Therefore, I want to use my own ST-GCN, so the pipeline will be pgie (Yolov8-pose) → tracker → sgie (custom ST-GCN predicts action).

I have some questions:

  1. My ST-GCN model has two inputs, as I mentioned before. How do I change the pipeline as described? Can you guide me through the steps and the changes needed to modify the pipeline based on the deepstream-pose-classification sample?
  2. I saw the labels of the NVIDIA dataset here. I haven’t accessed it yet, but can it be transformed into COCO format?
  3. Your pretrained ST-GCN can only infer a single person, but I want to infer multiple people, so do I have to re-train it?

Thank you very much.

No. The two models can’t be used together without changes.

Both gst-nvinfer and gst-nvdspreprocess are open source. You can modify and customize them to adapt them to your model.

Please raise a topic in the TAO forum for dataset- and model-related questions: Latest Intelligent Video Analytics/TAO Toolkit topics - NVIDIA Developer Forums

Thank you very much. I will ask on the forum if I have any issues.

Hi @Fiona.Chen,

I tried to modify nvinfer for my custom SGIE, but I’m facing this error:

0:00:08.748820341 783977 0xaaab07815b00 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2095> [UID = 2]: deserialized trt engine from :/data/st_gcn_model/st_gcn.onnx_b32_gpu0_fp32.engine
INFO: [FullDims Engine Info]: layers num: 3
0   INPUT  kFLOAT batch_vid       2x15x13         min: 1x2x15x13       opt: 32x2x15x13      Max: 32x2x15x13      
1   INPUT  kFLOAT mot             2x14x13         min: 1x2x14x13       opt: 32x2x14x13      Max: 32x2x14x13      
2   OUTPUT kFLOAT output_action   8               min: 0               opt: 0               Max: 0               

0:00:09.108132840 783977 0xaaab07815b00 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2198> [UID = 2]: Use deserialized engine model: /data/st_gcn_model/st_gcn.onnx_b32_gpu0_fp32.engine
0:00:09.108391563 783977 0xaaab07815b00 ERROR                nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Error in NvDsInferContextImpl::preparePreprocess() <nvdsinfer_context_impl.cpp:1035> [UID = 2]: RGB/BGR input format specified but network input channels is not 3
ERROR: Infer Context prepare preprocessing resource failed., nvinfer error:NVDSINFER_CONFIG_FAILED
0:00:09.136593031 783977 0xaaab07815b00 WARN                 nvinfer gstnvinfer.cpp:912:gst_nvinfer_start:<sgie> error: Failed to create NvDsInferContext instance
0:00:09.137384722 783977 0xaaab07815b00 WARN                 nvinfer gstnvinfer.cpp:912:gst_nvinfer_start:<sgie> error: Config file path: config_infer_second_st_gcn.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED

ERROR: gst-resource-error-quark: Failed to create NvDsInferContext instance (1): /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(912): gst_nvinfer_start (): /GstPipeline:pipeline0/GstNvInfer:sgie:
Config file path: config_infer_second_st_gcn.txt, NvDsInfer Error: NVDSINFER_CONFIG_FAILED

Here is the config_infer_second_st_gcn.txt file:

[property]
gpu-id=0
net-scale-factor=1
onnx-file=/data/st_gcn_model/st_gcn.onnx
model-engine-file=/data/st_gcn_model/st_gcn.onnx_b32_gpu0_fp32.engine
#custom-lib-path=/data/nvdsinfer_custom_st_gcn/nvdspreprocess_lib/libcustom2d_preprocess.so
network-type=1
network-mode=0
batch-size=32
process-mode=2
gie-unique-id=2
operate-on-class-ids=0
#input-tensor-meta=1
output-blob-names=output_action
# Adjust network-input-dims to match your model's input dimensions
#network-input-dims=2;15;13;2;14;13
parse-classifier-func-name=NvDsParseCustomPoseClassification
custom-lib-path=/data/nvdsinfer_custom_st_gcn/infer_pose_classification_parser/libnvdsinfer_pose_classfication_parser.so
classifier-threshold=0.51

[user-configs]
#actual sequence length of frames
frames-sequence-length=15

Can you help me with this problem? Thank you very much.

Your model is not a standard classifier. Please refer to deepstream_tao_apps/apps/tao_others/deepstream-pose-classification at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub for proper settings.

Thank you very much. I just modified it and now I face this error:

0:00:09.996076057 814161 0xaaab281b4500 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:2095> [UID = 2]: deserialized trt engine from :/data/st_gcn_model/st_gcn.onnx_b32_gpu0_fp32.engine
INFO: [FullDims Engine Info]: layers num: 3
0   INPUT  kFLOAT batch_vid       2x15x13         min: 1x2x15x13       opt: 32x2x15x13      Max: 32x2x15x13      
1   INPUT  kFLOAT mot             2x14x13         min: 1x2x14x13       opt: 32x2x14x13      Max: 32x2x14x13      
2   OUTPUT kFLOAT output_action   8               min: 0               opt: 0               Max: 0               

0:00:10.405892658 814161 0xaaab281b4500 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2198> [UID = 2]: Use deserialized engine model: /data/st_gcn_model/st_gcn.onnx_b32_gpu0_fp32.engine
0:00:10.413936882 814161 0xaaab281b4500 WARN                 nvinfer gstnvinfer.cpp:679:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Warning from NvDsInferContextImpl::initNonImageInputLayers() <nvdsinfer_context_impl.cpp:1622> [UID = 2]: More than one input layers but custom initialization function not implemented
0:00:10.413996437 814161 0xaaab281b4500 ERROR                nvinfer gstnvinfer.cpp:676:gst_nvinfer_logger:<sgie> NvDsInferContext[UID 2]: Error in NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1386> [UID = 2]: Failed to initialize non-image input layers
0:00:10.442405376 814161 0xaaab281b4500 WARN                 nvinfer gstnvinfer.cpp:912:gst_nvinfer_start:<sgie> error: Failed to create NvDsInferContext instance
0:00:10.444120849 814161 0xaaab281b4500 WARN                 nvinfer gstnvinfer.cpp:912:gst_nvinfer_start:<sgie> error: Config file path: config_infer_second_st_gcn.txt, NvDsInfer Error: NVDSINFER_CUSTOM_LIB_FAILED

ERROR: gst-resource-error-quark: Failed to create NvDsInferContext instance (1): /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(912): gst_nvinfer_start (): /GstPipeline:pipeline0/GstNvInfer:sgie:
Config file path: config_infer_second_st_gcn.txt, NvDsInfer Error: NVDSINFER_CUSTOM_LIB_FAILED

How do I create a custom function to handle the two inputs? Can you give me some suggestions? Do you have any example or sample similar/related to this?

gst-nvinfer is open source. You need to modify the plugin to accept two input layers.

Do you have any example or document similar to this?

There is no such example, but there is a flow diagram of the gst-nvinfer source code in the DeepStream SDK FAQ: DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums

Hi @Fiona.Chen, I saw this in the website documentation:

[screenshot of the documentation referencing the objectDetector_FasterRCNN sample]

Where can I find the objectDetector_FasterRCNN sample? I don’t see it in deepstream-7.0. Thank you very much.

The FasterRCNN sample was removed since TensorRT 8.x does not support the Caffe model.

You can refer to the sample deepstream_tao_apps/apps/tao_others/deepstream-pose-classification at master · NVIDIA-AI-IOT/deepstream_tao_apps (github.com) for how to implement the NvDsInferInitializeInputLayers() interface declared in /opt/nvidia/deepstream/deepstream/sources/includes/nvdsinfer_custom_impl.h.

Thank you, I will check it.

Hi @Fiona.Chen, I followed your latest reply and I can now initialize the two-input model as an SGIE.
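For reference, the initialization function has roughly this shape (a simplified sketch of what I did, not the sample’s exact code; it only zero-fills the extra input so that context creation succeeds, since the real per-frame tensors come later via tensor meta):

#include <cstring>
#include <vector>
#include "nvdsinfer_custom_impl.h"

// One-time initialization of the non-image input layers, called by
// gst-nvinfer during context creation.
extern "C" bool NvDsInferInitializeInputLayers(
    std::vector<NvDsInferLayerInfo> const &inputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    unsigned int maxBatchSize)
{
  for (auto const &layer : inputLayersInfo) {
    if (layer.dataType == FLOAT && layer.buffer) {
      // inferDims.numElements is the per-sample element count;
      // the buffer covers the whole max batch.
      std::memset(layer.buffer, 0,
          layer.inferDims.numElements * maxBatchSize * sizeof(float));
    }
  }
  return true;
}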

As far as I understand, the SGIE now requires input tensor meta, so I have to add an nvdspreprocess element before the SGIE to form tensors that fit my SGIE. So my question is: how do I copy the tensors into the buffer in nvdspreprocess, given that my model has two inputs? I saw a sample about modifying the nvdspreprocess of deepstream-pose-classification here, but I’m still confused about it.
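My current understanding, as a sketch: the custom library’s CustomTensorPreparation() has to copy the data for every unit in the batch into the buffer acquired from the plugin’s pool. One possible layout is packing both inputs back-to-back (sizes below follow my engine’s dims, 2x15x13 and 2x14x13), but I’m not sure this is how nvinfer expects to receive two layers. The lookup_* helpers are hypothetical stand-ins for my per-track buffering, not SDK functions:

#include <cuda_runtime_api.h>
#include "nvdspreprocess_lib.h"

// Hypothetical stand-ins for the per-track sequence store kept in ctx.
static const float *lookup_pose_sequence(CustomCtx *, const NvDsPreProcessUnit &) {
  static float zeros[2 * 15 * 13] = {};
  return zeros;
}
static const float *lookup_motion_sequence(CustomCtx *, const NvDsPreProcessUnit &) {
  static float zeros[2 * 14 * 13] = {};
  return zeros;
}

// Sketch: pack both model inputs back-to-back for every unit in the batch
// into one device buffer acquired from the plugin's tensor pool.
extern "C" NvDsPreProcessStatus
CustomTensorPreparation(CustomCtx *ctx, NvDsPreProcessBatch *batch,
    NvDsPreProcessCustomBuf *&buf, CustomTensorParams &tensorParam,
    NvDsPreProcessAcquirer *acquirer)
{
  buf = acquirer->acquire();
  const size_t poseBytes = 2 * 15 * 13 * sizeof(float);
  const size_t motBytes  = 2 * 14 * 13 * sizeof(float);
  char *dst = (char *) buf->memory_ptr;

  for (const auto &unit : batch->units) {
    cudaMemcpy(dst, lookup_pose_sequence(ctx, unit), poseBytes,
        cudaMemcpyHostToDevice);
    cudaMemcpy(dst + poseBytes, lookup_motion_sequence(ctx, unit), motBytes,
        cudaMemcpyHostToDevice);
    dst += poseBytes + motBytes;
  }

  // Batch dimension of the attached tensor = number of units processed.
  tensorParam.params.network_input_shape[0] = (int) batch->units.size();
  return NVDSPREPROCESS_SUCCESS;
}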

Can you walk me through this? Thank you very much.