I set up YOLOv7-pose inference on multiple streams using this app. To perform inference on different ROIs, I also use the nvdspreprocess plugin. However, the inference output is in ROI coordinates, so I need to transform the extracted keypoints back to the frame coordinate system. How can I access the top-left coordinate of the corresponding ROI for each detected object/keypoint from the metadata, so that I can map the keypoints back to frame coordinates?
Yes, the top-left corner of each ROI is written into the preprocess config file. However, with more than one ROI per frame, I cannot find any information in the object meta about which ROI an object was detected in, so I do not know which ROI's top-left corner to use for the translation.
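For reference, the ROI rectangles themselves can be read from the batch-level user meta that nvdspreprocess attaches. Below is a minimal sketch of that lookup, assuming the GstNvDsPreProcessBatchMeta / NvDsRoiMeta structures and the NVDS_PREPROCESS_BATCH_META type from nvdspreprocess_meta.h (header and field names may differ between DeepStream versions) — but I still do not see how to tell which of these ROIs a given object meta came from:

```cpp
// Sketch: read the ROI rectangles that nvdspreprocess attaches as batch user meta.
// Assumes the GstNvDsPreProcessBatchMeta / NvDsRoiMeta definitions from
// nvdspreprocess_meta.h; verify the header and field names against your DeepStream version.
#include "gstnvdsmeta.h"
#include "nvdspreprocess_meta.h"

static void
print_preprocess_rois (NvDsBatchMeta *batch_meta)
{
  for (NvDsUserMetaList *l = batch_meta->batch_user_meta_list; l != NULL; l = l->next) {
    NvDsUserMeta *user_meta = (NvDsUserMeta *) l->data;
    if (user_meta->base_meta.meta_type != NVDS_PREPROCESS_BATCH_META)
      continue;

    GstNvDsPreProcessBatchMeta *preprocess_meta =
        (GstNvDsPreProcessBatchMeta *) user_meta->user_meta_data;

    for (const NvDsRoiMeta &roi_meta : preprocess_meta->roi_vector) {
      /* roi_meta.roi holds the ROI rectangle in frame coordinates,
       * i.e. the top-left corner needed for the back-transformation. */
      g_print ("source %u: ROI top-left = (%.0f, %.0f)\n",
          roi_meta.frame_meta->source_id, roi_meta.roi.left, roi_meta.roi.top);
    }
  }
}
```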
In theory, you don't need to handle the coordinate transformation yourself; we convert that internally. You can refer to the attach_metadata_detector function in the open-source sources\gst-plugins\gst-nvinfer\gstnvinfer_meta_utils.cpp.
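In essence, the idea there is that the bounding box parsed in network/ROI coordinates is rescaled and then shifted by the ROI origin before it is stored in the object meta. A simplified sketch of that idea (not the literal source code; please check gstnvinfer_meta_utils.cpp for the exact handling of scale ratios and offsets):

```cpp
// Simplified paraphrase of the conversion done when attaching detector metadata:
// detections parsed in network/ROI coordinates are rescaled and shifted by the
// ROI top-left before being written to obj_meta->rect_params.
#include "nvdsmeta.h"

static void
to_frame_coords (NvOSD_RectParams *rect,
    double scale_ratio_x, double scale_ratio_y,   /* ROI -> network input scaling */
    double roi_left, double roi_top)              /* ROI top-left in the frame */
{
  rect->left   = rect->left   / scale_ratio_x + roi_left;
  rect->top    = rect->top    / scale_ratio_y + roi_top;
  rect->width  = rect->width  / scale_ratio_x;
  rect->height = rect->height / scale_ratio_y;
}
```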
Thanks for the answer. In the linked repository the keypoints are attached to the mask_params data, so I think they are not transformed by default, and I would have to transform them in the same way as in the source code you mentioned. Since this is just a prototype, I simply attach the bounding-box coordinates to the mask_params data as well and calculate the shift from the difference between these coordinates and the internally transformed box coordinates.
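Concretely, on the parser side this means appending the raw (untransformed) box to the float buffer that later ends up in mask_params. A sketch of that part of my workaround, assuming an instance-mask style parser that fills NvDsInferInstanceMaskInfo and keypoints packed as 17 (x, y, conf) triples (the function name and layout are from my prototype, not from the reference app):

```cpp
// Sketch of the parser-side part of the workaround: besides the keypoints,
// also store the *untransformed* bbox in the float buffer that nvinfer later
// copies into obj_meta->mask_params. Assumes NvDsInferInstanceMaskInfo output
// and 17 keypoints packed as (x, y, conf) triples.
#include <cstring>
#include "nvdsinfer_custom_impl.h"

static const int kNumKeypoints = 17;

static void
fill_keypoint_mask (NvDsInferInstanceMaskInfo &obj,
    const float *kpts /* 17 * 3 floats in network/ROI coordinates */,
    float raw_left, float raw_top, float raw_width, float raw_height)
{
  const int kpt_floats = kNumKeypoints * 3;
  obj.mask = new float[kpt_floats + 4];
  obj.mask_size = sizeof (float) * (kpt_floats + 4);
  obj.mask_width = kpt_floats + 4;   /* flat layout, one "row" */
  obj.mask_height = 1;

  std::memcpy (obj.mask, kpts, sizeof (float) * kpt_floats);

  /* Append the untransformed bbox so a downstream probe can recover the
   * ROI shift by comparing it with the internally transformed rect_params. */
  obj.mask[kpt_floats + 0] = raw_left;
  obj.mask[kpt_floats + 1] = raw_top;
  obj.mask[kpt_floats + 2] = raw_width;
  obj.mask[kpt_floats + 3] = raw_height;
}
```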
OK. When you run this demo with nvdspreprocess, is the bbox in the ROI drawn correctly? You can also refer to our deepstream-pose-classification demo to implement a similar feature.
Thanks for the reply and the reference. The bbox is transformed and displayed correctly by default. Here is a snapshot of the default output without transformed keypoints:
To transform the keypoints, I used the difference between the untransformed bbox predictions (which are also part of the YOLOv7-pose model output) and the internally transformed bbox predictions. It is not very clean, but it works well:
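Roughly, the probe does the following (a sketch of my workaround; it assumes the mask layout from the parser sketch above, i.e. 17 (x, y, conf) triples followed by the raw bbox, and it only corrects the translation, not a possible ROI rescaling):

```cpp
// Sketch of the probe-side part of the workaround: recover the ROI shift from
// the difference between the internally transformed rect_params and the raw
// bbox stored at the end of mask_params, then translate the keypoints.
// Assumes the (x, y, conf) * 17 + raw-bbox layout from the parser sketch and
// that the ROI is not rescaled (pure translation).
#include "gstnvdsmeta.h"

static const int kNumKeypoints = 17;

static void
translate_keypoints_to_frame (NvDsObjectMeta *obj_meta)
{
  float *data = obj_meta->mask_params.data;
  if (data == NULL)
    return;

  const int kpt_floats = kNumKeypoints * 3;
  float raw_left = data[kpt_floats + 0];
  float raw_top  = data[kpt_floats + 1];

  /* rect_params was already shifted into frame coordinates by nvinfer,
   * so the difference to the raw bbox gives the ROI offset. */
  float shift_x = obj_meta->rect_params.left - raw_left;
  float shift_y = obj_meta->rect_params.top  - raw_top;

  for (int k = 0; k < kNumKeypoints; k++) {
    data[k * 3 + 0] += shift_x;   /* x */
    data[k * 3 + 1] += shift_y;   /* y */
    /* data[k * 3 + 2] is the keypoint confidence, left untouched */
  }
}
```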