SV3DT: projection matrix

Dear @kesong ,

I would like to discuss an issue with SV3DT.

Continuing the discussion from Having issues with Single-View 3D Tracking (SV3DT):

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): dGPU
• DeepStream Version: 7.0
• JetPack Version (valid for Jetson only)
• TensorRT Version: TRT 8.6.1.6
• NVIDIA GPU Driver Version (valid for GPU only): 555.42.06
• Issue Type( questions, new requirements, bugs): question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

Hi,

I have obtained a projection matrix for a camera and would like to use SV3DT, but the displayed bounding boxes are very SMALL.
I changed from projectionMatrix_3x4 to projectionMatrix_3x4_w2p, but the displayed bounding boxes became even smaller.

Could you take a look at this issue? Thanks so much for your time.

Here is how I obtain the projection matrix. I have also back-projected a 2D image point onto the z=0 plane, and it works.

import numpy as np
# Given intrinsic
K = np.array([  [9.32545185e+02, 0.00000000e+00, 1.10520066e+03], 
                [0.00000000e+00, 9.32545185e+02, 4.84232068e+02], 
                [0.00000000e+00, 0.00000000e+00, 1.00000000e+00]], dtype=np.float32)
# Given extrinsic
R = np.array([[ 0.753345,  0.657275,  0.021486],
            [-0.074001,  0.052263,  0.995888],
            [ 0.653449, -0.751837,  0.088011]], dtype=np.float32)
t = np.array([[-9862.566594],
            [ 1645.370521],
            [13956.835683]], dtype=np.float32)
# Find a matrix that maps from World to Cam
world2cam = np.hstack((R, t))
world2cam = np.vstack([world2cam, np.array([0, 0, 0, 1], dtype=np.float32)])

# Compute the projection matrix
K_h = np.hstack([K, np.zeros((3, 1), dtype=np.float32)])
P = np.dot(K_h, world2cam)
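
For reference, here is a rough sketch of the z=0 back-projection check mentioned above (it reuses P from the snippet; the pixel values are placeholders):

# Back-project an (undistorted) pixel onto the z=0 world plane.
# For z=0 the projection reduces to a homography H = [p1 p2 p4]
# (columns 1, 2, and 4 of P), so the world point is H^-1 @ [u, v, 1].
H = P[:, [0, 1, 3]]
uv1 = np.array([1000.0, 600.0, 1.0], dtype=np.float32)  # placeholder pixel
XY_h = np.linalg.inv(H) @ uv1
X, Y = XY_h[:2] / XY_h[2]  # world coordinates on the z=0 plane, in mm
print(X, Y)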

Next, I would like to evaluate SV3DT on a video, so I use deepstream-app and modify config_tracker_NvDCF_max_perf.yml:

StateEstimator:
  stateEstimatorType: 3  # the type of state estimator among { DUMMY=0, SIMPLE=1, REGULAR=2 }
...
ObjectModelProjection:
  cameraModelFilepath: # camera calibration file for each stream
  - camLeftInfo.yml
  outputVisibility: 1 # output visibility by occlusion
  outputFootLocation: 1 # output foot location estimated from 3D model
  outputConvexHull: 1 # output convex hull for each object estimated from 3D cylinder model
projectionMatrix_3x4:
  - 1424.720302 
  - -217.99179 
  - 117.305913
  - 6227815.536202
  - 247.411798
  - -315.325942
  - 971.32802
  - 8292729.559499
  - 0.653449
  - -0.751837
  - 0.088011
  - 13956.835683

# the height and radius of the cylinder model
modelInfo:
  height: 205
  radius: 33

The blue dot on frame_1000_viz is projected from a 3D point, and it is very close to the intersection of the two green lines, which suggests the projection matrix is correct.

I define the origin of the world coordinate system Oxyz; the x and y axes are correct, but I am not sure about z.

We have an SV3DT sample here: deepstream_reference_apps/deepstream-tracker-3d/README.md at master · NVIDIA-AI-IOT/deepstream_reference_apps · GitHub. You can refer to this sample.
Is the bbox size correct if you disable SV3DT?

Dear @kesong ,

I can reproduce the results from that repo.

Now, I would like to apply SV3DT to solve our occlusion issue in multi-object tracking.

If I disable SV3DT, the bbox sizes are correct.

When doing the calibration (obtaining the projection matrix), the unit I use is mm. I do not know whether the unit affects the result.

If I increase the human body model size to

# the height and radius of the cylinder model
modelInfo:
  height: 1250
  radius: 132

I obtain the following output.

Dear @kesong ,

In addition, I have checked this doc: The 3x4 Camera Projection Matrix, which reads:

"
For projectionMatrix_3x4 in a camera model file (e.g., camInfo-01.yml), the principal point (i.e., (Cx, Cy)) in the camera matrix is assumed to be at (0, 0) as image coordinates. But, the optical center (i.e., (Cx, Cy)) is located at the image center (i.e., (img_width/2, img_height/2)). Thus, to move the origin to the left-top of the camera image (i.e., the pixel coordinates), SV3DT internally adds (img_width/2, img_height/2) after the transformation using the camera matrix provided in projectionMatrix_3x4.
"

However, the (Cx, Cy) obtained from the intrinsic camera calibration is not always (img_width/2, img_height/2). In my case, (img_width/2, img_height/2) = (1920/2, 1080/2) = (960, 540), while (Cx, Cy) = (1105, 484).
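
If I understand the doc correctly, one possible way to reconcile this (just my guess, please correct me if I am wrong) would be to express the principal point relative to the image center before forming the matrix:

import numpy as np

# Assumption: since SV3DT internally adds (img_width/2, img_height/2), the
# camera matrix given to it should have its principal point expressed
# relative to the image center rather than the top-left corner.
img_w, img_h = 1920, 1080
K = np.array([[932.545185, 0.0, 1105.20066],
              [0.0, 932.545185, 484.232068],
              [0.0, 0.0, 1.0]], dtype=np.float32)
K_centered = K.copy()
K_centered[0, 2] -= img_w / 2.0  # Cx: 1105.2 -> 145.2
K_centered[1, 2] -= img_h / 2.0  # Cy: 484.2 -> -55.8
# P for SV3DT would then be np.hstack([K_centered, np.zeros((3, 1), dtype=np.float32)]) @ world2cam,
# with the same world2cam as in my first snippet.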

Can you share the original video with us so we can have a check on our side? How do you ensure the intrinsic and extrinsic matrices of your camera are correct?

Dear @kesong ,

Please find it here: GX3L_1080pFHR.mp4 - Google Drive

Dear @kesong ,

Here is how I find and validate the projection matrix.

Briefly, I use a built-in function of the OpenCV lib, i.e., cv2.solvePnP.

Then I take a 3D point and project it onto the image plane using two methods:

  • uv, _ = cv2.projectPoints(center_3D, rvec, tvec, cmx, dist)
  • _, center_2D_h, _ = cam.forwardprojectP(center_3D.reshape(3,1), P, (6000, 6000))

I check that the results are very close.

import numpy as np
from python_camera_library import rectilinear_camera as cam # https://github.com/JarnoRalli/python_camera_library
import cv2

def draw_corners(corners, img, lt=2):
    for i in range(corners.shape[0]-1):
        p1 = np.squeeze(corners[i])
        p1 = int(p1[0]), int(p1[1])
        
        p2 = np.squeeze(corners[i+1])
        p2 = int(p2[0]), int(p2[1])
        
        cv2.line(img, p1, p2, (0, 255, 0), thickness=lt)
    p1 = np.squeeze(corners[0])
    p1 = int(p1[0]), int(p1[1])
    cv2.line(img, p2, p1, (0, 255, 0), thickness=lt)
    
def draw(img, corners, imgpts):
    imgpts = imgpts.astype(np.int32)
    corner = tuple(corners[0].ravel())
    corner = (int(corner[0]), int(corner[1]))
    cv2.line(img, corner, tuple(imgpts[0].ravel()), (255,0,0), 10)
    cv2.line(img, corner, tuple(imgpts[1].ravel()), (0,255,0), 10)
    cv2.line(img, corner, tuple(imgpts[2].ravel()), (0,0,255), 10)    

def draw_point(image, point, radius=10):
    point = np.squeeze(point)
    point = (int(point[0]), int(point[1]))
    cv2.circle(image, point, radius=radius, color=(255, 255, 255), thickness=-1)

def rtvec_to_matrix(rvec, tvec):
    "Convert rotation vector and translation vector to 4x4 matrix"
    rvec = np.asarray(rvec)
    tvec = np.asarray(tvec)

    T = np.eye(4)
    (R, jac) = cv2.Rodrigues(rvec)
    T[:3, :3] = R
    T[:3, 3] = tvec.squeeze()
    return T

# 15240 mm
axis = np.float32([[15240/4.0,0,0], [0,8530/3.0,0], [0,0,-8530/4.0]]).reshape(-1,3)
center_3D = np.array([15240/2.0, 8530/2.0, 0]).reshape(-1,3)

# intrinsic matrix
cmx = np.array([[9.32545185e+02, 0.00000000e+00, 1.10520066e+03], 
                [0.00000000e+00, 9.32545185e+02, 4.84232068e+02], 
                [0.00000000e+00, 0.00000000e+00, 1.00000000e+00]], dtype=np.float32)
# distortion coefficients
dist = np.zeros((1, 5), dtype=np.float32)

# 3d points
objp = np.array([[0, 0, 0], [15240, 0, 0], [0, 8530, 0], [15240, 8530, 0]], dtype=np.float32)
# 2d points
corners = np.array([[[432, 593]], [[1167, 495]], [[550, 873]], [[1524, 571]]], dtype=np.float32)

# obtain extrinsic
ret, rvec, tvec = cv2.solvePnP(objp, corners, cmx, dist)
axis_img,_ = cv2.projectPoints(axis, rvec, tvec, cmx, dist)

# project a 3d point to image plane using a built-in opencv
uv, _ = cv2.projectPoints(center_3D, rvec, tvec, cmx, dist)

img = cv2.imread("assets/image/camera00_frame_1000.jpg")
draw(img, corners, axis_img)
draw_corners(corners, img, lt=2)
draw_point(img, uv[0], radius=20)
cv2.imwrite("assets/image/camera00_frame_1000_viz.jpg", img)

# Form projection matrix
world2cam = rtvec_to_matrix(rvec, tvec)
cmx_h = np.hstack([cmx, np.zeros((3, 1), dtype=np.float32)])
P = np.dot(cmx_h, world2cam)
print(P)
#  project the same 3d point to image plane using projection matrix
_, center_2D_h, _ = cam.forwardprojectP(center_3D.reshape(3,1), P, (6000, 6000))

# compare the projected points obtained by two methods: cv2.projectPoints vs cam.forwardprojectP
print(np.squeeze(uv))
print(np.squeeze(center_2D_h.reshape(1,3)))
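
The two printed points should agree to within a fraction of a pixel; for example (assuming forwardprojectP returns a homogeneous 2D point that may still need normalizing), a quick check could be:

# Quick consistency check between the two projection methods
uv_cv = np.squeeze(uv)          # from cv2.projectPoints
uv_P = np.squeeze(center_2D_h)  # from cam.forwardprojectP
uv_P = uv_P[:2] / uv_P[2]       # normalize the homogeneous point
assert np.allclose(uv_cv, uv_P, atol=1.0), "projection matrix P is inconsistent"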


I adjusted the width and height in 3D from 15240 mm and 8530 mm to the values below, and I can get the projection matrix below. How did you get the physical size of the playground? Is this for MTMC: NVIDIA Multi-Camera Tracking AI Workflow?

ww = 1524*5/8
wh = 853*5/8
projectionMatrix_3x4_w2p:
  - 1420.7594071572703
  - -216.20613986389156
  - 160.64448242774898
  - 355830.7074724467
  - 230.65644450891378
  - -214.78494330734117
  - 1002.3902449279533
  - 495207.1844719355
  - 0.6448672304670728
  - -0.7439225402699683
  - 0.17528693376291565
  - 833.0137074606193

# the height and radius of the cylinder model
modelInfo:
  height: 205
  radius: 33

I can get the result below:

Dear @kesong ,

Thank you so much for your time.

Could you let me know why you think adjusting the scale could solve the issue?

In the next experiment, I will try many different cameras. Is there a systematic method to adapt the projection matrix obtained by OpenCV to camInfo.yml?
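
To make the question concrete, here is what I currently have in mind (just a sketch; I am assuming the world units of P should be cm to match modelInfo, and that camInfo.yml simply takes the 12 entries of P in row-major order, as in the sample):

import numpy as np

# P from my calibration, with the world coordinates in mm (row-major 3x4)
P_mm = np.array([[1424.720302, -217.99179, 117.305913, 6227815.536202],
                 [247.411798, -315.325942, 971.32802, 8292729.559499],
                 [0.653449, -0.751837, 0.088011, 13956.835683]])

# Changing the world units from mm to cm scales the first three columns by 10
# (a point at X cm is 10*X mm), while the last column stays the same.
P_cm = P_mm @ np.diag([10.0, 10.0, 10.0, 1.0])

# Flatten row-major into the 12-entry list format used in camInfo.yml
for v in P_cm.flatten():
    print(f"  - {v}")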

I got the court dimensions here: https://www.networldsports.com/buyers-guides/basketball-court-dimensions

Yes, I am going to try that. I also would like to use SV3DT.

In addition, I am going to try these two repos (3D pose tracking from multiple views).

Dear @kesong ,

From the doc, I know these are the dimensions of the cylinder model for a person. Would it be possible for you to provide info on how your team came up with those values?

Hello user,

The height and radius in modelInfo are set empirically to values that work best for typical pedestrians in the scene, and their units are [cm]. You can try adjusting these numbers based on the pedestrians in your camera view.

For SV3DT to work, please make sure you are using the same video frame height/width as the original video that you used to get the 3x4 camera matrix. In particular, the frame resolution in the streammux and tracker plugins should be the same as the original video, without any scaling.

Dear @pshin ,

Thanks so much for your info.

Regarding SV3DT, as you can see, the modelInfo (cylinder model) cannot fit all persons in the video. Suppose I use SV3DT in config_tracker_NvDeepSORT.yml, set outputReidTensor: 1, and configure:

application:
  enable-perf-measurement: 1
  perf-measurement-interval-sec: 5
  kitti-track-output-dir: /home/materials/dumpData/mct_bb/kitti_track_dir_path/camL
  reid-track-output-dir: /home/materials/dumpData/mct_bb/reid_track_dir_path/camL

In that case, are the saved bbox coordinates and embeddings those of the original detected bboxes returned from the detector, or of the ones mapped from world to image (shown in the visualization video)?

ReID embeddings are extracted from the original detection bbox, not from the one modified by SV3DT.

Yes, one limitation of SV3DT is that we are using a single cylindrical model for all objects. But small differences are handled by the internal algorithm to give the best foot location.

Dear @pshin ,

Thanks for the info.

Dear @pshin ,

I have integrated SV3DT into config_tracker_NvDeepSORT.yml, but out.mp4 (sink2 in deepstream-app) does not show any bboxes. If I integrate SV3DT into config_tracker_NvDCF_max_perf.yml, it works fine.

Have you evaluated SV3DT with config_tracker_NvDeepSORT.yml?

StateEstimator:
  stateEstimatorType: 3    # the type of state estimator among { DUMMY=0, SIMPLE=1, REGULAR=2 }

  # [Dynamics Modeling]
  noiseWeightVar4Loc: 0.0503   # weight of process and measurement noise for bbox center; if set, location noise will be proportional to box height
  noiseWeightVar4Vel: 0.0037    # weight of process and measurement noise for velocity; if set, velocity noise will be proportional to box height
  useAspectRatio: 1    # use aspect ratio in Kalman filter's observation

ObjectModelProjection:
  cameraModelFilepath: # camera calibration file for each stream
  - camLeftInfo.yml
  outputVisibility: 1 # output visibility by occlusion
  outputFootLocation: 1 # output foot location estimated from 3D model
  outputConvexHull: 1 # output convex hull for each object estimated from 3D cylinder model

SV3DT has been tested with the NvDCF tracker only, so it may not work with NvDeepSORT. I would recommend going with NvDCF_accuracy.yml, because it has better accuracy than NvDeepSORT, and you can get ReID embeddings from it if you want.

Dear @pshin ,

I think I configured the dynamics model wrong; I will copy the StateEstimator group from NvDCF.

As you can see, occlusion happens very often in my case: GX3L_1080pFHR.mp4 - Google Drive

Track swaps also happen when NvDCF is used, so I would like to use NvDeepSORT to reduce swapped tracks. Do you think that would help, or if you have any suggestions, please give me a hint.

Thank you!

Dear @pshin ,

Even if I copy the StateEstimator from NvDCF to NvDeepSORT, it does not work, so I switched to NvDCF.

Regarding the StateEstimator in NvDCF, the default values provided by your team use a larger measurement noise variance for the tracker and a smaller one for the detector. Does this mean you assume the tracker should keep tracking even when the detection model is unreliable? If I have an accurate detection model, should I reduce the values for the tracker?

StateEstimator:
  stateEstimatorType: 3    # the type of state estimator among { DUMMY=0, SIMPLE=1, REGULAR=2 }

  # [Dynamics Modeling]
  processNoiseVar4Loc: 6810.8668    # Process noise variance for bbox center
  processNoiseVar4Size: 1541.8647   # Process noise variance for bbox size
  processNoiseVar4Vel: 1348.4874    # Process noise variance for velocity
  measurementNoiseVar4Detector: 100.0000   # Measurement noise variance for detector's detection
  measurementNoiseVar4Tracker: 293.3238    # Measurement noise variance for tracker's localization
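
To state my understanding of the question above in code: in a 1-D Kalman update the gain is K = P_pred / (P_pred + R), so the measurement with the smaller noise variance R is trusted more. A toy illustration with the default values above (for simplicity, treating the process noise variance as the predicted variance):

# Toy 1-D Kalman update to see how the measurement noise variances weigh
# the detector vs. the tracker measurements (values are the defaults above).
P_pred = 6810.8668  # processNoiseVar4Loc, used here as the predicted variance
for name, R in [("detector", 100.0), ("tracker", 293.3238)]:
    K_gain = P_pred / (P_pred + R)  # closer to 1.0 = that measurement is trusted more
    print(name, round(K_gain, 3))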

Do you get fewer track swaps (i.e., ID switches) when you use NvDeepSORT?

Generally, the tracker params need to be tuned for the specific use case. You can try tuning them manually using the guidelines in Accuracy Tuning Tools — DeepStream documentation

Or, you could try using PipeTuner if you have some relevant dataset: Pipetuner Guide — DeepStream documentation

Dear @pshin ,

Thanks for the info. I understand that I should tune those hyperparameters for my use case. Also, I am going to try PipeTuner.
