Camera projection matrix in Gst-nvtracker Single-View 3D Tracking

I am using single view 3D tracking with gst-nvtracker and following the instructions provided in the documentation:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvtracker.html

According to the documentation, I need to provide a 3x4 projection matrix that maps 3D world points to the image plane.

I have the camera rotation matrix (R), the camera position (t) in world coordinates, and the intrinsic parameters of the camera, so I can construct the required projection matrix.
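
For context, the construction I have in mind is the standard pinhole model; since t here is the camera position in world coordinates, the extrinsic translation would be -R·t. A minimal sketch of that assumption (projection_from_pose and cam_pos are just my own illustrative names, not anything from the SV3DT docs):

import numpy as np

def projection_from_pose(K, R, cam_pos):
    # cam_pos is the camera position in world coordinates, so the
    # extrinsic translation is t = -R @ cam_pos (standard pinhole model)
    t = -R @ cam_pos.reshape(3, 1)
    Rt = np.hstack((R, t))   # 3x4 extrinsic matrix [R | t]
    return K @ Rt            # 3x4 projection matrix P = K [R | t]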

However, one thing is unclear:
It seems that the world coordinate system and its origin can be chosen arbitrarily, which would affect the values of R, t, and consequently the projection matrix.

Are there any assumptions about the world coordinate system that gst-nvtracker expects?
For example, does it assume that the ground plane is at Z = 0 and that the Z-axis points upward?
Are there any other constraints or conventions that I should follow regarding the world coordinate system and origin?
Additionally, when constructing the projection matrix from R and t, what units should t be in? Should it be in meters, pixels, or another unit?

I’d appreciate any clarification on these assumptions.

Thanks in advance!

The unit is cm. You can refer to this topic for generating the projection matrix: SV3DT: projection matrix - #9 by kesong

Thanks so much for your response to my question and the reference you provided.
After reviewing the linked discussion, I still have some unresolved issues, and I would be grateful for further clarification.

I understand that the projection matrix should be defined in centimetres and that it projects 3D world coordinates to pixel coordinates. Below is the process I followed to construct the P matrix and verify its accuracy.

Given the K, R, and t matrices, I create the camera projection matrix using the following:

import numpy as np


def create_projection_matrix(K, R, t):
    # Parameters:
    #   K (numpy.ndarray): camera intrinsic matrix
    #   R (numpy.ndarray): rotation matrix
    #   t (numpy.ndarray): translation vector
    # Returns:
    #   numpy.ndarray: 3x4 projection matrix
    K_copy = K.copy()
    # Zero the principal point; the offsets are added back after projection
    K_copy[0, 2] = 0
    K_copy[1, 2] = 0
    t = t.reshape(3, 1)
    # Create [R | t] (3x4 extrinsic matrix)
    Rt = np.hstack((R, t))
    # Calculate the projection matrix P = K [R | t]
    P = np.matmul(K_copy, Rt)
    return P

I verify that the projection matrix is correct by using it to project known 3D points to image pixels.
I use the following code to apply the projection:

# ph is the homogeneous 3D world point (X, Y, Z, 1)
a = P @ ph
a = a / a[2]

# Add the principal-point offsets
x = xcen + a[0]
y = ycen + a[1]

where xcen, ycen are the principal-point coordinates taken from the K matrix.
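
As a sanity check, I believe this offset step is equivalent to projecting with the unmodified K (since u = fx * Xc / Zc + cx). This is my own check, not something from the documentation, reusing K, R, t, ph, a, xcen, ycen from above:

P_full = np.matmul(K, np.hstack((R, t.reshape(3, 1))))  # original K, principal point kept
b = P_full @ ph
b = b / b[2]
# b[0], b[1] are pixel coordinates directly and should equal
# xcen + a[0] and ycen + a[1] from the snippet above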

The following figure shows the 3D locations of two points on the floor and their projections on the image.
As you can see, the projections, and therefore the P matrix, are correct.
I am using cm as the unit.

I also set the coordinate system such that the floor is at z = 0 and the z-axis points up:

The calculated projection matrix P is:
[[ 3.23724976e+02  1.15928696e+02  7.09877348e+00 -5.58362305e+04]
 [ 9.24083252e+01 -2.44074509e+02 -2.28158279e+02  1.39711982e+04]
 [-2.07319349e-01  6.25012279e-01 -7.52581120e-01  3.32352905e+02]]

So my yml file of the camera parameters looks like this:

projectionMatrix_3x4:
  - 323.72498
  - 115.928696
  - 7.0987735
  - -55836.23
  - 92.408325
  - -244.07451
  - -228.15828
  - 13971.198
  - -0.20731935
  - 0.6250123
  - -0.7525811
  - 332.3529

The height and radius of the cylinder model:

modelInfo:
  height: 205
  radius: 33

My video has a resolution of 640x360 and runs at 30 fps, so in the app txt file I set:

tracker-width=640
tracker-height=360

and also under [streammux]:

width=640
height=360

When running SV3DT with this configuration, I get no detections at all.

What’s very strange is that when using the camera parameters from the supermarket example in the DeepStream reference applications from:

deepstream_reference_apps/deepstream-tracker-3d/README.md at master · NVIDIA-AI-IOT/deepstream_reference_apps · GitHub

I do get detections, although they are not very accurate.

How can it be that I get better results with camera parameters from a different camera than with the actual camera parameters for the scene?

This makes me suspect that something in my setup is incorrect, but I’m not sure what. Given that my projection matrix correctly maps 3D points to pixel coordinates, what could be causing the tracker to fail completely with my parameters? Are there additional constraints or settings I should check?

I’d really appreciate any help in diagnosing what might be wrong with my configuration. Thank you for your time!

Can you share your test video and the script used to generate the projection matrix (including the verification that the projection matrix is correct)? Then I can check on my side. Do you get detections if you disable SV3DT?

Dear Sir,

Thank you for your response and for taking the time to assist me. As requested, I am attaching the following:

  1. The test video I am currently testing with:
    j-transparent.avi

  2. The script I used to generate the projection matrix, including the verification step to ensure it correctly projects 3D points to the image plane.

import numpy as np
import cv2


def create_projection_matrix(K, R, t):

    K_copy = K.copy()  # Create a copy to avoid modifying the original K
    # Zero the principal point; the offsets are added back after projection
    K_copy[0, 2] = 0
    K_copy[1, 2] = 0
    t = t.reshape(3, 1)
    # [R | t] is the 3x4 extrinsic matrix; P = K [R | t]
    Rt = np.hstack((R, t))
    P = np.matmul(K_copy, Rt)
    return P


def project_world_points_on_image(image_path, world_points, P, xcen, ycen, output_path=None, draw_labels=False):

    # Print projection matrix
    print("Projection Matrix:")
    print(P)

    # Load the image
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Could not load image from {image_path}")

    # Print image dimensions
    height, width, channels = img.shape
    print(f"Image dimensions: {width}x{height}, {channels} channels")

    # Initialize array for projected points
    image_points = []

    # Project each world point
    for point in world_points:
        # Homogenize the point (append 1 to get a 4x1 column vector)
        ph = np.append(point, 1).reshape(4, 1)

        # Project using the projection matrix
        a = P @ ph
        a = a / a[2]

        # Add center offsets and ensure we extract scalar values
        x = xcen + float(a[0][0])
        y = ycen + float(a[1][0])

        image_points.append([x, y])

    # Use red color for all points
    point_color = (0, 0, 255)  # Red in BGR
    point_size = 5
    thickness = -1  # Filled circle

    # Draw each projected point
    for i, point_2d in enumerate(image_points):
        x_int, y_int = int(round(point_2d[0])), int(round(point_2d[1]))

        # Draw circle for the point
        cv2.circle(img, (x_int, y_int), point_size, point_color, thickness)



    # Display the image
    cv2.imshow('Image with Projected Points', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

    # Save the image if output path is provided
    if output_path:
        cv2.imwrite(output_path, img)
        print(f"Image with projected points saved to {output_path}")

    return image_points


def main():

    K = np.array([
        [330.76614, 0.0, 321.2513],
        [0.0, 330.76614, 163.69617],
        [0.0, 0.0, 1.0]
    ], dtype=np.float32)

    R = np.array([
        [0.8371958, -0.5355462, -0.11087569],
        [-0.38578108, -0.4345865, -0.8138228],
        [0.38765463, 0.7241028, -0.5704376]
    ], dtype=np.float32)

    t = np.array([-84.436005, 144.8109, 300.66653], dtype=np.float32)

    # create the projection matrix
    P = create_projection_matrix(K, R, t)
    print("Projection matrix")
    print(P)



    # Set to True to run the projection/verification step below
    Project3DPoints = False

    if Project3DPoints:
        # Get the principal point (center offsets) from the K matrix
        xcen = K[0, 2]
        ycen = K[1, 2]

        # Define world points
        world_points = 100*np.array([
            [2.49, 2.0, 0],
            [2.82, 0.54, 0],
            [0, 1.242, 0.75],
            [0.934, 3.0296, 0.95]
        ], dtype=np.float32)

        # Path to your image
        image_path = r"C:\Users\AsafShim\Downloads\j-transparent.jpg"
        output_path = "projected_points.jpg"

        # Project points on the image using precomputed projection matrix
        project_world_points_on_image(image_path, world_points, P, xcen, ycen, output_path, draw_labels=False)


if __name__ == "__main__":
    main()

  3. Two images showing the 3D locations of key points and their corresponding projections on the image.

  4. A video showing the current detections, where you can see that the bounding boxes are too large and incorrect, and that the IDs of the detected people change frequently, making tracking unstable.

Regarding your suggestion, I did try running without SV3DT by setting the following in deepstream_app.txt:

[tracker]
enable=0

This indeed fixed the bounding box issue, making them correctly aligned with the detected people. However, this is obviously not a solution for me, since I need 3D tracking and the tracker to work properly.

Thank you again for your help!

We recommend using OpenCV’s camera calibration method described in: SV3DT: projection matrix - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums. There is a script there to generate the projection matrix. You need 4 points in both 3D and 2D to get the camera extrinsic matrix. You can verify the generated projection matrix with the video in that topic using: deepstream_reference_apps/deepstream-tracker-3d/README.md at master · NVIDIA-AI-IOT/deepstream_reference_apps · GitHub

Thanks for your suggestion regarding using OpenCV’s camera calibration method to generate the projection matrix. I have implemented it and tested the results. Unfortunately, I am still facing the same issue.

To ensure correctness, I am attaching the following:

The Python script I used to generate the projection matrix. The script includes both:

A function to compute the projection matrix from four 3D-2D correspondences.
A verification function that projects the 3D points onto the image to check the accuracy of the matrix.

import numpy as np
import cv2


def project_world_points_on_image(image_path, world_points, P, output_path):


    img = cv2.imread(image_path)
    image_points = []

    # Project each world point
    for point in world_points:
        ph = np.append(point, 1).reshape(4, 1)
        a = P @ ph
        a = a / a[2]
        x = float(a[0][0])
        y = float(a[1][0])
        image_points.append([x, y])


    point_color = (0, 0, 255)
    point_size = 5
    thickness = -1


    for i, point_2d in enumerate(image_points):
        x_int, y_int = int(round(point_2d[0])), int(round(point_2d[1]))
        cv2.circle(img, (x_int, y_int), point_size, point_color, thickness)
    cv2.imshow('Image with Projected Points', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

    if output_path:
        cv2.imwrite(output_path, img)

def compute_projection_matrix(K, points_3D, points_2D):

    # Assume no lens distortion
    dist_coeffs = np.zeros((4, 1))
    # SOLVEPNP_P3P requires exactly 4 point correspondences
    success, rvec, tvec = cv2.solvePnP(points_3D, points_2D, K, dist_coeffs, flags=cv2.SOLVEPNP_P3P)
    if not success:
        raise RuntimeError("solvePnP failed to recover the camera pose")

    # Convert the rotation vector to a rotation matrix
    R, _ = cv2.Rodrigues(rvec)

    # Concatenate [R | t] to form the 3x4 extrinsic matrix
    Rt = np.hstack((R, tvec))

    # Compute the projection matrix: P = K * [R | t]
    P = np.dot(K, Rt)

    return P


K = np.array([
        [330.76614, 0.0, 321.2513],
        [0.0, 330.76614, 163.69617],
        [0.0, 0.0, 1.0]
], dtype=np.float32)


points_3D = np.array([
    [2.48, 2, 0],
    [2.82, 0.542, 0],
    [0.9346, 3.029, 0.95],
    [2.0695, 1.9345, 0.95]
]) * 100  # Multiply by 100 to be in cm


points_2D = np.array([
    [328, 139],
    [405, 171],
    [203, 97],
    [300, 95]
], dtype=np.float64)


P = compute_projection_matrix(K, points_3D, points_2D)

print("Projection Matrix:\n", P)

print("projectionMatrix_3x4:")
for row in P:
    print("  -", list(row)[0])
    print("  -", list(row)[1])
    print("  -", list(row)[2])
    print("  -", list(row)[3])

image_path = r"C:\Users\AsafShim\Downloads\j-transparent.jpg"
output_path = "projected_points_cv.jpg"
project_world_points_on_image(image_path, points_3D, P, output_path)

This image shows the 3D points in world coordinates and the corresponding manually marked 2D points on the image.

I verified that when using the projection matrix to project the 3D points, they align exactly with their corresponding pixel locations in the image.

This is the projection matrix I placed in the .yml file.

projectionMatrix_3x4_w2p:
  - 398.2512522753727
  - 62.09416837270102
  - -223.9395048330216
  - 69863.75993218269
  - -67.27429151301806
  - -26.7476309597462
  - -361.8859086935327
  - 98758.5434165381
  - 0.3695215531960892
  - 0.7299604267387418
  - -0.5749883452027038
  - 314.34493574459873

The height and radius of the cylinder model:

modelInfo:
  height: 205
  radius: 33

Here is the result video. As you can see, the bounding boxes remain too large and incorrect, and the ID assignment is still unstable (people frequently change IDs).

Since the verification process confirms that the projection matrix is correct, but the tracker still fails, I suspect something else might be affecting the tracking. Do you have any insights on additional parameters or settings that might influence SV3DT’s behavior?

I would appreciate any further guidance on what could be causing this issue. Thanks again for your help!

You can tune “modelInfo” as below:

modelInfo:
  height: 175
  radius: 28

Here is the output on my side:

Thank you very much for your help! I tested the new modelInfo parameters you suggested (height: 175, radius: 28), and I can confirm that the results look much better now. I really appreciate your guidance.

I do have a few additional questions to better understand how to fine-tune these parameters and optimize the tracking:

  1. How should these parameters be determined?

    Do height and radius represent the approximate dimensions of a person?
    If so, why was a height of 205 cm used in the supermarket example, which seems quite tall?

  2. What are the model’s assumptions regarding the coordinate system?
    Does the model assume that the Z-axis points upward and that the floor is at Z = 0?
    Or is it simply that as long as the projection matrix correctly maps 3D world points to image coordinates, the tracking should work regardless of the coordinate system?

  3. Are there any additional constraints I should be aware of?
    Does the model make any assumptions about the video’s FPS?
    Should I adjust any parameters when running on a 30 FPS video compared to 60 FPS?

  4. Are there any other parameters worth adjusting that could further improve performance?
    If so, which parameters would you recommend experimenting with?

Again, I really appreciate your support, and thank you for your time!

The “modelInfo” parameters were determined during our testing. Customers can tune them based on their use case. I don’t think the values are related to FPS. One limitation is that we can only set a single “modelInfo” for all people, even though people are not all the same height. Let us know if you run into any issues in your project when using SV3DT.
