Get object center pose in camera frame / viewport

Hello,

Is there some sort of function or code to retrieve the center position of an object, not in world coordinates, but in my 2D image? I know how to retrieve the world pose, but I have yet to find a solution to get the object's center position for my neural network training. Is there anything existing that either gives me the object center pose of my asset directly, or some kind of factor I can use to convert from world position into camera pixels?

Thank you!

Hi there,

Using the cameraViewTransform and the cameraProjection from the camera_params annotator, you can transform the object's world coordinates to camera coordinates by multiplying the object's pose with the view transform matrix (cameraViewTransform is the world-to-camera transform).

You can then apply the camera projection by multiplying the resulting camera coordinates with the projection matrix, which projects the 3D coordinate onto the 2D image plane. This results in a 4D coordinate vector [x, y, z, w] (where w is the scaling factor).

To get the position of the object within the camera's view frustum, with respect to the camera's coordinate system, divide [x, y, z, w] by w, resulting in the normalized device coordinates [x', y', z'].

For the pixel coordinates [u, v], map the normalized coordinates from the [-1, 1] range to the image resolution (flipping y to match the image coordinate convention):

u = (x' + 1) * image_width / 2
v = (1 - y') * image_height / 2
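
In NumPy, these steps look roughly like this (an illustrative sketch, assuming the row-major matrices returned by the camera_params annotator, with the point on the left side of the multiplication):

    import numpy as np

    def world_point_to_pixel(world_point, camera_view_transform, camera_projection, image_width, image_height):
        # Homogeneous world-space point (row vector)
        point_h = np.array([*world_point, 1.0])

        # Row-major matrices from the camera_params annotator (reshape the flat arrays)
        view = np.asarray(camera_view_transform).reshape(4, 4)
        projection = np.asarray(camera_projection).reshape(4, 4)

        # World -> camera -> clip space (point on the left side for row-major matrices)
        clip = point_h @ view @ projection

        # Perspective divide: normalized device coordinates (NDC) in [-1, 1]
        ndc = clip[:3] / clip[3]

        # Map NDC to pixel coordinates (y flipped to the image convention)
        u = (ndc[0] + 1) * image_width / 2
        v = (1 - ndc[1]) * image_height / 2
        return u, v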

Hi @ahaidu, I ran the following code:

    writer = rep.WriterRegistry.get("BasicWriter")
    writer.initialize(
        output_dir="F:\\ProjectCodes\\omniverse\\testAnnotator2",
       # bbox_height_threshold=25,
        #fully_visible_threshold=0.95,
        # omit_semantic_type=True,
        bounding_box_3d=True,
        bounding_box_2d_tight=True,
        rgb=True,
        distance_to_image_plane=True,
        camera_params=True,
        occlusion=True
    )

and I get rgb_0000.png, camera_params_0000.json, and bounding_box_3d_0000.npy.
The "cameraViewTransform" and "cameraProjection" are in camera_params_0000.json, and the 3D coordinates and "transform" are in bounding_box_3d_0000.npy. I ran the code below in Python with NumPy, but I can't get the correct image annotation. Please help, thank you.

np.dot(cameraProjection, np.dot(cameraViewTransform, np.dot(trans, point_3d)))

test.zip (1.4 MB)

Hi everyone,
I have the same problem as asked in this thread. Specifically, I am trying to reproject the corners of a 3D bounding box annotation generated with Replicator onto a 2D image. I followed your suggestions, but the projected points are wrongly positioned. I have attached a copy of my code; please let me know what I am doing wrong.

import numpy as np
import cv2

def from_3d_to_2d(bounding_box_3D, cameraViewTransform, cameraProjection, image_width, image_height, image):

    # Initialize lists to store the u and v coordinates
    u_values = []
    v_values = []
        
    x_min = bounding_box_3D[1]
    y_min = bounding_box_3D[2]
    z_min = bounding_box_3D[3]
    x_max = bounding_box_3D[4]
    y_max = bounding_box_3D[5]
    z_max = bounding_box_3D[6]
    
    M = np.array([bounding_box_3D[7]]).reshape(4,4)
        
    corners = np.array([ 
               [x_min, y_min, z_min],
               [x_min, y_max, z_min],
               [x_max, y_min, z_min],
               [x_max, y_max, z_min],
               [x_min, y_min, z_max],
               [x_min, y_max, z_max],
               [x_max, y_min, z_max],
               [x_max, y_max, z_max],
    ])

    # Iterate over each corner of the bounding box
    for corner in corners:
                
        # Convert the corner to homogeneous coordinates
        corner_homogeneous = np.append(corner, 1).reshape(1,4) 
                
        # Express the corner in camera coordinates
        camera_coordinates = corner_homogeneous @ np.linalg.inv(M) @ np.linalg.inv(cameraViewTransform)
        
        # Apply the camera projection
        projected_coordinates = camera_coordinates @ cameraProjection 
        projected_coordinates = projected_coordinates[0]
        
        # The previous step is equivalent to the following:
        # projected_coordinates_new = np.dot(cameraProjection.T,np.dot(np.linalg.inv(cameraViewTransform.T),np.dot(np.linalg.inv(M.T), corner_homogeneous.T)))
                                        
        # Normalize the coordinates
        normalized_coordinates = projected_coordinates / (projected_coordinates[3])
        normalized_coordinates = normalized_coordinates[:3]

        # Convert to pixel coordinates
        u = normalized_coordinates[0] / normalized_coordinates[2]
        v = normalized_coordinates[1] / normalized_coordinates[2]

        # Scale the coordinates with the image resolution
        u *= image_width
        v *= image_height

        # Add the u and v coordinates to the lists
        u_values.append(u)
        v_values.append(v)
        
    for u_value,v_value in zip(u_values,v_values):
        cv2.circle(image, (int(u_value), int(v_value)), 5, (0, 255, 0), 2)

    img_path = "put/your/image/path/here"    
    
    cv2.imwrite(img_path, image)
    
    image = cv2.imread(img_path, cv2.IMREAD_COLOR)
    
    return image

Hi all,

this code snippet is part of the DataVisualizationWriter that will come out with Isaac Sim 2023.1.1; it draws 3D bounding boxes on top of an image:

where:

  • bb_annot_name="bounding_box_3d_fast"
  • draw = ImageDraw.Draw(background_img) (e.g. the rgb image)
  • data is the dictionary from the write function def write(self, data: dict)

    def _draw_3d_bounding_boxes(
        self, draw: ImageDraw, data: dict, bb_annot_name: str, render_product_name: str, write_params: dict
    ):
        # Get the 3d bbox data from the annotator
        annot_data = data[bb_annot_name]["data"]

        # Access the camera parameters
        multiple_render_products = len(self._render_product_names) > 1
        camera_params_annot_name = (
            "camera_params" if not multiple_render_products else f"camera_params-{render_product_name}"
        )

        # Transpose is needed for the row-column-major conversion
        cam_view_transform = data[camera_params_annot_name]["cameraViewTransform"].reshape((4, 4))
        cam_view_transform = cam_view_transform.T
        cam_projection_transform = data[camera_params_annot_name]["cameraProjection"].reshape((4, 4))
        cam_projection_transform = cam_projection_transform.T

        # The resolution is used to map the Normalized Device Coordinates (NDC) to screen space
        screen_width, screen_height = data[camera_params_annot_name]["renderProductResolution"]

        # Get the line draw parameters
        line_color = "green" if "fill" not in write_params else write_params["fill"]
        line_width = 1 if "width" not in write_params else write_params["width"]

        # Iterate the bounding boxes and draw the edges
        for bbox_data in annot_data:
            # ('semanticId', '<u4'), ('x_min', '<f4'), ('y_min', '<f4'), ('z_min', '<f4'), ('x_max', '<f4'), ('y_max', '<f4'), ('z_max', '<f4'), ('transform', '<f4', (4, 4)), ('occlusionRatio', '<f4')
            # Bounding box points in local coordinate system
            x_min, y_min, z_min, x_max, y_max, z_max = (
                bbox_data[1],
                bbox_data[2],
                bbox_data[3],
                bbox_data[4],
                bbox_data[5],
                bbox_data[6],
            )

            # Transformation matrix from local to world coordinate system
            local_to_world_transform = bbox_data[7]
            local_to_world_transform = local_to_world_transform.T

            # Calculate all 8 vertices of the bounding box in local space
            vertices_local = [
                np.array([x_min, y_min, z_min, 1]),
                np.array([x_min, y_min, z_max, 1]),
                np.array([x_min, y_max, z_min, 1]),
                np.array([x_min, y_max, z_max, 1]),
                np.array([x_max, y_min, z_min, 1]),
                np.array([x_max, y_min, z_max, 1]),
                np.array([x_max, y_max, z_min, 1]),
                np.array([x_max, y_max, z_max, 1]),
            ]

            # Transform vertices to world, camera, and screen space
            vertices_screen = []
            for vertex in vertices_local:
                # Transform to world space
                world_homogeneous = np.dot(local_to_world_transform, vertex)
                # Transform to camera space
                camera_homogeneous = np.dot(cam_view_transform, world_homogeneous)
                # Projection transformation
                clip_space = np.dot(cam_projection_transform, camera_homogeneous)
                # Normalized Device Coordinates (NDC)
                ndc = clip_space[:3] / clip_space[3]
                # Map NDC to screen space
                screen_point = ((ndc[0] + 1) * screen_width / 2, (1 - ndc[1]) * screen_height / 2)
                vertices_screen.append(screen_point)

            # Draw the bounding box edges
            draw.line([vertices_screen[0], vertices_screen[1]], fill=line_color, width=line_width)
            draw.line([vertices_screen[0], vertices_screen[2]], fill=line_color, width=line_width)
            draw.line([vertices_screen[0], vertices_screen[4]], fill=line_color, width=line_width)
            draw.line([vertices_screen[1], vertices_screen[3]], fill=line_color, width=line_width)
            draw.line([vertices_screen[1], vertices_screen[5]], fill=line_color, width=line_width)
            draw.line([vertices_screen[2], vertices_screen[3]], fill=line_color, width=line_width)
            draw.line([vertices_screen[2], vertices_screen[6]], fill=line_color, width=line_width)
            draw.line([vertices_screen[3], vertices_screen[7]], fill=line_color, width=line_width)
            draw.line([vertices_screen[4], vertices_screen[5]], fill=line_color, width=line_width)
            draw.line([vertices_screen[4], vertices_screen[6]], fill=line_color, width=line_width)
            draw.line([vertices_screen[5], vertices_screen[7]], fill=line_color, width=line_width)
            draw.line([vertices_screen[6], vertices_screen[7]], fill=line_color, width=line_width)

The center and the bounding boxes are correct, but this does not give the rotation matrix of the objects in the camera coordinate system to project on top of the 2D image. A 3x3 rotation matrix is required to get the ground-truth pose of the objects. Below is an image of what I am trying to visualize.

[attached image: desired object pose axes overlaid on the 2D image]

import numpy as np
import cv2
import json
import os
import matplotlib.pyplot as plt

def project2d(intrinsic, point3d):
        return (intrinsic @ (point3d / point3d[2]))[:2]


axis_len = 30

image_path = '/Test_Prop_test/000013.png' 

json_path = '/Test_Prop_test/camera_params_000013.json'
npy_path = '/Test_Prop_test/bounding_box_3d_000013.npy'

img = cv2.imread(image_path)

image= np.array(img)
height, width = image.shape[:2]

cam_intrinsic = np.array([
                            [641.37, 0.0, 320],
                            [0.0, 659.22, 240],
                            [0.0, 0.0, 1.0]])

with open(json_path, 'r') as file:

        data = json.load(file)
        camera_view_transform = data["cameraViewTransform"]
        camera_view_transform = np.array(camera_view_transform).reshape(4, 4)
        T_cam_world = camera_view_transform.T
        print("T_cam_world : ", T_cam_world )

        camera_projection = data["cameraProjection"]
        camera_projection = np.array(camera_projection).reshape(4, 4)
        camera_projection = camera_projection.T

BBtrans = np.load(npy_path, allow_pickle=True).tolist()

for entry in BBtrans:
        obj_id = entry[0]+1
        transformation_matrix = entry[7]
        T_obj_to_world = transformation_matrix.T

        if obj_id == 1:
                # Calculate T_world_obj by taking inverse of T_obj_world
                T_world_to_obj = np.linalg.inv(T_obj_to_world)
                # Calculate T_world_to_cam by taking the inverse of T_cam_to_world
                T_world_to_cam = np.linalg.inv(T_cam_world)
                T_cam_to_obj = T_world_to_obj @ T_world_to_cam
                RT = np.linalg.inv(T_cam_to_obj)
                RT_centre = RT[:3, -1]
                RT_centre = cam_intrinsic @ (RT_centre / RT_centre[2])


                RT_centre = RT_centre[:2]
                print("obj_center: ", RT_centre)
                cv2.circle(img, tuple(RT_centre.astype(int)), radius=2, color=(0, 255, 0), thickness=5)

                rgb_colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]
                for j in range(3):
                        obj_xyz_offset_2d = project2d(cam_intrinsic, RT[:3, -1] + RT[:3, j] * 0.001)
                        obj_axis_endpoint = RT_centre + (obj_xyz_offset_2d - RT_centre) / np.linalg.norm(obj_xyz_offset_2d - RT_centre) * axis_len
                        print("obj_xyz_offset_2d: ", obj_xyz_offset_2d)
                        print("obj_axis_endpoint", obj_axis_endpoint)
                        cv2.arrowedLine(img, (int(RT_centre[0]), int(RT_centre[1])), (int(obj_axis_endpoint[0]), int(obj_axis_endpoint[1])), rgb_colors[j], thickness=2, tipLength=0.15)


img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img_rgb)
plt.title('Image with Arrow')
plt.axis('off')
plt.show()

Desired / correct pose from the viewport of Isaac Sim:

Hi there,

the issue might be the transform multiplications due to the row-major representation of the matrices; you should basically either keep the point on the left side of the multiplication or multiply with the transpose:

point_camera = point_homogeneous @ view_matrix # row-major with the point on the left side of the multiplication
# OR
point_camera = view_matrix.T @ point_homogeneous  # column-major alternative approach with the transpose

Here are two helper functions that do this:

    # Project a 3D point in world coordinates into 2D screen coordinates
    def _project_world_point_to_screen(self, world_point, view_matrix, projection_matrix, screen_size):
        # Convert the 3D point to homogeneous coordinates (if not already in that form)
        if len(world_point) == 4:
            point_homogeneous = np.array(world_point)
        else:
            point_homogeneous = np.array([*world_point, 1.0])

        # Transform to camera frame (row-major representation where the translation vector is on the left side of the multiplication)
        point_camera = point_homogeneous @ view_matrix
        # point_camera = view_matrix.T @ point_homogeneous  # column-major alternative approach with the transpose

        # Apply the projection matrix to project into screen coordinates
        point_screen = point_camera @ projection_matrix
        # point_screen = np.dot(projection_matrix.T, point_camera)  # column-major alternative approach with the transpose

        # Normalize to NDC (Normalized Device Coordinates) by dividing x, y, z by w. Needed for 3D to 2D projection across various screen sizes/aspect ratios.
        point_screen_normalized = point_screen / point_screen[3]

        # Map NDC to screen coordinates. Adjust x and y for screen dimensions, flipping y to match screen's coordinate system.
        x = (point_screen_normalized[0] + 1) * screen_size[0] / 2
        y = (1 - point_screen_normalized[1]) * screen_size[1] / 2

        return int(x), int(y)


    # Draws the object's local frame axes at the object's world location
    def _draw_local_frame_axes(
        self, draw, local_to_world_transform, camera_view_matrix, camera_projection_matrix, screen_size, axes_length=0.2
    ):
        # Define the end points of the local coordinate system axes
        x_axis_end_point_local = np.array([axes_length, 0, 0, 1])
        y_axis_end_point_local = np.array([0, axes_length, 0, 1])
        z_axis_end_point_local = np.array([0, 0, axes_length, 1])

        # Transform local end points to world frame using row-major matrix multiplication (translation on the left side)
        x_axis_end_point_world = x_axis_end_point_local @ local_to_world_transform
        y_axis_end_point_world = y_axis_end_point_local @ local_to_world_transform
        z_axis_end_point_world = z_axis_end_point_local @ local_to_world_transform

        # Define a partial helper function (requires: from functools import partial) to project 3D world points to 2D screen points
        project_to_screen = partial(
            self._project_world_point_to_screen,
            view_matrix=camera_view_matrix,
            projection_matrix=camera_projection_matrix,
            screen_size=screen_size,
        )

        # Extract world location from the row-major transform matrix (last row)
        origin_world = local_to_world_transform[3]
        # Project the origin and axes end points from 3D world coordinates to 2D screen coordinates
        origin_2d = project_to_screen(origin_world)
        x_axis_end_2d = project_to_screen(x_axis_end_point_world)
        y_axis_end_2d = project_to_screen(y_axis_end_point_world)
        z_axis_end_2d = project_to_screen(z_axis_end_point_world)

        # Draw the 3D axes on the 2D screen using lines with appropriate colors for each axis
        draw.line([origin_2d, x_axis_end_2d], fill="red", width=2)  # X-axis in red
        draw.line([origin_2d, y_axis_end_2d], fill="green", width=2)  # Y-axis in green
        draw.line([origin_2d, z_axis_end_2d], fill="blue", width=2)  # Z-axis in blue
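
As a small usage sketch (my own, not part of the writer code), the same math also answers the original question of this thread, i.e. projecting the 3D bounding box center to pixel coordinates:

    import numpy as np

    # Hypothetical helper: project the center of one bounding_box_3d entry to pixel coordinates.
    # bbox_data is one entry of the bounding_box_3d annotator output, view_matrix / projection_matrix
    # are the flat row-major cameraViewTransform / cameraProjection, and the resolution comes from
    # renderProductResolution.
    def bbox_center_to_pixel(bbox_data, view_matrix, projection_matrix, image_width, image_height):
        x_min, y_min, z_min, x_max, y_max, z_max = (bbox_data[i] for i in range(1, 7))
        local_to_world_tf = np.asarray(bbox_data[7]).reshape(4, 4)  # row-major local-to-world transform
        view = np.asarray(view_matrix).reshape(4, 4)
        projection = np.asarray(projection_matrix).reshape(4, 4)

        # Bounding box center in local coordinates (homogeneous row vector)
        center_local = np.array([(x_min + x_max) / 2, (y_min + y_max) / 2, (z_min + z_max) / 2, 1.0])

        # Row-major chain: local -> world -> camera -> clip space (point on the left side)
        clip = center_local @ local_to_world_tf @ view @ projection

        # Perspective divide to NDC, then map NDC to pixel coordinates (y flipped)
        ndc = clip[:3] / clip[3]
        u = (ndc[0] + 1) * image_width / 2
        v = (1 - ndc[1]) * image_height / 2
        return int(u), int(v)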

[result overlay images: 0001_overlay, 0002_overlay]


Thank you very much, Mr. ahaidu. You are a genius.

One last query:

From the script you just sent, how can I get the rotation matrix from the object to the camera frame, similar to the translation vector (i.e. point_camera) that you transform to the camera frame? Is it possible?

        # Transform to camera frame (row-major representation where the translation vector is on the left side of the multiplication)
        point_camera = point_homogeneous @ view_matrix

For example: the desired rotation matrix should be orthonormal, with the norm of each column equal to one.

  "cam_R_m2c": [
    0.8180317284610569,  -0.5726349904075037, 0.05397461432945827,
    -0.32806818289014184, -0.5416101454156569, -0.773970101333373,
    0.47243556026846245, 0.6154247461256569, -0.6309176041687177
  ],
  "cam_t_m2c": [
    0.8948040944340092,
    -0.16644558774572973,
    0.536188038792591
  ]

The cameraViewTransform from the camera_params annotator contains the world-to-camera transformation matrix.

From that transform you can access the rotation matrix by taking the 3x3 block (mat[:3, :3]):

            # World to camera transform (row-major) (transform a point from world coordinate to camera coordinate)
            world_to_camera_tf = camera_params["cameraViewTransform"].reshape(4, 4)

            # Object world space to camera frame transform (row-major)
            obj_to_camera_tf = world_to_camera_tf @ local_to_world_tf
            obj["object_to_camera_transform"] = obj_to_camera_tf.tolist()

            # Extract camera frame location (last row) and rotation matrix (3x3) from the row-major transform matrix
            location_camera_frame = obj_to_camera_tf[3, :3]
            obj["location_camera_frame"] = location_camera_frame.tolist()
            rotation_matrix_camera_frame = obj_to_camera_tf[:3, :3]
            obj["rotation_matrix_camera_frame"] = rotation_matrix_camera_frame.tolist()

Thank you very much again Mr. ahaidu.

Please update the annotator documentation. Typo: cameraViewTransform is described there as the camera-to-world transformation matrix, whereas you stated it is the world-to-camera transform.

        # World to camera transform (row-major) (transform a point from world coordinate to camera coordinate)
        world_to_camera_tf = camera_params["cameraViewTransform"].reshape(4, 4)

with open(json_path, 'r') as file:
        data = json.load(file)

        camera_view_transform = data["cameraViewTransform"]
        camera_view_transform = np.array(camera_view_transform).reshape(4, 4)
        T_cw = camera_view_transform

        camera_projection = data["cameraProjection"]
        camera_projection = np.array(camera_projection).reshape(4, 4)

BBtrans = np.load(npy_path, allow_pickle=True).tolist()
for entry in BBtrans:
        transformation_matrix = entry[7]
        T_wo = transformation_matrix

        T_co = T_cw @ T_wo
        print(T_co)
        T_co = T_co[:3,:3]

Output is the object-to-camera transform below. The rotation does not have norm 1 in each column, so it is not the same as the "cam_R_m2c" I posted, i.e. not orthonormal.

[[-0.00321441 -0.00882502 -0.00343316 0. ]
[ 0.00893321 -0.00162352 -0.00419071 0. ]
[ 0.00314093 -0.00441398 0.00840544 0. ]
[ 0.02578111 0.00803663 1.3932508 1. ]]

The bounding box 3D annotator transformation matrix is not orthogonal. Any idea why that is?

The rotation matrices seem to be orthogonal:

import asyncio
import numpy as np
import omni.usd
import omni.replicator.core as rep

omni.usd.get_context().new_stage()

cam = rep.create.camera(position=(1, 0, 0), look_at=(0, 0, 0))
rp = rep.create.render_product(cam, (512, 512))
rep.create.cube(position=(0.1, 0.2, 0.3), rotation=(10, 15, 30), semantics=[("class", "cube")])
cam_params_annot = rep.annotators.get("camera_params")
cam_params_annot.attach(rp)

bb_3d_annot = rep.annotators.get("bounding_box_3d_fast")
bb_3d_annot.attach(rp)

def check_if_matrix_is_orthogonal(M):
    det = np.linalg.det(M)
    if not np.isclose(abs(det), 1):
        print(f"Matrix is not orthogonal, determinant={det}")
    
    M_T = M.transpose()
    identity_matrix = np.identity(M.shape[0])
    if np.allclose(np.dot(M, M_T), identity_matrix, atol=1.e-6):
        print(f"Matrix is orthogonal, M*M_T={np.dot(M, M_T)}")
    else:
        print(f"Matrix is NOT orthogonal, M*M_T={np.dot(M, M_T)}")

async def get_camera_params_async():
    await rep.orchestrator.step_async()
    camera_params = cam_params_annot.get_data()
    camera_view_tf = camera_params["cameraViewTransform"].reshape(4, 4)
    camera_view_rot = camera_view_tf[:3, :3]
    print(f"Check if camera_view_rot is orthogonal: {camera_view_rot}")
    check_if_matrix_is_orthogonal(camera_view_rot)

    bounding_box_3d = bb_3d_annot.get_data()
    for bbox in bounding_box_3d["data"]:
        local_to_world_tf = bbox["transform"].reshape(4, 4)
        local_to_world_rot = local_to_world_tf[:3, :3]
        print(f"Check if local_to_world_rot is orthogonal: {local_to_world_rot}")
        check_if_matrix_is_orthogonal(local_to_world_rot)

asyncio.ensure_future(get_camera_params_async())

Hi!

With the following information from the annotators:

  1. 3D bounding box transformation
  2. camera_transformation
  3. camera_projection matrix

Using the helper functions provided by @ahaidu, it seems complicated to calculate this RT matrix.

How can I get the RT matrix mentioned in the script attached below using the above information from the Isaac Sim annotators?
Note: 'RT' is a NumPy float64 array of shape (3, 4) containing the homogeneous transformation matrix per object in the camera coordinate frame.

            RT = np.linalg.inv(RT)                
            RTs[idx] = RT[:3]
            center_homo = cam_intrinsic @ RT[:3, [3]]
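
As a starting point, I would try to assemble such a (3, 4) RT from the annotator outputs like this (a sketch only, reusing the transpose-to-column-major convention from the DataVisualizationWriter snippet above; the function name is my own):

    import numpy as np

    # Sketch: column-major (3, 4) [R | t] matrix taking object coordinates into the camera frame.
    # local_to_world_tf is the (4, 4) row-major "transform" from the bounding_box_3d annotator,
    # camera_view_tf is the (4, 4) row-major cameraViewTransform (world -> camera).
    def object_rt_in_camera_frame(local_to_world_tf, camera_view_tf):
        local_to_world = np.asarray(local_to_world_tf).reshape(4, 4).T   # column-major object -> world
        world_to_camera = np.asarray(camera_view_tf).reshape(4, 4).T     # column-major world -> camera

        # Column-major composition: object -> camera
        obj_to_camera = world_to_camera @ local_to_world

        # Top three rows: rotation (3x3, includes any object scale) and translation (last column)
        return obj_to_camera[:3, :]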

When we scale down the object, the local-to-world transformation matrix is no longer orthogonal. Just replace the line below in your provided script and you can observe it:

rep.create.cube(position=(0.1, 0.2, 0.3), rotation=(10, 15, 30), scale=0.5, semantics=[("class", "cube")])
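
One workaround I am considering (my own sketch, assuming positive per-axis scale and no shear) is to strip the scale from the 3x3 block again to recover an orthonormal rotation:

    import numpy as np

    # With scale applied, the 3x3 block of the row-major local-to-world transform is scale * rotation.
    # Each row is a scaled local basis axis expressed in world coordinates, so normalizing the rows
    # removes the per-axis scale (assuming positive scale and no shear).
    def rotation_without_scale(local_to_world_tf):
        rot_scaled = np.asarray(local_to_world_tf).reshape(4, 4)[:3, :3]
        scales = np.linalg.norm(rot_scaled, axis=1)
        return rot_scaled / scales[:, None]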

What is the maximum number of frames we can generate with Replicator at once?
Is there any source or document that specifies some numbers?

Hardware specifications:
NVIDIA GeForce RTX 3090 GPU with a boost clock of 1.7 GHz and
24 GB of memory