Insight into world to camera transform for 3D bounding box

Mr. hclever

Any idea about my queries?


I’ll take a look today -

If you multiply two valid transform matrices together, it should return a valid transform matrix. What are T_cw and T_wo in this example?

It’s also possible the world-object transform has a scaling factor in it. The rotation component can carry the scale in each direction, so if you normalize each column to unit length, the normalization factor for each column is the scale along that axis.
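As a quick illustration of that idea (a sketch, assuming the usual column-vector convention with no shear), the per-axis scale can be read off by taking the norm of each column of the 3x3 block:

```python
import numpy as np

def extract_scale(T):
    """Return the per-axis scale factors hidden in the 3x3 block of a
    4x4 transform (column-vector convention, no shear assumed)."""
    R = np.asarray(T)[:3, :3]
    return np.linalg.norm(R, axis=0)  # norm of each column

# Example: a transform that scales x by 2 and z by 0.5
T = np.diag([2.0, 1.0, 0.5, 1.0])
print(extract_scale(T))  # -> [2.  1.  0.5]
```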


T_cw is the camera view transform matrix, and T_wo (the object-to-world transformation matrix) is the bounding box annotator's transformation matrix.

T_wc: [[ 8.30862047e-01 6.67165675e-02 -5.52464622e-01 0.00000000e+00]
[-5.56478444e-01 9.96125985e-02 -8.24869124e-01 0.00000000e+00]
[-9.02056208e-17 9.92787102e-01 1.19890659e-01 0.00000000e+00]
[-2.30302610e-02 -1.44196906e+00 -5.70784420e-01 1.00000000e+00]]

camera_projection: [[ 2.00429483e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 2.67239311e+00 0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 9.99999988e-09 -1.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 9.99999988e-03 0.00000000e+00]]

T_wo: [[-2.1821405e-04 -4.1810372e-03 9.0813711e-03 0.0000000e+00]
[-1.0132407e-03 9.0460125e-03 4.1404110e-03 0.0000000e+00]
[-9.9461414e-03 -8.2981185e-04 -6.2103639e-04 0.0000000e+00]
[ 3.0821344e-02 4.6327435e-03 1.4499093e+00 1.0000000e+00]]

It seems like it depends on the world object transform, which is this:

T_wo: [[-2.1821405e-04 -4.1810372e-03 9.0813711e-03 0.0000000e+00]
[-1.0132407e-03 9.0460125e-03 4.1404110e-03 0.0000000e+00]
[-9.9461414e-03 -8.2981185e-04 -6.2103639e-04 0.0000000e+00]
[ 3.0821344e-02 4.6327435e-03 1.4499093e+00 1.0000000e+00]]

Where does T_wo come from? Its columns are definitely not normalized, so this would propagate to your result (T_co).
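Plugging the T_wo quoted above into NumPy shows this directly: the column norms of the 3x3 block all come out to roughly 0.01, consistent with a uniform scale of about 0.01 (a quick check reusing the numbers from the post):

```python
import numpy as np

# T_wo as posted above (4x4, translation in the last row as printed)
T_wo = np.array([
    [-2.1821405e-04, -4.1810372e-03,  9.0813711e-03, 0.0],
    [-1.0132407e-03,  9.0460125e-03,  4.1404110e-03, 0.0],
    [-9.9461414e-03, -8.2981185e-04, -6.2103639e-04, 0.0],
    [ 3.0821344e-02,  4.6327435e-03,  1.4499093e+00, 1.0]])

# Norm of each column of the 3x3 rotation/scale block
col_norms = np.linalg.norm(T_wo[:3, :3], axis=0)
print(col_norms)  # each is ~0.01, i.e. a uniform scale of about 0.01
```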

The 3D bounding box annotator in Isaac Sim has a transform. It is T_wo, i.e. the object-to-world transformation.

It’s hard to say where the issue is from this information. You might try visualizing the reference frames at each point in the chain to help debug it?


With the following information of the annotators:

  1. 3D bounding box transformation
  2. camera_transformation
  3. camera_projection matrix

How can I get the RT matrix mentioned in the script attached below using the above information from the Isaac annotators?

Note: ‘RT’ is a numpy float64 array of shape (3, 4) containing homogeneous transformation matrices per object into camera coordinate frame.

        RT = np.linalg.inv(RT)                
        RTs[idx] = RT[:3]
        center_homo = cam_intrinsic @ RT[:3, [3]]
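For context on that snippet: if RT is a rigid 4x4 transform (orthonormal rotation block, no scale), the np.linalg.inv call is equivalent to this closed form, which makes explicit why a scaled, non-orthonormal rotation block breaks the assumption (a sketch, not the attached script's actual code):

```python
import numpy as np

def invert_rigid(T):
    """Closed-form inverse of a 4x4 rigid transform [R|t] (column-vector
    convention): inv = [[R.T, -R.T @ t], [0, 1]].
    Only valid when R is orthonormal, i.e. no scale or shear."""
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T        # for orthonormal R, R^-1 == R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv
```

If R carries a scale, R.T is no longer its inverse, so the closed form (and any code that implicitly assumes rigidity) gives a wrong answer.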

The issue is basically that when we scale the objects, the object-to-world transformation matrix is not orthogonal, and that leads to issues with downstream transformations like the object-to-camera transformation matrix.
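One way around this (a sketch, assuming column-vector convention and no shear): strip the scale out of T_wo by normalizing the columns of its 3x3 block before composing transforms, keeping the scale factors separately if the box extents are still needed:

```python
import numpy as np

def strip_scale(T):
    """Return (T_noscale, scale): the transform with each column of the
    3x3 block normalized to unit length, plus the removed per-axis scale.
    Assumes column-vector convention and no shear."""
    T = np.array(T, dtype=float)
    scale = np.linalg.norm(T[:3, :3], axis=0)
    T_noscale = T.copy()
    T_noscale[:3, :3] = T[:3, :3] / scale  # broadcasting divides each column
    return T_noscale, scale
```

After this, the rotation block of `T_noscale` is orthonormal again (for a proper scale-rotation-translation transform), so rigid-transform math like transpose-as-inverse applies.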

OK - let me try to understand - so you have a 3D bounding box tf, presumably in world coordinates? So let’s call it T_wo. Then you have a camera transform and a camera projection matrix, and you want to get the camera to object transform (T_co) and then put it in image space? Is that right?

And, if I am not mistaken, this works unless you alter the scale of the 3D bbox tf? Does it work when you alter the rotation or position of the 3D bbox tf?


Yes. But I need a rotation matrix and translation vector that represent the ground-truth pose of the objects, from the object to the camera coordinate frame.

T_wo represents object to world transformation matrix which we get from 3D bbox annotator.
T_co represents transformation matrix from object to camera.
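In the standard column-vector convention, these two definitions compose as T_co = T_cw @ T_wo with T_cw = inv(T_wc); if the annotator matrices are stored transposed (row-vector convention, as the .T calls in the posted code suggest), transpose them first. A minimal sketch of the column-vector case:

```python
import numpy as np

def object_to_camera(T_wc, T_wo):
    """Compose object->world with world->camera (column-vector convention):
    T_co = T_cw @ T_wo, where T_cw = inv(T_wc)."""
    return np.linalg.inv(T_wc) @ T_wo
```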

I calculated the camera intrinsic matrix manually, but the T_co is not correct. Since the objects are scaled in my scenario, T_wo is not orthogonal, and that's why the fragment of code below will not project correctly to the 2D image.

    RT = np.linalg.inv(RT)                
    RTs[idx] = RT[:3]
    center_homo = cam_intrinsic @ RT[:3, [3]]

Can anyone please point out the mistake in my code below for projecting onto the 2D image?

import numpy as np
import json
import cv2

def orthogonal_check(check_matrix, matrix_name):
    # R is orthogonal iff R @ R.T == I
    is_orthogonal = np.allclose(check_matrix @ check_matrix.transpose(), np.eye(3), atol=1.e-6)
    if is_orthogonal:
        print(f"The matrix {matrix_name} is orthogonal.")
    else:
        print(f"The matrix {matrix_name} is not orthogonal.")

def project2d(intrinsic, point3d):
    return (intrinsic @ (point3d / point3d[2]))[:2]

image_path = r""
json_path = r""
npy_path = r""

img = cv2.imread(image_path)

intrinsic = np.array([
    [659.22, 0.0, 320],
    [0.0, 641.37, 240],
    [0.0, 0.0, -1.0]])

with open(json_path, 'r') as file:
    data = json.load(file)

    camera_view_transform = data["cameraViewTransform"]
    camera_view_transform = np.array(camera_view_transform).reshape(4, 4)
    T_wc = camera_view_transform
    orthogonal_check(T_wc[:3, :3], "T_wc")

    camera_projection = data["cameraProjection"]
    camera_projection = np.array(camera_projection).reshape(4, 4)

BBtrans = np.load(npy_path, allow_pickle=True).tolist()
axis_len = 25

for entry in BBtrans:
    obj_id = entry[0] + 1
    transformation_matrix = entry[7]
    T_wo = transformation_matrix
    # Check if the matrix is orthogonal
    orthogonal_check(T_wo[:3, :3], "T_wo")

    T_co = T_wc.T @ T_wo.T
    orthogonal_check(T_co[:3, :3], "T_co")

    obj_pose = T_co
    obj_center = project2d(intrinsic, obj_pose[:3, -1])
    rgb_colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]
    for j in range(3):
        obj_xyz_offset_2d = project2d(intrinsic, obj_pose[:3, -1] + obj_pose[:3, j] * 0.001)
        obj_axis_endpoint = obj_center + (obj_xyz_offset_2d - obj_center) / np.linalg.norm(obj_xyz_offset_2d - obj_center) * axis_len
        cv2.arrowedLine(img, (int(obj_center[0]), int(obj_center[1])),
                        (int(obj_axis_endpoint[0]), int(obj_axis_endpoint[1])),
                        rgb_colors[j], thickness=2, tipLength=0.15)

cv2.imshow('Arrowed Line Image', img)
cv2.waitKey(0)  # Wait for a key press to close the image window