# Insight into world to camera transform for 3D bounding box

I am hitting a bug when I attempt to transform 3D bounding box points from the world frame into their 2D image projections. It seems as though I have a translation offset (maybe between the sensor origin and the image view?), as seen in the images below:

My math is as follows, using the information provided to me in the sampling_log.yaml file produced following data generation and knowledge of the camera coordinate system (+Y Up, -Z Forward):

1. I first find the camera's orientation in the world frame using the rotation of the camera object in the world frame output in the log file and a rotation into the (+Y Up, -Z Forward) camera coordinate system:

2. I then create the 4x4 homogeneous transform using the camera coordinates in the world frame output by the log file:

3. Then, for each cuboid point provided in the .npy file, I transform it into the camera frame as follows:

4. Then I pass the points in the camera frame into the following OpenCV function: cv2.projectPoints(cuboid, t_vec, r_vec, camera_intrinsic_mat, distortion_coef), where the camera intrinsics matrix comes from sampling_log.yaml. This provides me with the output seen above, and I am unsure where I might be going wrong.
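Putting steps 1-4 together, the chain looks roughly like this (placeholder pose values, since the real ones come from sampling_log.yaml; a manual pinhole projection stands in for the cv2.projectPoints call with zero rvec/tvec and zero distortion):

```python
import numpy as np

# Placeholder pose; the real values come from sampling_log.yaml.
camera_xyz = np.array([0.0, 0.0, 0.0])   # camera position in the world frame
w_R_c = np.eye(3)                        # step 1 result: camera rotation in the world frame

# Step 2: 4x4 homogeneous camera-to-world transform, inverted to map world -> camera
H_c2w = np.eye(4)
H_c2w[:3, :3] = w_R_c
H_c2w[:3, 3] = camera_xyz
H_w2c = np.linalg.inv(H_c2w)

# Step 3: one cuboid corner from the .npy file (world frame, homogeneous coords)
P_world = np.array([0.4, 0.1, 5.0, 1.0])
P_cam = (H_w2c @ P_world)[:3]

# Step 4: pinhole projection, equivalent to cv2.projectPoints with zero
# rvec/tvec and zero distortion coefficients
K = np.array([[1662.99, 0.0, 960.0],
              [0.0, 1281.95, 540.0],
              [0.0, 0.0, 1.0]])
uv = (K @ (P_cam / P_cam[2]))[:2]
```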

Any insight would be greatly appreciated! Many thanks in advance.

Hi Noelle,

I have a couple of questions:
Part 1: It appears you are trying to rotate the world frame, rather than rotating the camera object to the camera (both within the same world frame). Do Rx and Ry effectively accomplish the rotation from the camera object to the camera? If so, then shouldn’t Rx * Ry = cobj_R_c? In that case, w_R_c = w_R_cobj * cobj_R_c instead of w_R_c = Rx * Ry * w_R_cobj. I don’t have enough context on the difference between the camera object and the camera rotation, though, to make an accurate assessment of this.
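To make the ordering point concrete, here is a toy example with stand-in matrices (not your actual rotations); the two orders produce different frames:

```python
import numpy as np

def rot_x(deg):
    r = np.radians(deg)
    return np.array([[1, 0, 0],
                     [0, np.cos(r), -np.sin(r)],
                     [0, np.sin(r),  np.cos(r)]])

def rot_z(deg):
    r = np.radians(deg)
    return np.array([[np.cos(r), -np.sin(r), 0],
                     [np.sin(r),  np.cos(r), 0],
                     [0, 0, 1]])

w_R_cobj = rot_z(30)   # stand-in for the camera object's rotation in the world
cobj_R_c = rot_x(90)   # stand-in for Rx @ Ry

A = w_R_cobj @ cobj_R_c   # rotate camera object -> camera, expressed in the world
B = cobj_R_c @ w_R_cobj   # rotates the world frame instead
print(np.allclose(A, B))  # prints False: the order matters
```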

Part 2: Looks good.

Part 3: Looks good. But keep in mind that each point [Px, Py, Pz] should be a 3D bounding box corner in the global reference frame. Can you ensure that is the case? Before going to the bounding box corners, I suggest starting out with only the centroid of the bounding box (1 point instead of 8) and making sure it projects to the center of the object. This may help to debug if the issue is a bounding box offset rather than a cuboid pose issue. With that in mind, try passing the cv2.projectPoints function a single 3D point for each cuboid rather than 8 points.
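For instance, a sketch of that single-point check with a hand-made cuboid already in the camera frame (made-up numbers, and a manual pinhole projection in place of cv2.projectPoints):

```python
import numpy as np

# Hand-made cuboid in the camera frame: a center plus 8 corners (made-up values)
center = np.array([0.2, -0.1, 4.0])
half = np.array([0.5, 0.3, 0.4])
signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
corners = center + signs * half

K = np.array([[1662.99, 0.0, 960.0],
              [0.0, 1281.95, 540.0],
              [0.0, 0.0, 1.0]])

def project(p):
    # same as cv2.projectPoints with zero rvec/tvec and no distortion
    return (K @ (p / p[2]))[:2]

# Project 1 point (the centroid) instead of 8 and check it lands on the object
centroid_px = project(corners.mean(axis=0))
```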

Part 4: Looks good, but you didn’t mention the distortion coefficients. Did you set these to anything? Is the distortion negligible?

If you have a short snippet of python code with these computations, you can paste it here for me to reproduce so I can try to help you debug things.

-Henry

Hi Henry! Thanks for getting back to me! Let me know if this answers your questions:

Part 1: The Rx * Ry is meant to rotate from the camera origin to the camera view, so in my understanding Rx * Ry = cobj_R_c is what I was going for. The thought process was that I believed the coordinates provided for the camera rotation in the world frame were for the camera sensor itself and not for the camera view, and so this was my attempt to rotate the object into the camera view. I tried w_R_cobj * cobj_R_c too, and the result also had a constant offset.

However, I found the below lines of code in the /isaac_sim-2022.1.0/tools/composer/src/scene/asset/camera.py file this morning:

and so rather than doing Rx * Ry, I am taking the euler angles I get from the log file and doing exactly what is done in that code (camera_rpy + [90, 0, 270]). However, this produces a result that almost seems mirrored over the x-axis of the image, seen below:

But even then it isn’t an exact mirror, because there’s a slight translation offset.

Part 3: Yes, that is the case: the bounding box corners are in the global reference frame, and I transform a single point at a time; passing in only one point gives me an identical value.

Part 4: I set all distortion coefficients to 0 because my understanding was that they were negligible.

I will follow up with a short snippet of code with the images and log file I am pulling from!

Thanks Again,
Noelle

Okay! Following up with a folder of code that has the sampling_log.yaml file, a test image, a corresponding .npy file, and generate2dproject.py that does the calculations.

get2dprojection.zip (1022.1 KB)

The main methods that do the calculations are `transform_bb3d_data(bb3d_data_world_view, H_world2camera, camera_intrinsics)` , `get_4x4_mat(t, eul)`, and then `produce_frames(data, frame_id, class_id)` (which just gets the data from the .yaml file and adds the [90,0,270] to the retrieved rpy of the camera object). Let me know if this is sufficient!

Best,
Noelle

Hi Noelle,

I wasn’t able to fully debug it and make it work, but I do have a suggestion. The main thing I noticed is that you computed the `camera_rpy` rotation using np.add. I believe you were trying to add the x rotation of 90, a z rotation of 270, and the world-to-camera-object rotation of 3 floats from the dict. Was this your full implementation of part 1, plus the conversion to a rotation matrix? If so, I do not believe you can compose rotations this way. You need to use `np.matmul` between the different rotation matrices to do it. I attached a code sample here, but I don’t believe I have the correct Rx and Ry matrices to transform the camera to the camera object.
get2Dprojection.py (9.6 KB)
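To illustrate why the angle addition doesn’t work, here is a minimal comparison, using one common XYZ euler convention (an assumption on my part; your log’s convention may differ, but the point about addition vs. composition holds either way):

```python
import numpy as np

def euler_xyz_to_R(rpy_deg):
    # one common convention: R = Rz @ Ry @ Rx for XYZ (roll, pitch, yaw) angles
    x, y, z = np.radians(rpy_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(x), -np.sin(x)], [0, np.sin(x), np.cos(x)]])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    Rz = np.array([[np.cos(z), -np.sin(z), 0], [np.sin(z), np.cos(z), 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

camera_rpy = [10.0, 20.0, 30.0]  # made-up stand-in for the log values

added = euler_xyz_to_R(np.add(camera_rpy, [90, 0, 270]))              # adding angles
composed = euler_xyz_to_R([90, 0, 270]) @ euler_xyz_to_R(camera_rpy)  # composing matrices
print(np.allclose(added, composed))  # prints False: these are different rotations
```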

Also, are you sure that the values in the camera rotation of the dict are XYZ euler angles? If they are a different representation (e.g. direction cosines, ZYX angles, ZYZ angles) it could be causing problems. I added a new function to your code that explicitly converts a set of XYZ euler angles to a rotation matrix.

Hi Henry,

Thanks for the help! I used np.add() to try to replicate what the source code (/isaac_sim-2022.1.0/tools/composer/src/scene/asset/camera.py) does to the angles fed in from the parameter file, in hopes that it would give better bounding box visuals than manually doing the w_R_cobj @ cobj_R_c mentioned earlier.

I also confirmed that they were XYZ euler angles, as seen here:

I ended up scaling fx by -1 and fy by 1.3 to fix the mirror and the translation, which got me the following results. It isn’t a proper solution, though, because the scaling was chosen by intuition rather than from any values retrieved in code or from the simulator:

Thanks again for the help,
Noelle

Noelle,

This is helpful in finding the issue: your fx and fy should always be positive, so I believe the flip you fixed by negating fx is actually caused by something else.

Anyhow, did you happen to notice that your current value of fy (1281.95) * 1.3 is roughly equal to fx? I replaced your code with this:

``````
camera_intrinsics = np.array([
    # [1662.99, 0, 960],
    # [0, 1281.95, 540],
    # [0,       0,   1]
    # ])
    [1662.99, 0, 960],
    [0, 1662.99, 540],
    [0,       0,   1]
])
``````

And the results look pretty good. Is it possible there was an error in your computation of fy = 1281.95?

Hi Henry,

Thanks for the quick response. I didn’t even piece together that 1.3 * fy ≈ 1662.99, but you’re completely right that it works that way. I have double-checked the calculation for fy and it should be right, but I am also just pulling those values from the provided sampling_log.yaml file, where fx and fy are at the bottom:

I switched the rotation back to my original inclination, where it rotates the world axes (+X Forward, +Z Up) to the camera axes (-Z Forward, +Y Up), and that more or less fixed the mirror issue:

Thanks for all the insight! I guess my question is whether there is something in the simulator that is setting fy to fx for the camera?

Best,
Noelle

Hm, I’m not sure; I would have to go through the simulator code to see where that value comes from. But I’m glad you got things working to the degree you have so far!
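One thing that might explain it: if the pinhole intrinsics are derived from a single focal length plus film apertures, as with USD cameras (my assumption here, without having checked the simulator source), then square pixels would force fy to equal fx. A sketch with made-up numbers:

```python
import numpy as np

# Hypothetical USD-style camera properties; substitute your camera prim's values.
width, height = 1920, 1080
focal_length = 18.15          # mm (made up)
horizontal_aperture = 20.955  # mm (a common USD default)

# pinhole intrinsics from the physical camera model
fx = width * focal_length / horizontal_aperture

# square pixels tie the vertical aperture to the image aspect ratio...
vertical_aperture = horizontal_aperture * height / width
fy = height * focal_length / vertical_aperture

# ...which makes fy identical to fx
print(np.isclose(fx, fy))  # prints True
```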

Any idea why the center of projection is quite close but the orientation is not correct? Below is the code, with the correct pose of the object in Isaac Sim attached at the end.

``````
import numpy as np
import cv2
import json
import os
import matplotlib.pyplot as plt

def project2d(intrinsic, point3d):
    return (intrinsic @ (point3d / point3d[2]))[:2]

axis_len = 30

image_path = '/Test_Prop_test/000013.png'

json_path = '/Test_Prop_test/camera_params_000013.json'
npy_path = '/Test_Prop_test/bounding_box_3d_000013.npy'

img = cv2.imread(image_path)  # assumed: the image load was missing from the paste
image = np.array(img)
height, width = image.shape[:2]

cam_intrinsic = np.array([
    [641.37, 0.0, 320],
    [0.0, 659.22, 240],
    [0.0, 0.0, 1.0]])

with open(json_path, 'r') as file:
    data = json.load(file)  # assumed: the JSON parse was missing from the paste

camera_view_transform = data["cameraViewTransform"]
camera_view_transform = np.array(camera_view_transform).reshape(4, 4)
T_cam_world = camera_view_transform.T
print("T_cam_world : ", T_cam_world)

camera_projection = data["cameraProjection"]
camera_projection = np.array(camera_projection).reshape(4, 4)
camera_projection = camera_projection.T

BBtrans = np.load(npy_path)  # assumed: the annotator .npy load was missing from the paste
for entry in BBtrans:
    obj_id = entry[0] + 1
    transformation_matrix = entry[7]
    T_obj_to_world = transformation_matrix.T

    if obj_id == 1:
        # Calculate T_world_obj by taking inverse of T_obj_world
        T_world_to_obj = np.linalg.inv(T_obj_to_world)
        # Calculate T_world_to_cam by taking the inverse of T_cam_to_world
        T_world_to_cam = np.linalg.inv(T_cam_world)
        T_cam_to_obj = T_world_to_obj @ T_world_to_cam
        RT = np.linalg.inv(T_cam_to_obj)
        RT_centre = RT[:3, -1]
        RT_centre = cam_intrinsic @ (RT_centre / RT_centre[2])

        RT_centre = RT_centre[:2]
        print("obj_center: ", RT_centre)
        cv2.circle(img, tuple(RT_centre.astype(int)), radius=2, color=(0, 255, 0), thickness=5)

        rgb_colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]
        for j in range(3):
            obj_xyz_offset_2d = project2d(cam_intrinsic, RT[:3, -1] + RT[:3, j] * 0.001)
            obj_axis_endpoint = RT_centre + (obj_xyz_offset_2d - RT_centre) / np.linalg.norm(obj_xyz_offset_2d - RT_centre) * axis_len
            print("obj_xyz_offset_2d: ", obj_xyz_offset_2d)
            print("obj_axis_endpoint", obj_axis_endpoint)
            cv2.arrowedLine(img, (int(RT_centre[0]), int(RT_centre[1])), (int(obj_axis_endpoint[0]), int(obj_axis_endpoint[1])), rgb_colors[j], thickness=2, tipLength=0.15)

img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img_rgb)
plt.title('Image with Arrow')
plt.axis('off')
plt.show()
``````

Hi,

Which orientation and center of projection are you referring to?

The last image is the desired one, and the first image is the one calculated using the annotator information.

I am referring to projecting the 3D pose of the object from camera space onto the 2D image.

So basically I need the object-to-camera pose (a 3x3 orientation and a 3x1 translation) to use as the ground-truth pose for the training part of my pose estimation.

It seems that the 3D bounding box params and the camera params are not sufficient to get the ground-truth pose of all the objects of interest in the image.

Ah, I see. Yes, debugging transforms :) this can be hard to do without having the exact same setup as you and going through the transforms one by one. It looks like whatever rotations you have in your kinematic chain have positioned the box appropriately, which seems to indicate most of your rotations are correct, but perhaps not the last one, because the last rotation does not affect the position.

It’s also possible that you have some intermediate transforms that are wrong but coincidentally they are positioning the frame at the correct position.

Can you visualize all of the reference frames in the kinematic chain and annotate them so I know which TF corresponds to each?

When I generate synthetic data from Replicator using a camera that looks at a certain asset, as in the script below:

``````
with camera:
    rep.modify.pose(
        position=rep.distribution.sequence(camera_positions),
        look_at="/env/cracker_box"
    )
``````

The camera points only at the center of the particular asset "/env/cracker_box" and doesn’t consider the pose of other assets. If we look at all the assets, it takes the mean point of all the assets, and the camera is directed towards that mean point rather than towards the center of any one asset.

So if I try to draw the other objects’ local coordinate axes, they are not displayed correctly on the 2D image, because their centers are not calculated correctly: the view transformation matrix is only suitable for the cracker box, or whatever you look_at with the Replicator camera while generating the synthetic data.

Still, I am unable to figure out whether it is possible to get a single camera view transform that works for all the objects in an image with a single camera.

Hi, you should be able to provide the `look_at` a distribution, i.e. put a list of prim paths inside of a distribution sequence so it can focus on a different object each frame.
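For example, an untested sketch of what I mean (`camera` and `camera_positions` are the names from your earlier snippet):

```
import omni.replicator.core as rep

with camera:
    rep.modify.pose(
        position=rep.distribution.sequence(camera_positions),
        look_at=rep.distribution.sequence(["/env/cracker_box", "/env/blue_can"]),
    )
```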

In terms of getting the transforms of all the objects in the scene, we unfortunately don’t have that exposed yet, but I can provide you some code to help. I have done some 3D bbox projections with the Replicator data. I assume you have been using the BasicWriter?

Parsing bbox data:

``````
import numpy as np

class ObjectDetection:  # this can store both 2D and 3D bounding box data
    def __init__(self, object_type=None, npy_bbox2d=None, npy_bbox3d=None):
        if object_type is not None:
            self.type = object_type
        else:
            self.type = None
        if npy_bbox2d is not None:
            self.xmin = npy_bbox2d[1]
            self.ymin = npy_bbox2d[2]
            self.xmax = npy_bbox2d[3]
            self.ymax = npy_bbox2d[4]

        if npy_bbox3d is not None:
            self.xmin = npy_bbox3d[1]  # these mins and maxes specify the bbox extents of the canonical shape size
            self.ymin = npy_bbox3d[2]
            self.zmin = npy_bbox3d[3]
            self.xmax = npy_bbox3d[4]
            self.ymax = npy_bbox3d[5]
            self.zmax = npy_bbox3d[6]
            self.T = np.transpose(npy_bbox3d[7])  # the diagonal elements scale the mins and maxes of the bbox extents
``````

Parsing camera params:

``````
import numpy as np

class CameraParams:  # this stores camera params, which are required for projecting a 3d bounding box into image space
    def __init__(self, camera_params):
        self.cameraFisheyeMaxFOV = camera_params["cameraFisheyeMaxFOV"]
        self.cameraFisheyeNominalHeight = camera_params["cameraFisheyeNominalHeight"]
        self.cameraFisheyeNominalWidth = camera_params["cameraFisheyeNominalWidth"]
        self.cameraFisheyeOpticalCentre = camera_params["cameraFisheyeOpticalCentre"]
        self.cameraFisheyePolynomial = camera_params["cameraFisheyePolynomial"]
        self.cameraModel = camera_params["cameraModel"]
        self.cameraNearFar = camera_params["cameraNearFar"]
        self.cameraProjection = np.transpose(np.array(camera_params["cameraProjection"]).reshape(4, 4))
        self.cameraViewTransform = np.transpose(np.array(camera_params["cameraViewTransform"]).reshape(4, 4))
        self.metersPerSceneUnit = camera_params["metersPerSceneUnit"]
``````

Project to image coords:

``````
# get_bbox_3d_corners, project_pinhole, and project_fish_eye_polynomial are
# helpers defined elsewhere in this code
def proj_3dbbox_to_image_coords(bbox3d_list, camera_params, im_size, im_center):
    r"""
    Takes as input a list of 3D bounding boxes in the world frame and projects them into 2D using the camera extrinsics.

    Args:
        bbox3d_list: A list of 3D bounding boxes in the world frame, each of which contains min / max extents for x, y, z, and a 4x4 world transform.
        camera_params: Camera parameters that include a view transform matrix and a projection matrix.
        im_size: The size of the image in pixels.
        im_center: Coordinates for the center of the image (H/2 and W/2).

    Returns:
        A list of 2D image coordinates for all 8 corners of each bounding box.
    """
    im_size_proj = list(im_size)

    data = {}
    data["bounding_box_3d"] = {}
    data["bounding_box_3d"]["x_min"] = []
    data["bounding_box_3d"]["y_min"] = []
    data["bounding_box_3d"]["z_min"] = []
    data["bounding_box_3d"]["x_max"] = []
    data["bounding_box_3d"]["y_max"] = []
    data["bounding_box_3d"]["z_max"] = []
    data["bounding_box_3d"]["transform"] = []

    for bbox3d in bbox3d_list:
        data["bounding_box_3d"]["x_min"].append(bbox3d.xmin)
        data["bounding_box_3d"]["y_min"].append(bbox3d.ymin)
        data["bounding_box_3d"]["z_min"].append(bbox3d.zmin)
        data["bounding_box_3d"]["x_max"].append(bbox3d.xmax)
        data["bounding_box_3d"]["y_max"].append(bbox3d.ymax)
        data["bounding_box_3d"]["z_max"].append(bbox3d.zmax)
        data["bounding_box_3d"]["transform"].append(np.transpose(bbox3d.T))

    bbox3d_corners = get_bbox_3d_corners(data["bounding_box_3d"])
    bbox3d_proj = []

    for i, bbox3d in enumerate(bbox3d_list):
        points_in_world_frame = bbox3d_corners[i]
        if camera_params.cameraModel == "fisheyePolynomial" or camera_params.cameraModel == "ftheta":
            points_proj = project_fish_eye_polynomial(points_in_world_frame, camera_params)
            im_size_proj = [camera_params.cameraFisheyeNominalWidth, camera_params.cameraFisheyeNominalHeight]
            # print(points_in_world_frame, points_proj, im_size_proj, im_center, "points")
        else:
            points_proj = project_pinhole(points_in_world_frame, im_center, camera_params)
        bbox3d_proj.append(list(points_proj))

    return bbox3d_proj, im_size_proj
``````

Overlay on the image:

``````
import numpy as np
import omni.ui as ui
from omni.ui import scene as sc
# `rv` is our own viewer module (check_coord / find_intersection arguments
# were truncated in this paste)

def sceneOverlay3dbbox(im_size, bbox3d_proj, bbox3d_data_list, bbox_tags):
    r"""
    Creates 3D bounding box overlays for each bounding box in the image using the omni.ui.scene library.

    Args:
        im_size: The size of the image in pixels.
        bbox3d_proj: A list of lists, where each sublist contains points for the 8 corners of the 3d bounding box projected in image space.
        bbox3d_data_list: A list of 3D bounding box data, where each bounding box includes a type attribute to denote the class tag.
        bbox_tags: A list of the overlay class tags that should be included in the overlay. If bbox_data.type is not in bbox_tags, it will not appear.
    """

    # move the bounding boxes up or down depending on aspect ratio of the image
    # (assumed reconstruction of the branch bodies lost in the paste)
    x_adj, y_adj = 1.0, 1.0
    if im_size[1] > im_size[0]:
        x_adj = float(im_size[0]) / float(im_size[1])
    if im_size[1] < im_size[0]:
        y_adj = float(im_size[1]) / float(im_size[0])

    im_aspect_ratio = float(im_size[0]) / float(im_size[1])
    im_scale_factor = 2 * max([im_aspect_ratio, 1.0 / im_aspect_ratio])

    scene_view = sc.SceneView(
        aspect_ratio_policy=sc.AspectRatioPolicy.PRESERVE_ASPECT_FIT, screen_aspect_ratio=im_aspect_ratio
    )
    with scene_view.scene:
        with sc.Transform(transform=sc.Matrix44.get_scale_matrix(im_scale_factor, im_scale_factor, im_scale_factor)):
            for idx in range(len(bbox3d_data_list[0])):
                point_set = bbox3d_proj[idx]
                bbox_data = bbox3d_data_list[0][idx]
                if bbox_tags[bbox_data.type] == True:
                    # colour = data_to_colour(bbox_data.type, 0.6)
                    colour = list(np.array(rv.overlay.data_to_colour(bbox_data.type)) / 255.0) + [0.6]
                    u = {}
                    v = {}
                    for corner in range(8):
                        u[corner] = point_set[corner][0] / max(im_size) - 0.5 * x_adj
                        v[corner] = -point_set[corner][1] / max(im_size) + 0.5 * y_adj

                    line_coord_list = [
                        [u[0], v[0], u[1], v[1]],
                        [u[0], v[0], u[2], v[2]],
                        [u[0], v[0], u[4], v[4]],
                        [u[1], v[1], u[3], v[3]],
                        [u[1], v[1], u[5], v[5]],
                        [u[2], v[2], u[3], v[3]],
                        [u[2], v[2], u[6], v[6]],
                        [u[4], v[4], u[5], v[5]],
                        [u[4], v[4], u[6], v[6]],
                        [u[5], v[5], u[7], v[7]],
                        [u[7], v[7], u[6], v[6]],
                    ]

                    for line_coord in line_coord_list:
                        # check if both endpoints are inside the image
                        c1 = rv.overlay.check_coord(
                        )
                        c2 = rv.overlay.check_coord(
                        )

                        if c1 is True and c2 is True:
                            sc.Line(
                                start=(line_coord[0], line_coord[1], 100),
                                end=(line_coord[2], line_coord[3], 100),
                                color=colour,
                                thickness=2,
                            )
                        else:
                            if c1 is False and c2 is False:
                                continue

                            x_point = rv.overlay.find_intersection(
                            )

                            if c1 is False and c2 is True and x_point[0] is not None:
                                sc.Line(
                                    start=(x_point[0], x_point[1], 100),
                                    end=(line_coord[2], line_coord[3], 100),
                                    color=colour,
                                    thickness=2,
                                )
                            elif c1 is True and c2 is False and x_point[0] is not None:
                                sc.Line(
                                    start=(line_coord[0], line_coord[1], 100),
                                    end=(x_point[0], x_point[1], 100),
                                    color=colour,
                                    thickness=2,
                                )

            for idx in range(len(bbox3d_data_list[0])):
                point_set = bbox3d_proj[idx]
                bbox_data = bbox3d_data_list[0][idx]
                if bbox_tags[bbox_data.type] == True:
                    # colour = data_to_colour(bbox_data.type, 0.6)
                    colour = list(np.array(rv.overlay.data_to_colour(bbox_data.type)) / 255.0) + [0.6]
                    u = {}
                    v = {}
                    for corner in range(8):
                        u[corner] = point_set[corner][0] / max(im_size) - 0.5 * x_adj
                        v[corner] = -point_set[corner][1] / max(im_size) + 0.5 * y_adj
                    u_label = min([u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7]])
                    v_label = max([v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]])

                    if -0.5 * y_adj - 0.02 <= v_label <= 0.5 * y_adj and -0.5 * x_adj - 0.02 <= u_label <= 0.5 * x_adj:
                        with sc.Transform(transform=sc.Matrix44.get_translation_matrix(u_label, v_label, 0)):
                            sc.Label(bbox_data.type, alignment=ui.Alignment.LEFT_BOTTOM, color=colour, size=14)
``````

Let me know if you get stuck again!

Many thanks!
Thank you very much hclever. It helps me a lot.

Last query:

Let’s say I have two assets and my camera is looking at them just as in the code below, and I want the transformation matrix from the camera to each model/object, just like the sample output specified at the end of the post. It’s a homogeneous transformation matrix, as the norm of each column of the rotation matrix is equal to 1.

``````
with camera:
    for asset in ["/env/cracker_box", "/env/blue_can"]:
        rep.modify.pose(
            position=rep.distribution.sequence(camera_positions),
            look_at=asset
        )
``````

Desired Output:
"cam_R_m2c": [
    0.8180317284610569, -0.5726349904075037, 0.05397461432945827,
    -0.32806818289014184, -0.5416101454156569, -0.773970101333373,
    0.47243556026846245, 0.6154247461256569, -0.6309176041687177
],
"cam_t_m2c": [
    -0.2086800870850169,
    -0.6871454771379741,
    0.5223745549200488
],

Object to camera homogeneous transformation matrix

``````
with open(json_path, 'r') as file:
    data = json.load(file)  # assumed: the JSON parse was missing from the paste

camera_view_transform = data["cameraViewTransform"]
camera_view_transform = np.array(camera_view_transform).reshape(4, 4)
T_cw = camera_view_transform

camera_projection = data["cameraProjection"]
camera_projection = np.array(camera_projection).reshape(4, 4)

for entry in BBtrans:
    transformation_matrix = entry[7]
    T_wo = transformation_matrix

    T_co = T_cw @ T_wo
    print(T_co)
    T_co = T_co[:3, :3]
``````

The output is the object-to-camera transform below. The rotation columns don’t have norm 1, so it’s not the same kind of homogeneous transform as the "cam_R_m2c" I posted.

[[-0.00321441 -0.00882502 -0.00343316 0. ]
[ 0.00893321 -0.00162352 -0.00419071 0. ]
[ 0.00314093 -0.00441398 0.00840544 0. ]
[ 0.02578111 0.00803663 1.3932508 1. ]]
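As a quick check on those columns (the 3x3 block copied from the printout above): each column has norm ≈ 0.01, which points to a uniform scale folded into the transform (my guess is a scene-unit conversion) rather than a malformed rotation; dividing the scale out leaves an orthonormal matrix:

```python
import numpy as np

# 3x3 block of the printed transform (translation row dropped)
M = np.array([[-0.00321441, -0.00882502, -0.00343316],
              [ 0.00893321, -0.00162352, -0.00419071],
              [ 0.00314093, -0.00441398,  0.00840544]])

col_norms = np.linalg.norm(M, axis=0)
print(col_norms)                       # each is ~0.01, i.e. a uniform scale

R = M / col_norms                      # divide the scale out of each column
print(np.allclose(R.T @ R, np.eye(3), atol=1e-3))  # prints True: orthonormal
```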