I got the answer from [A few snippets which might be useful to compute camera intrinsics]:
# compute focal point and center
focal_x = height * focal_length / vert_aperture
focal_y = width * focal_length / horiz_aperture
center_x = height * 0.5
center_y = width * 0.5
Now I can get the xyz coordinates of the points. If I'm right, these are in the camera coordinate system? So I need to transform them into the world coordinate system (using the camera's transformation matrix relative to the world coordinate system), and then I can match them against the coordinates I got for the 3D bbox above.
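To make the plan concrete, here is a minimal sketch of the two steps (back-project to camera coordinates, then move to world coordinates). It is not taken from the linked thread. The function names `backproject_to_camera` and `camera_to_world` are made up for illustration, and it assumes a plain pinhole model where x runs along image columns, y along image rows, the depth value is the distance along the optical axis, and the camera looks along +Z. USD cameras look along -Z with +Y up, and USD matrices use the row-vector convention, so an axis flip and a transpose may be needed before this matches the simulator's output.

```python
import numpy as np

def backproject_to_camera(depth, fx, fy, cx, cy):
    """Turn an (H, W) depth map into (H*W, 3) camera-frame points (pinhole model)."""
    h, w = depth.shape
    # u is the column index (image x), v is the row index (image y)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def camera_to_world(points_cam, cam_to_world):
    """Apply a 4x4 camera-to-world matrix (column-vector convention assumed)."""
    ones = np.ones((points_cam.shape[0], 1))
    homogeneous = np.hstack([points_cam, ones])   # (N, 4)
    return (cam_to_world @ homogeneous.T).T[:, :3]

# Toy example: a 2x2 depth map and a camera translated by (1, 2, 3)
depth = np.full((2, 2), 2.0)
pts_cam = backproject_to_camera(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [1.0, 2.0, 3.0]
pts_world = camera_to_world(pts_cam, cam_to_world)
```

If the resulting world points do not line up with the 3D bbox corners, the first things I would check are the axis convention (+Z vs -Z forward) and whether the matrix needs to be transposed.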