Replicator output data types - pose?

It would be very useful for “pose” to be an Output data type - along w/ rgb, instance, semantic_seg, etc.

As a question: is there a way to get “pose” out of Replicator Composer?

This would be used to feed an object pose algorithm (I’m currently using EfficientPose)

Thanks!
p

Hello @peter.gaston! I’ve reached out to the team about your questions. I will report back here when I have more information!

Hi @peter.gaston , you can use the transform data output by the bounding_box_3d annotator as the pose.

thx! lots of hidden data lurking around, eh? Excellent!

Hi @peter.gaston.

I hope you are doing well. I am also interested in working with 6D pose estimation deep learning methods. Were you able to train any model with your dataset? All the work I see is based on the YCB video dataset only.

Thanks,
Mayank

Not sure exactly what your question is - I’ll throw out some ideas - feel free to be more specific…

I have an ML model doing pose estimation using synthetic data (and real data). The synthetic data is composed primarily using Replicator. Per the topic here, Replicator does not expose the camera pose. However, if one sets the camera at, say, 0,0,0.5 pointing 0,0,90 (or whatever), then one can easily deduce the camera pose. In other words, don’t move the camera - move everything else. I created 65,000 images for my initial domain randomization training, and several thousand more so far to test in more reasonable conditions (see below).

For my case, it’s not really 6D pose. Given the exact environment (pallets in a warehouse), it’s really only X, Y and a yaw. The floor constrains the rest.
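
For readers who want to reduce a full transform to that planar pose, here is a minimal sketch. It assumes the 4x4 transform follows the row-vector convention (points multiply on the left, translation in the last row), which is consistent with the matmul order used later in this thread. The function name is hypothetical, and you should check how the annotator actually lays out its matrix (flattening, scale) before relying on it.

```python
import math

import numpy as np


def planar_pose_from_transform(xform):
    """Return (x, y, yaw_degrees) from a 4x4 transform.

    Assumption: row-vector convention (point @ xform), so the translation
    is in the last row and row 0 of the 3x3 block is the image of the
    object's local X axis.
    """
    xform = np.asarray(xform, dtype=float).reshape(4, 4)
    x, y = xform[3, 0], xform[3, 1]
    x_axis = xform[0, :3]
    # atan2 is unaffected by uniform positive scale on the axis.
    yaw = math.degrees(math.atan2(x_axis[1], x_axis[0]))
    return float(x), float(y), yaw


# Illustrative transform: 30 degree yaw about Z, then an offset of (1.2, -0.4, 0).
theta = math.radians(30.0)
example = [
    [math.cos(theta), math.sin(theta), 0.0, 0.0],
    [-math.sin(theta), math.cos(theta), 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.2, -0.4, 0.0, 1.0],
]
pose = planar_pose_from_transform(example)
```

With the illustrative matrix above, the function recovers x = 1.2, y = -0.4 and a yaw of 30 degrees, up to floating-point rounding.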

We’ve played with various models: EfficientPose; a two-stage approach with Mask R-CNN followed by either a direct-to-pose ML model or a geometry-based algorithm; and currently a key-point-based approach followed by a geometry algorithm. Your mileage will vary. We like the key-point approach because its failure modes are human-explainable, and it seems easier to work out how to further train the model to fix those failures.

So I would recommend using replicator to create a boat-ton of synthetic images to train on and work from there.

Example image:

Thanks a lot for such a quick and detailed reply. I have a few more questions; please bear with me, as I am new to this.

  1. For pose-related models is a camera pose required? (As you mentioned to keep the camera fixed)
  2. When I calculated the camera intrinsic matrix (using this guide) to convert the depth image to a point cloud, I noticed that the result is not accurate. How are you calculating the camera intrinsic matrix?

Is a camera pose required? Well, yes. You want the ground truth position of whatever you’re looking at in relation to the camera.

To get the camera matrix, I cheated. I used another method that works and outputs the camera pose (incl. the intrinsic matrix): see section 2.6 on the page https://docs.omniverse.nvidia.com/app_isaacsim/app_isaacsim/tutorial_replicator_recorder.html - except my code is:

    import omni.replicator.core as rep

    with rep.new_layer():
        camera = rep.create.camera()
        with rep.trigger.on_frame():
            with camera:
                rep.modify.pose(
                    position=(0.0, 0.0, 0.35),
                    rotation=(0, 0, 180),
                )
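
On the intrinsic-matrix question above, here is a minimal sketch of the usual pinhole calculation and depth back-projection. It is not the recorder tutorial's code. It assumes an ideal pinhole camera with square pixels, a principal point at the image centre, and an OpenCV-style camera frame (x right, y down, z forward). The function names and the numeric values are illustrative placeholders.

```python
import numpy as np


def pinhole_intrinsics(focal_length, horizontal_aperture, width, height):
    """3x3 intrinsic matrix for an ideal pinhole camera with square pixels.

    focal_length and horizontal_aperture must share a unit (e.g. mm).
    """
    fx = focal_length / horizontal_aperture * width
    fy = fx  # assumption: square pixels
    cx, cy = width / 2.0, height / 2.0
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])


def depth_to_points(depth, K):
    """Back-project a planar depth image (H, W) to (H*W, 3) camera-space points.

    `depth` must be distance along the optical axis, not radial distance
    from the camera centre.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


# Placeholder camera parameters and a flat synthetic depth image at 2.0 units.
K = pinhole_intrinsics(focal_length=24.0, horizontal_aperture=20.955,
                       width=1280, height=720)
points = depth_to_points(np.full((720, 1280), 2.0), K)
```

One possible cause of an inaccurate point cloud is feeding radial distance (distance to the camera centre) into a formula that expects planar depth (distance to the image plane), so it is worth checking which of the two your depth output contains. USD cameras also look down -Z with +Y up, so an axis flip is needed to move between that frame and the one assumed here.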

Hi @peter.gaston. I thank you for replying so quickly with all the details.

I assume that you are using the transform data from the 3D bounding box to get the rotation and translation of the object. Were you able to achieve good accuracy using the synthetic data?

If you mean good accuracy from the transforms, yes. Spot on. Simple matrix multiplication. threeDXForm is the 3d transform from Isaac. camTransform is what Isaac would return if they had implemented that.

        self.fixedXf = np.matmul(self.palletFacePts,self.threeDXForm)
        self.camPalletPts = np.matmul(self.fixedXf,self.camTransform)
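
As a self-contained illustration of those two multiplications, here is a minimal sketch with made-up numbers. It assumes homogeneous (N, 4) points, the row-vector convention (points on the left), and that camTransform is the world-to-camera matrix, i.e. the inverse of the camera's world pose. All values and helper names are hypothetical.

```python
import numpy as np


def translation(tx, ty, tz):
    """4x4 translation in row-vector convention (offset in the last row)."""
    m = np.eye(4)
    m[3, :3] = (tx, ty, tz)
    return m


# Hypothetical pallet-face corners in the object's local frame, homogeneous.
pallet_face_pts = np.array([
    [-0.5, 0.0, 0.00, 1.0],
    [0.5, 0.0, 0.00, 1.0],
    [0.5, 0.0, 0.15, 1.0],
    [-0.5, 0.0, 0.15, 1.0],
])

object_to_world = translation(2.0, 1.0, 0.0)      # stands in for threeDXForm
camera_to_world = translation(0.0, 0.0, 0.35)     # fixed camera at (0, 0, 0.35)
world_to_camera = np.linalg.inv(camera_to_world)  # stands in for camTransform

world_pts = pallet_face_pts @ object_to_world     # local frame -> world frame
camera_pts = world_pts @ world_to_camera          # world frame -> camera frame
```

Because the camera never moves, world_to_camera can be computed once and reused for every frame. A real camera pose would also include a rotation, which this translation-only sketch leaves out.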

If you mean good pose data, that depends on the algorithm. I currently use a two-stage approach. The first stage is key points - they’re averaging 1-2 pixels off (L2), which is very good, all things considered. Then I do some geometry to get pose, which is fine. So I’m meeting targets at present. My ML algorithm takes an input 1/3 the size of the image, so that loses pixels. And of course one can always be off by one pixel due to rounding. And the ML key point itself can lose some accuracy. So all in, I’m happy.
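
For anyone wanting to reproduce that kind of number, here is a minimal sketch of a mean L2 key-point error. It assumes predictions come out in the model's reduced-resolution pixel coordinates and are scaled back up before comparison with full-resolution ground truth. The function name and sample points are made up.

```python
import numpy as np


def mean_keypoint_error(pred_px, true_px, scale=1.0):
    """Mean Euclidean (L2) distance in pixels between matching key points.

    `scale` maps predicted coordinates into the ground-truth image
    resolution, e.g. 3.0 when the model input is 1/3 the image size.
    """
    pred = np.asarray(pred_px, dtype=float) * scale
    true = np.asarray(true_px, dtype=float)
    return float(np.linalg.norm(pred - true, axis=1).mean())


# Two made-up key points predicted at 1/3 resolution.
predicted = [(100, 50), (200, 80)]
ground_truth = [(301, 151), (600, 242)]
error = mean_keypoint_error(predicted, ground_truth, scale=3.0)
```

Note that with a 1/3-size input, a one-pixel slip at model resolution already costs three pixels at full resolution, which puts an average of 1-2 pixels in context.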