3D object pose estimation with Pose CNN decoder taking too long to train

I followed https://docs.nvidia.com/isaac/isaac/packages/object_pose_estimation/doc/pose_cnn_decoder.html to train a pose estimator for my 3D model, but it takes too long to train. To give some metrics: training for 50,000 iterations took 2 days 11 hours to complete on my RTX 2080 Super, and the loss was still high at 0.8822.

Hello RKJha, can you share more details on your setup?

Can you provide more information on which object you used for training? It sounds like your model has not converged and simply hit the maximum number of iterations. Also, have you tried visually checking the results, or using an accuracy metric rather than just looking at the loss?

I used it for a custom object: an assembly-line part, a CPU fan in particular. The loss value had been stagnant for a long time; after 30,000 iterations it was decreasing by only about 0.01 every 10,000 iterations. I also checked the results by running inference on it.

We have now shifted to training on duplo blocks with a new setup.

Training details:
Achieved loss: 0.5
Iterations: 50K
GPU: RTX 2060 Super
Training time: 1.5 days

Judging by the decoder output and reconstructed images, I believe the model is trained properly.
[images: input_image, decoder_output, reconstructed_image]

But the inference result shown below doesn’t seem right.
[image: inference result]

Issue 1: I suspect that we are not doing anything wrong in training, but are making a mistake in setting the config parameters for inference.

We observe that the pose bounding box is always at the optical centre of the image. The documentation says that we need to provide the 3D bounding box size at zero orientation and the transformation from the object centre to the bounding box centre, but there is not much detail on how exactly to use the bounding box prediction from the object detection model to calculate the required transformation. A brief description of how to configure these parameters would be really helpful.

I’ve also looked into the Feature Map Explorer (https://developer.nvidia.com/blog/improving-the-isaac-3d-pose-estimation-model-with-feature-map-explorer/), but that tool seems aimed at choosing the right hyperparameters for training and doesn’t cover inference parameters.

Issue 2: Is a training time of 1.5 days for 50K iterations normal? And the loss achieved is still only 0.5.

Hi rkjha1,

Training time
Yes, the Pose CNN takes a while to train. The training time could be around that if you are using only one Sim instance to feed data to the training pipeline, because the data rate coming out of Sim is limited. Depending on your GPU utilization it might be fine to run more than one Sim instance, though I would say this is more advisable on a setup where you have more than one GPU available.
It would be interesting to check your GPU utilization during training (nvidia-smi) and the data rate you are getting:

  • In packages/object_pose_estimation/apps/pose_cnn_decoder/training/training.app.json you can find a ChannelMonitor component in the node pose_estimation_training_samples. You should be able to plot this in Isaac Sight and see the data rate you are getting
  • Another thing you can do is run the scene in binary mode instead of the Editor to increase the frame rate. Once you have your setup/scene ready, build your project (Isaac -> Build Scenes in Folder or Build Scenes in List). This gives you a project similar to the ones in isaac_sim_unity3d/builds/: a project_name_Data/ folder and a project_name.x86_64 binary file. You can then run it as you would the default scenes described in 3D Object Pose Estimation with Pose CNN Decoder
  • You can also disable the GUICamera for a higher FPS. In the Unity Editor, the pose_cnn_decoder_training scene has an object called GUICamera. Click on it and disable the Camera in the Inspector (you can do this before building your scene)

Other than that, you can always check the output of the reconstructed images in TensorBoard; you might not really need the 50K iterations. You can stop midway, run inference, and if you are not happy with the result just restart the training from the previous checkpoint.


Either way, your achieved training loss looks quite high indeed. What loss is that? The total loss?
A couple of checks: Are you changing the camera parameters in Sim? Is the image dimension the same?
If you change the image dimensions you should change the parameters of some components as well:

  • For training: packages/object_pose_estimation/apps/pose_cnn_decoder/training/pose_estimation_sim.subgraph.json -> bounding_box_padding
  • For inference: in packages/object_pose_estimation/apps/pose_cnn_decoder/pose_estimation_cnn.subgraph.json, detection_convertor -> BoundingBoxEncoder. This codelet has a parameter named image_dimensions (see the sketch after this list)
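
For illustration, a config override for that component could look roughly like the following. This is only a minimal sketch: the node and component names are the ones mentioned above, the 720x1280 values are an example matching the default training resolution, and the exact nesting may differ between SDK versions, so compare against the shipped subgraph file.

  {
    "config": {
      "detection_convertor": {
        "BoundingBoxEncoder": {
          "image_dimensions": [720, 1280]
        }
      }
    }
  }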

Other important parameters for inference are in the Detections3Viewer configuration, for example in packages/object_pose_estimation/apps/pose_cnn_decoder/detection_pose_estimation_cnn_inference_kltSmall.config.json (a sketch follows the list below):

  • box_dimensions should approximately match your real bounding box dimensions
  • object_T_box_center: reflects where the centre of the object is located in the prefab you have in Sim. If the origin of the object is already the origin of the prefab, then this should have translation all 0’s and rotation 1, 0, 0, 0
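
As a rough sketch of such an entry (the node name "viewers", the placeholder box dimension values, and the [qw, qx, qy, qz, x, y, z] ordering for the pose are assumptions based on the description above; please check the sample dolly/box config files for the exact node name and pose format in your release):

  {
    "viewers": {
      "Detections3Viewer": {
        "box_dimensions": [0.063, 0.031, 0.019],
        "object_T_box_center": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
      }
    }
  }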

From your training and inference images, I see one problem for sure: the network is not trained enough, as the decoder output is still blurry. One point to add to Teresa’s suggestions, which address most of the problems you have: if the network is trained enough, you should be able to see the decoder output image (the middle one) much more clearly, very close to the ground-truth image you have on the right, unless there are extreme occlusions. So from what I see, the network is not trained enough. I would suggest reducing the learning rate after the first 15,000 iterations or so, for example to 1e-4 (you can do this in training_config.json, as sketched below), then training further and checking; the training loss should ideally reduce further.
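
As a rough sketch of that edit (the key name in training_config.json is an assumption here; check your local file for the exact name used by your SDK version):

  {
    "learning_rate": 1e-4
  }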
A few things to keep in mind, in addition to what Teresa has already mentioned:

  1. The camera parameters used in Sim for training must match the real camera you are using for inference. These are set by default to RealSense parameters in the scene; you can change them in the camera GameObject under procedural_camera in the scene. I would suggest first doing inference in Sim and making sure it works there before moving to real-world inference.
  2. You must do inference on an object of exactly the same size as the one you trained with in simulation, so different-size lego blocks won’t work, since we rely on the CAD file and camera parameters to connect the 2D image to the 3D pose. It is very important that both the object size and the camera parameters match between training and inference.

Connecting the 2D detections to compute the 3D pose is already taken care of in the inference app. You just need to provide the relevant config params, as you can see in the sample config files for the dolly and the box.

Thank you TeresaC and snimmagadda for your feedback and support.

Updates

Isaac Sim Setup
We went back to our training setup and configured the Unity camera parameters to match our real-world camera source. For testing, we used an iPhone 11 Pro. We found the focal length and sensor size (52 mm and [45, 45] respectively) and configured these camera parameters for all the cameras in the scene. We left the image dimensions at their defaults but made sure to scale our testing images correctly (with padding to maintain the aspect ratio). We also made sure the duplo block CAD model we were using is to scale.

Training Setup
For our training, we ran through a few attempts before finding values that sped up training. In our latest attempt, the batch size was set to 4 and the learning rate was set to 1e-4 from the start. We also kept noise off for the entire training time. We found this was the best approach to keep the decoder output from converging to white while the loss continued to improve.

Results
We trained our model for 59K iterations and achieved a total loss of 0.17 in 2 days and 19 hours. Below you can see a sample of the decoder output.

When we tested the model, we found significant error in the translation and size. The estimated rotation appears to be correct, but the position does not. Additionally, there are some instances where the object faces 180 degrees in the wrong direction, but that is likely due to the simplistic nature of the duplo block itself. We tested on both synthetic and real-world images. We also made sure to edit the configuration files accordingly, setting the translation and rotation values to match the prefab. We are not sure where this error is being introduced or what steps we could take to reduce it.

Next Steps
We’re not entirely sure what else we could do to improve the model results.

We believe we set the focal length values for training correctly, but perhaps the values we used for inference are off. We saw that the default focal length value is in pixels, so we used an online calculator to convert 52 mm to roughly 196.53 pixels (X). When we used 196.53, we found the estimated pose to be off. To generate better estimated poses like the ones pictured above, we used the default inference value (925.74). The calculator may have given the wrong value, but if it did, how should we calculate the value ourselves?

What other sources of error could there be for the translation and size issues in the images?

One final note: Reproducibility
Since training these models takes a while, we figured we should have multiple people training models with different configuration values to speed up experimentation. Unfortunately, despite having identical training configurations, the model results are not reproducible across all our devices. For example, our workstation with an RTX 2060 Super and our other workstation with an RTX 2080 Super are able to train the duplo block pose model fairly well. On our workstation with a Quadro P6000, the same configuration and setup produces an unusable model, with the decoder output always converging to white. We’re not sure why we can’t reproduce the results on the Quadro P6000 station, since the configurations are the same and the Isaac Sim scene is largely the same (different camera parameters, but the object is still always visible). The only big difference (aside from the cards themselves) is that the Quadro P6000 has a slightly older version of Isaac Sim, though it does match the Isaac SDK version. Is there anything known that could cause such a discrepancy?

Recap
To recap, the issues we are still encountering are as follows:

  1. Calculating the correct focal length for inference
  2. Significant error in translation & size
  3. Reproducibility across different hardware configurations

Thank you for your time.

In our last update, we mentioned three problems that we were facing.

  1. Calculating the correct focal length for inference
  2. Significant error in translation & size
  3. Reproducibility across different hardware configurations

We solved the issue of calculating the correct focal length for inference, and we are now also able to reproduce the same results across different hardware configurations. One issue we are still observing is the error in translation.

Previously, we always observed our pose bounding box at the centre of the image, i.e. the translation was never being learnt.

We made a change to our training dataset to overcome this. Instead of always spawning the object at the centre of the image, as is done for the dolly, we spawned our object randomly in the scene. By doing this, we could see that the translation was slowly being learnt.
[image: pasted image 0]

Current issues:

  1. Rotation is not being learnt in all cases, especially when the duplo block is placed as shown below.
    [image: pasted image 0 (1)]
  2. There is a translation offset in some frames, even after training for 117,500 iterations (loss = 0.2).
  3. We were decreasing the learning rate whenever we observed that the loss had saturated, but even after 117,500 iterations the loss is 0.2 and does not go down any further.

We have tried everything we could think of. It would be a great help if the issues mentioned above could be addressed.

Hi,

Isaac Sim Setup
Regarding the translation issue, yes, you are right: the object needs to be spawned all across the frame for translation to be learnt correctly. We have added this note, and the params to change, to the documentation edits for the upcoming release next month. We have also added a range and standard deviation to the Target config in procedural camera > CameraGroup so that objects are no longer spawned at the centre all the time. However, this range and standard deviation, along with the camera viewpoint distance ranges, need to be set depending on the size of the object of interest instead of using the default values provided in the scene.

The model only gets as good as the training data it is given. So I highly recommend playing around with the Uniform Range and Standard Deviation values for the Position and Target configurations in procedural camera > Camera Group in the scene, making sure that enough samples come in within the distance range of interest for your application (how close the target usually is to the camera), and that the object is spawned across the image frame without going out of frame too often. The default values are set for the dolly, which is much larger than the blocks you are using, so you might not be receiving block data from a close enough range.

Regarding the rotation error, another thing you can do is randomize the orientation of the object every frame, as in the video linked below. Drag the object into the scene instead of using the scenario manager, add Randomizer Group and TransformRandomizer components to the object, and also add the RigidBodiesSink script to the object (set the Rigid Body size to 2 and add the CameraFrame as the first body and this object as the second body by dragging them from the panel into the Element 0 and Element 1 boxes on the right), as shown in this video: https://drive.google.com/file/d/194qsZqkEEZgxUa7Ld2bLZt1ykt7HQ4ji/view?usp=sharing
Set a full 360-degree rotation range for the x, y and z axes in Transform Randomizer > Uniform Range.
You can set the Frame Interval to 5 or so, so that at every camera position 5 different rotations are recorded, giving more rotation samples per translation.

Training Time
For the upcoming release we have also added support for offline training, so that you can save the images beforehand and train on the saved images. This reduces training time in case you don’t have access to multiple GPUs, which can be costly for online training.

Camera parameters and Image resolution
We added support for the inference focal length being different from the training focal length, as long as the training aspect ratio (height 720 : width 1280) is maintained during inference. However, if the focal lengths vary significantly, we do advise training with the target inference camera’s focal parameters. So please try inference again with the latest release.
Just to confirm, the focal lengths provided in the scene are in pixel units.
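
For reference, the usual pinhole-camera conversion (general camera knowledge, not something Isaac-specific) is:

  f_px = f_mm * image_width_px / sensor_width_mm

and likewise for the vertical axis with the image height and sensor height. Note that if the focal length you have is a 35 mm-equivalent value, it should be paired with the 36 mm full-frame sensor width in this formula rather than the physical sensor width.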

Can you add the 2D bounding boxes computed for the above pose inference? I just want to make sure that the 2D boxes look correct and are not contributing to the pose error.
On another note, the upcoming release uses TensorRT 7.1, which supports only a fixed batch size, so pose estimation with this model will support only a single instance for now. Please make sure to first try it out with only a single block in the image.


Thank you for getting back to us. I’ve tried to change the Unity setup as shown in the video to achieve different orientations for my custom object, but I’m not able to replicate the same results. I’m attaching a video that shows my Unity settings. You can see that the duplo block is still being spawned with the same orientation. Please let me know if I missed something.
https://drive.google.com/file/d/15t915PfFi40U6k13-eDATG5NO1-6bkbN/view?usp=sharing

You need to click and select the Randomize on Update config option in the Randomizer Group. If you want to reset the randomization seed every time you start, you can click and select the Randomize on Start option as well.

Thank you. Selecting ‘Randomize on Update’ worked for me; for some reason ‘Randomize on Start’ didn’t. Anyway, I’m able to achieve different orientations for the object now!

The inference results improved; rotation and translation prediction is better than before. But I have two concerns.

  1. The result on a synthetic image generated from the Unity setup is shown below.

    The result on a real-world image is:

There is a considerable translation offset and a slight rotation offset in almost all real-world images.

  2. Training was done for 92K iterations but the loss is still 0.3. The loss was not decreasing gradually; it fluctuated a lot. The batch size is 4 and the learning rate is 1e-4. If I try to decrease the learning rate, the decoder output goes completely black, so I kept the learning rate at 1e-4.

How can I improve training to achieve a gradual decrease in loss, which might improve results on real-world images? Apart from improving training, is there anything else I can do to get better results?

92K iterations sounds like too many. In any case, the slight rotation error is not alarming and I think it is close enough.
Regarding the translation offset: if it works in Sim but not on real-world images, it might be due to the camera parameters not being set correctly. Why are there black pixels at the right end? Is the image from the inference camera being resized to get the same image dimensions?
I recommend trying the inference again without any changes to the image size from the inference camera, but with the focal length and image dimensions set correctly, using the Isaac 2020.2 release that went live yesterday. Some upgrades were made in the inference application to handle the inference camera being different from the training camera to some extent.