Poor/unknown quality of Pose Estimation Model

I’ve followed the steps outlined in the article on 3D Object Pose Estimation with Pose CNN Decoder.

While not without hiccups, I managed to train both the object detection model and the 3D Object Pose Estimation CNN Decoder model.
Both models are trained on data from Isaac Unity. The object detection model architecture is ResNet10 and it performs reasonably well, see the screenshots.

The problem starts when I run the pose estimation pipeline. The decoder output looks normal in some instances (not all of them), but:
a) The documentation doesn’t clearly say how to specify the 3D bounding box size at zero orientation and the transformation from the object center to the bounding box center. I made a few attempts to guess the correct values and entered them in the corresponding Isaac Sight widget, but the resulting 3D bounding boxes still didn’t make any sense.
b) I’m under the impression that the 3D bounding box visualizations and the decoder output lag behind, meaning that inference is not performed on the current camera image received from the simulation.
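
To make point a) concrete, this is how I currently interpret the two parameters. It is only a minimal sketch in plain Python, not anything taken from the Isaac SDK: the function name `box_corners`, the example numbers, and the assumption that the center offset is a pure translation expressed in the object frame are all mine.

```python
from itertools import product


def box_corners(size, center_offset, rotation, translation):
    """Return the 8 corners of a 3D bounding box in the camera frame.

    size          -- (sx, sy, sz) box extents at zero orientation
    center_offset -- (ox, oy, oz) from the object origin to the box center,
                     in the object frame (ASSUMED to be a pure translation)
    rotation      -- 3x3 rotation matrix (list of rows), object -> camera
    translation   -- (tx, ty, tz) object origin in the camera frame
    """
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner in the object frame: box center plus/minus half the extents.
        local = [center_offset[i] + signs[i] * size[i] for i in range(3)]
        # Rigid transform into the camera frame: R * local + t.
        in_camera = tuple(
            sum(rotation[r][c] * local[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        corners.append(in_camera)
    return corners


# Hypothetical example values: a 0.2 x 0.4 x 0.6 box whose center sits
# 0.3 above the object origin, with the object 2.0 in front of the camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
corners = box_corners((0.2, 0.4, 0.6), (0.0, 0.0, 0.3), identity, (0.0, 0.0, 2.0))
```

If this interpretation is wrong (for example, if the offset is expected in a different frame or the size uses half extents), that alone could explain boxes that look shifted or scaled.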
Let me demonstrate the above problems with screenshots.

These two screenshots are examples of what looks like a lag in inference/visualization.

And in this image the decoder input seems to make sense, but the 3D bounding boxes are completely out of place.
I cannot attach my .json config for the 3D pose estimation app, so I put it on pastebin.

Is there any other information I can provide? I can share both models if needed. The 3D pose estimation model was trained for 25000 steps.