Centerpose_synthetic quickstart notebook error in sample dataset: KeyError: 'plane_center'

Please provide the following information when requesting support.

  • Hardware: GeForce RTX 4090 Laptop GPU
  • Software: Ubuntu 22.04
  • Network Type: centerpose_fan from centerpose_synth quickstart notebook, no changes made
  • TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): TLT is not installed locally for this notebook; it uses Isaac Sim 4.0.0 and the nvidia/tao/tao-toolkit:5.5.0 dataset/deploy/model containers
  • Training spec file: default spec file from notebook train_synthetic.yaml:
results_dir: /results

dataset:
  train_data: /data/results/images
  val_data: /data/results/images
  num_classes: 1
  batch_size: 4
  workers: 8
  category: "pallet"
  num_symmetry: 1
  max_objs: 10

train:
  num_gpus: 1
  validation_interval: 20
  checkpoint_interval: ${train.validation_interval}
  num_epochs: 40
  clip_grad_val: 100.0
  seed: 317
  pretrained_model_path: /results/pretrained_models/centerpose_vtrainable_fan_small/centerpose_trainable_FAN_small.pth
  precision: "fp32"

  optim:
    lr: 6e-05
    lr_steps: [90, 120]

model:
  down_ratio: 4
  use_pretrained: False
  backbone:
    model_type: fan_small
    pretrained_backbone_path: /results/pretrained_models/centerpose_vtrainable_fan_small/centerpose_trainable_FAN_small.pth
  • Tao Mounts file:
{
    "Mounts": [
        {
            "source": "/home/mb/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/mb/tao-experiments/data/centerpose",
            "destination": "/data"
        },
        {
            "source": "/home/mb/tao_tutorials/notebooks/tao_launcher_starter_kit/centerpose/specs",
            "destination": "/specs"
        },
        {
            "source": "/home/mb/tao-experiments/centerpose/results",
            "destination": "/results"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "network": "host"
    }
}

• How to reproduce the issue ?

execute the latest centerpose_synthetic notebook and specifically, the step

print("For multi-GPU, change train.num_gpus in train.yaml based on your machine.")
# If you face an out-of-memory issue, you may reduce the batch size in the spec file by passing dataset.batch_size=2
!tao model centerpose train \
          -e $SPECS_DIR/train_synthetic.yaml \
          results_dir=$RESULTS_DIR/

will throw the error with the default dataset:

Error executing job with overrides: ['results_dir=/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/scripts/train.py", line 84, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/scripts/train.py", line 70, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1059, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/model/pl_centerpose_model.py", line 136, in validation_step
    self.val_cp_evaluator.evaluate(final_output, batch)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/utils/centerpose_evaluator.py", line 279, in evaluate
    center = np.asarray(anns['AR_data']['plane_center'])
KeyError: 'plane_center'

What should be modified to be able to run the default notebook successfully?

Best regards

Please refer to tao_tutorials/notebooks/tao_launcher_starter_kit/centerpose/centerpose_synthetic.ipynb at main · NVIDIA/tao_tutorials · GitHub.
There is an ar_data_converter step; it will set up the data.

Thanks for the info. I actually called ar_data_converter when executing the notebook. I just tried calling the post-processing cells again and then executing

print("For multi-GPU, change train.num_gpus in train.yaml based on your machine.")
# If you face an out-of-memory issue, you may reduce the batch size in the spec file by passing dataset.batch_size=2
!tao model centerpose train \
          -e $SPECS_DIR/train_synthetic.yaml \
          results_dir=$RESULTS_DIR/

It still throws the same error. I’m not sure why, as I re-executed all the necessary post-processing steps that are supposed to format the training data to include the right fields. Is it possible that some changes do not make it through due to caching? I made sure the changes actually happen to the camera_0.json files, and they are indeed modified.

Looking into the files, the plane_center field is not set and AR_data is empty:

For example: 100_default_camera_0.json looks like this after calling all post processing steps:

{
    "AR_data": {},
    "camera_data": {
        "camera_projection_matrix": [
            [
                2.036992693252299,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                1.3579951288348657,
                0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                -1.000002000002,
                -0.20000020000020002
            ],
            [
                0.0,
                0.0,
                -1.0,
                0.0
            ]
        ],
        "camera_view_matrix": [
            [
                1.0,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                1.0,
                0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                1.0,
                0.0
            ],
            [
                0.0,
                0.0,
                0.0,
                1.0
            ]
        ],
        "height": 720,
        "intrinsics": {
            "cx": 540.0,
            "cy": 360.0,
            "fx": 733.3173695708275,
            "fy": 733.3173695708276
        },
        "width": 1080
    },
    "objects": []
}
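Before re-running training, a quick scan of the converted annotation files shows how widespread the problem is. The following is a small sketch (not part of the notebook); the directory path and the `*camera_0.json` glob pattern are assumptions based on the dataset paths and file names above:

```python
import glob
import json
import os

# Assumed location of the converted annotations; adjust to your mounts.
ANN_DIR = "/data/results/images"

def count_missing_plane_center(ann_dir):
    """Return the annotation file names whose AR_data lacks 'plane_center'."""
    missing = []
    for path in sorted(glob.glob(os.path.join(ann_dir, "*camera_0.json"))):
        with open(path) as f:
            ann = json.load(f)
        if "plane_center" not in ann.get("AR_data", {}):
            missing.append(os.path.basename(path))
    return missing

if __name__ == "__main__":
    bad = count_missing_plane_center(ANN_DIR)
    print(f"{len(bad)} annotation files are missing AR_data.plane_center")
```

If this reports every file, the problem is in data generation rather than in a single corrupt sample.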

Why is that? Sorry if I overlooked something; this is all new to me and quite overwhelming.

I suggest you open a terminal and debug inside the docker container instead of using the notebook.
Steps:

  1. $ docker pull nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
  2. $ docker run --runtime=nvidia -it nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash
    Inside the docker container, run the training directly:
    # centerpose train xxx

You can modify /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/utils/centerpose_evaluator.py to add debug code that inspects the annotation file. See tao_pytorch_backend/nvidia_tao_pytorch/cv/centerpose/utils/centerpose_evaluator.py at dc07b02eb78c2eb868315107892b466496e55a0f · NVIDIA/tao_pytorch_backend · GitHub.
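For the debug code itself, one defensive sketch is to guard the lookup instead of indexing directly; the `anns` structure follows the traceback above, but the surrounding loop in the real evaluate() may differ:

```python
import numpy as np

def safe_plane_center(anns):
    """Return the plane center as an array, or None if the field is absent.

    Mirrors the failing line in centerpose_evaluator.py; 'anns' is an
    annotation dict with an 'AR_data' entry, as shown in the traceback.
    """
    ar_data = anns.get("AR_data", {})
    if "plane_center" not in ar_data:
        print(f"Annotation has no plane_center, AR_data={ar_data!r}")
        return None
    return np.asarray(ar_data["plane_center"])
```

Skipping (or logging) such annotations only hides the symptom, of course; the underlying fix is to make sure the generated data contains the field.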


Hello Morganh,

thanks for your fast response, it is greatly appreciated. I will try running the docker containers manually and report back if I can identify the issue.

Cheers

Hey Morgan,

the problem seems to be the synthetic pallet images generated in Isaac Sim. They are empty and only show the background plane, so the data fields for the plane center etc. do not exist:

Why is that? Do the RMW errors that occur when running the Isaac Sim image have something to do with it?

[12.717s] [ext: omni.isaac.sim-4.0.0] startup
[12.777s] [ext: omni.isaac.ros2_bridge-2.26.4] startup
[12.793s] Using backup internal ROS2 humble distro
Checking to see if RMW can be loaded:
failed to get symbol 'rmw_init_options_init' due to Environment variable 'AMENT_PREFIX_PATH' is not set or empty, at /workspace/humble_ws/src/rmw_implementation/src/functions.cpp:171, at /workspace/humble_ws/src/rcl/rcl/src/rcl/init_options.c:75
RMW was not loaded

[12.800s] To use the internal libraries included with the extension please set the following environment variables to use with FastDDS (default) or CycloneDDS (ROS2 Humble only): 
RMW_IMPLEMENTATION=rmw_fastrtps_cpp
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/isaac-sim/exts/omni.isaac.ros2_bridge/humble/lib

OR 

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/isaac-sim/exts/omni.isaac.ros2_bridge/humble/lib
Before starting Isaac Sim
[12.834s] [ext: omni.isaac.ros2_bridge-2.26.4] shutdown
[13.715s] [ext: omni.kit.registry.nucleus-0.0.0] startup
syncing registry: 'https://ovextensionsprod.blob.core.windows.net/exts/kit/prod/shared/v2' Downloading 351 files...
syncing registry: 'https://ovextensionsprod.blob.core.windows.net/exts/kit/prod/sdk/106.0/91f0101f/v2' Downloading 272 files...
[54.241s] app ready

I additionally tried removing the user configuration for docker in tao_mounts to avoid any permission problems, as explained in the troubleshooting guide, but this didn’t do the trick.

Best regards

It is related to the Isaac Sim data generation. Please refer to section “2.2 Launch the synthetic data generation” of the notebook and confirm further.
You can try modifying the config file to generate different scenes with various objects, backgrounds, and numbers of target objects.
Then, run “2.3 Visualize the generated data”.

I made sure to execute all the necessary steps; the config file for synthetic data generation in Isaac Sim is as follows:

config_file="""
omni.replicator.object:
  version: 0.2.16
  num_frames: 20
  seed: 100
  inter_frame_time: 1
  gravity: 10000
  position_H:
    harmonizer_type: mutable_attribute
    mutable_attribute:
      distribution_type: range
      start:
      - -94.77713317047056
      - 0
      - -35.661244451558446
      end:
      - -94.77713317047056
      - 0
      - -35.661244451558446
  screen_height: 720
  focal_length: 14.228393962367306
  output_path: /tmpsrc/results
  horizontal_aperture: 20.955
  screen_width: 1080
  camera_parameters:
    far_clip: 100000
    focal_length: $(focal_length)
    horizontal_aperture: $(horizontal_aperture)
    near_clip: 0.1
    screen_height: $(screen_height)
    screen_width: $(screen_width)
  default_camera:
    count: 1
    camera_parameters: $(camera_parameters)
    transform_operators:
    - translate_global:
        distribution_type: harmonized
        harmonizer_name: position_H
    - rotateY: $[seed]*20
    - rotateX:
        distribution_type: range
        start: -15
        end: -25
    - translate:
        distribution_type: range
        start:
        - -40
        - -30
        - 400
        end:
        - 40
        - 30
        - 550
    type: camera
  distant_light:
    color:
      distribution_type: range
      end:
      - 1.3
      - 1.3
      - 1.3
      start:
      - 0.7
      - 0.7
      - 0.7
    count: 5
    intensity:
      distribution_type: range
      end: 600
      start: 150
    subtype: distant
    transform_operators:
    - rotateY:
        distribution_type: range
        end: 180
        start: -180
    - rotateX:
        distribution_type: range
        end: -10
        start: -40
    type: light
  dome_light:
    type: light
    subtype: dome
    color:
      distribution_type: range
      start:
      - 0.7
      - 0.7
      - 0.7
      end:
      - 1.3
      - 1.3
      - 1.3
    intensity:
      distribution_type: range
      start: 1000
      end: 3000
    transform_operators:
    - rotateX: 270
  plane:
    physics: collision
    type: geometry
    subtype: plane
    tracked: false
    transform_operators:
    - scale:
      - 5
      - 5
      - 5
  rotY_H:
    harmonizer_type: mutable_attribute
    mutable_attribute:
      distribution_type: range
      start: 0
      end: 0
  translate_H:
    harmonizer_type: mutable_attribute
    mutable_attribute:
      distribution_type: range
      start:
      - 0
      - 60
      - 0
      end:
      - 0
      - 30
      - 0

  pallet:
    count: 2
    physics: rigidbody
    type: geometry
    subtype: mesh
    tracked: true
    transform_operators:
    - translate_global:
        distribution_type: harmonized
        harmonizer_name: position_H
    - translate:
      - 120 * ($[index]%2)
      - 10 * ($[index]-1) * ($[index])
      - 0
    - rotateXYZ:
      - -90
      - 0
      - 0
    - scale:
      - 1
      - 1
      - 1
    usd_path:
      distribution_type: set
      values: 
      - omniverse://content.ov.nvidia.com/NVIDIA/Assets/DigitalTwin/Assets/Warehouse/Shipping/Pallets/Wood/Block_A/BlockPallet_A08_PR_NVD_01.usd
  box:
    count: 2
    physics: rigidbody
    type: geometry
    subtype: mesh
    tracked: false
    transform_operators:
    - translate_global:
        distribution_type: harmonized
        harmonizer_name: position_H
    - translate_pallet:
        distribution_type: harmonized
        harmonizer_name: translate_H
    - rotateY:
        distribution_type: harmonized
        harmonizer_name: rotY_H
    - translate:
      - 120 * ($[index])
      - 20
      - 0
    - rotateXYZ:
      - 0
      - -90
      - -90
    - scale:
      - 12
      - 10
      - 6
    usd_path:
      distribution_type: set
      values:
      - omniverse://content.ov.nvidia.com/NVIDIA/Assets/DigitalTwin/Assets/Warehouse/Shipping/Cardboard_Boxes/White_A/WhiteCorrugatedBox_A01_10x10x10cm_PR_NVD_01.usd
      - omniverse://content.ov.nvidia.com/NVIDIA/Assets/DigitalTwin/Assets/Warehouse/Shipping/Cardboard_Boxes/Cube_A/CubeBox_A01_10cm_PR_NVD_01.usd
  warehouse:
    type: geometry
    subtype: mesh
    usd_path: omniverse://content.ov.nvidia.com/NVIDIA/Assets/Isaac/2023.1.1/Isaac/Environments/Simple_Warehouse/warehouse_with_forklifts.usd
    transform_operators:
    - translate:
      - -200
      - 0.1
      - 0
    - rotateXYZ:
      - 0
      - -90
      - -90
    - scale:
      - 100
      - 100
      - 100

  output_switches:
    images: True
    labels: True
    descriptions: False
    3d_labels: True
    segmentation: False
"""

The forklifts and pallets do not appear in the renders, though, even though, according to the logs, the asset files for the renders are downloaded correctly.

Hello,

I found the issue. The cloud server omniverse://content.ov.nvidia.com/ is not accessible, and oddly the failed asset downloads produce no errors. Therefore, I had to use my local Nucleus instance to load the assets correctly. I replaced every instance of the server in the USD paths of the Isaac Sim config file with “localhost” to point at my local server setup.

Best regards

Thanks for the info.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.