TAO DINO training: TensorBoard image visualization not working

Please provide the following information when requesting support.

• Hardware (T4)
• Network Type : dino
• Training spec file (if you have one, please share it here)

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
  checkpoint_interval: 1
  activation_checkpoint: True
  pretrained_model_path: /workspace/tao-experiments/dino/dino_model_epoch=003.pth
dataset:
  train_data_sources:
    - image_dir: /data/images/train/
      json_file: /data/train/annotations.json
  val_data_sources:
    - image_dir: /data/images/valid/
      json_file: /data/valid/annotations.json
  num_classes: 6
  batch_size: 8
  workers: 2
  augmentation:
    fixed_padding: True
model:
  backbone: fan_large
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
After running the training, I start TensorBoard as follows:

tensorboard --logdir_spec=exp01:<result directory> --host 0.0.0.0 --port 8080

With this I get the scalar graphs in TensorBoard, such as loss and validation metrics,
but I can't see any images with bounding boxes as they are passed to the model.

I saw that there is a setting that can be added to the spec file, so I added it as follows:

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
  checkpoint_interval: 1
  activation_checkpoint: True
  pretrained_model_path: /workspace/tao-experiments/dino/dino_model_epoch=003.pth
  visualizer{
    enabled: true
  }
dataset:
  train_data_sources:
    - image_dir: /data/images/train/
      json_file: /data/train/annotations.json
  val_data_sources:
    - image_dir: /data/images/valid/
      json_file: /data/valid/annotations.json
  num_classes: 6
  batch_size: 8
  workers: 2
  augmentation:
    fixed_padding: True
model:
  backbone: fan_large
  train_backbone: False
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048


This is the `visualizer` config. However, this option is not listed in the DINO training spec file documentation, and when I run with this configuration I also get the error below.

Error

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 105, in run
    cfg = self.compose_config(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 594, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/config_loader_impl.py", line 141, in load_configuration
    return self._load_configuration_impl(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/config_loader_impl.py", line 235, in _load_configuration_impl
    self._process_config_searchpath(config_name, parsed_overrides, caching_repo)
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/config_loader_impl.py", line 158, in _process_config_searchpath
    loaded = repo.load_config(config_path=config_name)
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/config_repository.py", line 349, in load_config
    ret = self.delegate.load_config(config_path=config_path)
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/config_repository.py", line 92, in load_config
    ret = source.load_config(config_path=config_path)
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/core_plugins/file_config_source.py", line 31, in load_config
    cfg = OmegaConf.load(f)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/omegaconf.py", line 192, in load
    obj = yaml.load(file_, Loader=get_yaml_loader())
  File "/usr/local/lib/python3.10/dist-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/usr/local/lib/python3.10/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/local/lib/python3.10/dist-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/usr/local/lib/python3.10/dist-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/usr/local/lib/python3.10/dist-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/usr/local/lib/python3.10/dist-packages/yaml/composer.py", line 133, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "/usr/local/lib/python3.10/dist-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/usr/local/lib/python3.10/dist-packages/yaml/composer.py", line 127, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
  File "/usr/local/lib/python3.10/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/local/lib/python3.10/dist-packages/yaml/parser.py", line 428, in parse_block_mapping_key
    if self.check_token(KeyToken):
  File "/usr/local/lib/python3.10/dist-packages/yaml/scanner.py", line 115, in check_token
    while self.need_more_tokens():
  File "/usr/local/lib/python3.10/dist-packages/yaml/scanner.py", line 152, in need_more_tokens
    self.stale_possible_simple_keys()
  File "/usr/local/lib/python3.10/dist-packages/yaml/scanner.py", line 291, in stale_possible_simple_keys
    raise ScannerError("while scanning a simple key", key.mark,
yaml.scanner.ScannerError: while scanning a simple key
  in "/specs/train.yaml", line 15, column 3
could not find expected ':'
  in "/specs/train.yaml", line 16, column 12
Execution status: FAIL
2024-07-08 06:23:50,828 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Can you please advise on how to get image visualization in TensorBoard?

The `visualizer` setting does not apply to DINO. Refer to https://docs.nvidia.com/tao/tao-toolkit/text/tensorboard_visualization.html.
Image visualization for DINO would be a feature request.
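Separately from the question of DINO support, the traceback above is a YAML syntax error rather than a rejected option. If the posted spec matches the file, line 15 of /specs/train.yaml is the `visualizer{` line, and a bare name followed by `{ ... }` looks like protobuf text-format syntax, which the YAML parser cannot read as a mapping key. The sketch below is a rough stdlib-only pre-check that flags such lines before training is launched. The function name `find_proto_style_blocks` and the regular expression are illustrative assumptions, not part of TAO, and fixing the syntax would not by itself make DINO accept the setting.

```python
import re

# Matches a line such as "visualizer{" or "visualizer {": a bare name followed
# by an opening brace and no colon. This is an assumption about what a
# protobuf-style block opener looks like, not a full YAML validator.
PROTO_BLOCK_OPENER = re.compile(r"^\s*[A-Za-z_][A-Za-z0-9_]*\s*\{\s*$")


def find_proto_style_blocks(spec_text):
    """Return (line_number, stripped_line) pairs that look like protobuf-style block openers."""
    hits = []
    for number, line in enumerate(spec_text.splitlines(), start=1):
        if PROTO_BLOCK_OPENER.match(line):
            hits.append((number, line.strip()))
    return hits


# Shortened excerpt of the spec from this thread.
spec = """\
train:
  num_epochs: 12
  visualizer{
    enabled: true
  }
"""

for number, line in find_proto_style_blocks(spec):
    print(f"line {number}: '{line}' is not a valid YAML mapping key")
```

Ordinary YAML keys such as `train:` or `optim:` contain a colon and are not flagged, so the check should stay quiet on the original spec without the `visualizer` block.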

Then is there any other way to visualize the images in TensorBoard for DINO training?

Currently the DINO code does not support it. You may try following the PyTorch tutorial "Visualizing Models, Data, and Training with TensorBoard" (PyTorch Tutorials 2.3.0+cu121 documentation).
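As a minimal sketch of that route, the following standalone script (not part of TAO) turns COCO-style annotations into the `[x_min, y_min, x_max, y_max]` boxes that `SummaryWriter.add_image_with_boxes` from `torch.utils.tensorboard` expects, and writes one annotated image to an event file. The helper names, the toy annotation dict, and the log directory are assumptions made for illustration. It assumes PyTorch, TensorBoard, and Pillow are installed in whatever environment runs it.

```python
def coco_bbox_to_xyxy(bbox):
    """Convert a COCO [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def boxes_for_image(coco, image_id):
    """Collect xyxy boxes and category-name labels for one image of a COCO-format dict."""
    names = {c["id"]: c["name"] for c in coco.get("categories", [])}
    boxes, labels = [], []
    for ann in coco["annotations"]:
        if ann["image_id"] == image_id:
            boxes.append(coco_bbox_to_xyxy(ann["bbox"]))
            labels.append(names.get(ann["category_id"], str(ann["category_id"])))
    return boxes, labels


def log_image_with_boxes(log_dir, tag, image_chw, boxes, labels, step=0):
    """Write one image with its boxes drawn on top to a TensorBoard event file."""
    # Imported lazily so the helpers above can be used without PyTorch installed.
    import torch
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir=log_dir)
    writer.add_image_with_boxes(
        tag,
        image_chw,  # tensor of shape (3, H, W)
        torch.tensor(boxes, dtype=torch.float32),  # shape (N, 4), xyxy in pixels
        global_step=step,
        labels=labels,
    )
    writer.close()


# Toy annotations standing in for a file such as /data/train/annotations.json.
toy_coco = {
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
        {"image_id": 2, "category_id": 1, "bbox": [0, 0, 5, 5]},
    ],
}
boxes, labels = boxes_for_image(toy_coco, image_id=1)
print(boxes, labels)
```

To get the picture into TensorBoard, load the matching image as a `(3, H, W)` tensor and call `log_image_with_boxes(log_dir, "train/sample", image, boxes, labels)`, where `log_dir` is a directory that the `tensorboard` command is pointed at. The result should then appear under the Images tab alongside the scalars that TAO already writes.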

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.