DINO: Error executing job with overrides

Robert_Hoang · May 3, 2024, 1:48am

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) RTX 3080ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Dino
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 5.3.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Training specs:

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
dataset:
  train_data_sources:
    - image_dir: /ws/mm_trainer/data/pgie/train/images
      json_file: /ws/mm_trainer/data/pgie/train/train.json
  val_data_sources:
    - image_dir: /ws/mm_trainer/data/pgie/valid/images
      json_file: /ws/mm_trainer/data/pgie/valid/valid.json
  num_classes: 6
  batch_size: 4
  workers: 1
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: False
  pretrained_backbone_path: /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

Reproduce

docker run --runtime=nvidia -it --ipc=host -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt /bin/bash
dino train -e /ws/tao_trainer/dino/train_total.yaml results_dir=/ws/tao_trainer/dino/training_models -k detection

Logs

sys:1: UserWarning: 
'train_total.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'train_total.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /ws/tao_trainer/dino/training_models/train
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Loaded pretrained weights from /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:240: UserWarning: Log file already exists at /ws/tao_trainer/dino/training_models/train/status.json
  rank_zero_warn(
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /ws/tao_trainer/dino/training_models/train/lightning_logs
Serializing 95898 elements to byte tensors and concatenating them all ...
Serialized dataset takes 23.86 MiB
Serializing 13820 elements to byte tensors and concatenating them all ...
Serialized dataset takes 3.40 MiB
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /ws/tao_trainer/dino/training_models/train exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type             | Params
----------------------------------------------------
0 | model          | DINOModel        | 48.1 M
1 | matcher        | HungarianMatcher | 0     
2 | criterion      | SetCriterion     | 0     
3 | box_processors | PostProcess      | 0     
----------------------------------------------------
19.7 M    Trainable params
28.4 M    Non-trainable params
48.1 M    Total params
96.206    Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%|                                                                                                                                                   | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:459: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Sanity Checking DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.16it/s]
 Validation mAP : 0.0


 Validation mAP50 : 0.0

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Training: 0it [00:00, ?it/s]Starting Training Loop.
Epoch 0:   0%|                                                                                                                                                                    | 0/27429 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py:199: UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
  t = torch.range(0, len(targets[i]['labels']) - 1).long().cuda()
Epoch 0:  62%|███████████████████████████████████████████████████████████████████████████████▉                                                | 17132/27429 [1:58:45<1:11:22,  2.40it/s, loss=37.8, v_num=0]
Error executing job with overrides: ['encryption_key=threat_detection', 'results_dir=/ws/tao_trainer/dino/training_models']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 222, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 204, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 188, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt or None)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
    closure_result = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 203, in training_step
    loss_dict = self.criterion(outputs, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py", line 173, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py", line 80, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:  62%|███████████████████████████████████████████████████████████████████████████████▉                                                | 17132/27429 [1:58:45<1:11:22,  2.40it/s, loss=37.8, v_num=0]
Execution status: FAIL

Morganh · May 3, 2024, 3:12am

Robert_Hoang:

  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py", line 80, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

The boxes should be in [x0, y0, x1, y1] format. See tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py at main · NVIDIA/tao_pytorch_backend · GitHub.
Please check your label JSON file.
According to Cannot run Dino with tao-5.3.0 - #3 by Robert_Hoang, your boxes are in [x1, y1, x0, y0] format.
You can also compare against the COCO JSON file used in tao_tutorials/notebooks/tao_launcher_starter_kit/dino/dino.ipynb at main · NVIDIA/tao_tutorials · GitHub.
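
As a rough illustration of what the failing assertion tests, here is a minimal pure-Python sketch. The helper name `corners_are_ordered` is hypothetical and is not part of TAO; the real check in box_ops.py runs on tensors, but the per-box condition is the same comparison of the second corner against the first.

```python
def corners_are_ordered(box):
    """Return True if an [x0, y0, x1, y1] box has x1 >= x0 and y1 >= y0."""
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0


# Top-left corner first, bottom-right corner second: the check passes.
print(corners_are_ordered([10.0, 20.0, 50.0, 80.0]))

# Corners swapped, i.e. [x1, y1, x0, y0]: the check fails.
print(corners_are_ordered([50.0, 80.0, 10.0, 20.0]))
```

A single box that fails this condition anywhere in a batch is enough to trigger the AssertionError, because the original code calls `.all()` over every row.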

Robert_Hoang · May 3, 2024, 3:14am

Thank you, I will check it.

Robert_Hoang · May 3, 2024, 4:07am

Hi @Morganh
My box format is [x0, y0, box_w, box_h].

According to the documentation, the dataset should be in COCO format, so an annotation should look like this:
Annotation format

"annotations": [{"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],

Could you please double-check?
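
One way to rule out the label file is to scan it for boxes that are not valid COCO [x, y, width, height] entries. The sketch below is only an assumption about what is worth checking (four values, finite numbers, positive width and height, non-negative origin); the function names and the broken sample entry are hypothetical, and TAO itself may apply different rules.

```python
import json
import math


def find_invalid_coco_boxes(coco):
    """Return (annotation id, bbox, reason) for each suspicious COCO bbox.

    `coco` is a dict loaded from a COCO JSON file; each bbox is [x, y, w, h].
    """
    problems = []
    for ann in coco.get("annotations", []):
        bbox = ann.get("bbox", [])
        ann_id = ann.get("id")
        if len(bbox) != 4:
            problems.append((ann_id, bbox, "expected 4 values"))
        elif not all(math.isfinite(v) for v in bbox):
            problems.append((ann_id, bbox, "non-finite value"))
        elif bbox[2] <= 0 or bbox[3] <= 0:
            problems.append((ann_id, bbox, "non-positive width or height"))
        elif bbox[0] < 0 or bbox[1] < 0:
            problems.append((ann_id, bbox, "negative top-left corner"))
    return problems


def check_file(path):
    """Load a COCO JSON file from `path` and run the same checks on it."""
    with open(path) as f:
        return find_invalid_coco_boxes(json.load(f))


# The annotation quoted in the post, plus one deliberately broken entry
# (hypothetical, added only to show what a report looks like).
sample = {
    "annotations": [
        {"id": 1768, "image_id": 289343, "category_id": 18,
         "bbox": [473.07, 395.93, 38.65, 28.67]},
        {"id": 1, "image_id": 289343, "category_id": 18,
         "bbox": [100.0, 50.0, -5.0, 20.0]},
    ]
}
for problem in find_invalid_coco_boxes(sample):
    print(problem)
```

Calling `check_file` on the train.json and valid.json paths from the training spec would list any annotation worth inspecting by hand.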

Robert_Hoang · May 3, 2024, 4:24am

 File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

The ground-truth boxes are in [x0, y0, box_w, box_h] format, so why does the call above use:

box_cxcywh_to_xyxy(tgt_bbox)

Is it a bug?
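
A possible explanation, assuming TAO's DINO follows the DETR convention (the thread does not confirm this): the data pipeline converts each COCO [x, y, w, h] box into a normalized [cx, cy, w, h] box before it reaches the matcher, so applying box_cxcywh_to_xyxy to tgt_bbox would be expected rather than a bug. The sketch below walks one box through that chain. Both function names and the 640x480 image size are hypothetical.

```python
def coco_xywh_to_normalized_cxcywh(bbox, img_w, img_h):
    """Convert a COCO [x, y, w, h] box in pixels to normalized [cx, cy, w, h]."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]


def cxcywh_to_xyxy(box):
    """Convert [cx, cy, w, h] to corner format [x0, y0, x1, y1]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


# The bbox from the annotation quoted above; the image size is an assumption.
bbox = [473.07, 395.93, 38.65, 28.67]
img_w, img_h = 640, 480

target = coco_xywh_to_normalized_cxcywh(bbox, img_w, img_h)
corners = cxcywh_to_xyxy(target)
print("normalized cxcywh:", target)
print("normalized xyxy:  ", corners)
```

Under this assumption a COCO box with positive width and height always yields ordered corners, so a valid label file would not be able to trip the assertion through tgt_bbox.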

Morganh · May 3, 2024, 4:29am

Could you double-check with the default notebook? Run it to see whether it works.

Robert_Hoang · May 3, 2024, 4:35am

Yes, I will do that and let you know.

Robert_Hoang · May 3, 2024, 3:30pm

Training on a small subset of my custom dataset works.
However, I still get the same error on my whole dataset.
I don't understand why epochs 0 and 1 train successfully but epoch 2 hits the error.

Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27429/27429 [2:57:57<00:00,  2.57it/s, loss=18.8, v_num=0]
 Validation mAP : 0.27218357507410024███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3455/3455 [11:56<00:00,  4.82it/s]


 Validation mAP50 : 0.5874690160026202

Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27429/27429 [2:58:07<00:00,  2.57it/s, loss=18.8, v_num=0, val_loss=11.70Train and Val metrics generated.                                                                                                                                                                             
Epoch 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 27429/27429 [2:58:08<00:00,  2.57it/s, loss=18.8, v_num=0, val_loss=11.70, train_loss=23.40]Training loop in progress
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 27429/27429 [2:58:24<00:00,  2.56it/s, loss=14.1, v_num=0, val_loss=11.70, train_loss=23.40]
 Validation mAP : 0.44921385974558714███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3455/3455 [11:58<00:00,  4.81it/s]


 Validation mAP50 : 0.7339936641902469

Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 27429/27429 [2:58:33<00:00,  2.56it/s, loss=14.1, v_num=0, val_loss=10.20, train_loss=23.40Train and Val metrics generated.                                                                                                                                                                             
Epoch 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 27429/27429 [2:58:34<00:00,  2.56it/s, loss=14.1, v_num=0, val_loss=10.20, train_loss=16.10]Training loop in progress
Epoch 2:  80%|██████████████████████████████████████████████████████████████████████████████▏                   | 21887/27429 [2:31:54<38:27,  2.40it/s, loss=16, v_num=0, val_loss=10.20, train_loss=16.10]
Error executing job with overrides: ['encryption_key=detection', 'results_dir=/ws/tao_trainer/dino/training_models']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 222, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 204, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 188, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt or None)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
    closure_result = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 203, in training_step
    loss_dict = self.criterion(outputs, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py", line 173, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py", line 80, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 2:  80%|██████████████████████████████████████████████████████████████████████████████▏                   | 21887/27429 [2:31:55<38:28,  2.40it/s, loss=16, v_num=0, val_loss=10.20, train_loss=16.10]
Execution status: FAIL

Robert_Hoang · May 3, 2024, 3:37pm

Status:

{"date": "5/3/2024", "time": "5:49:59", "status": "STARTED", "verbosity": "INFO", "message": "Starting Training Loop."}
{"date": "5/3/2024", "time": "8:48:7", "status": "RUNNING", "verbosity": "INFO", "message": "Train and Val metrics generated.", "kpi": {"val_mAP": "0.27218357507410024", "val_mAP50": "0.5874690160026202", "val_loss": 11.73005093900236, "train_loss": 23.377797393730408}}
{"epoch": 1, "max_epoch": 12, "time_per_epoch": "2:58:08.381009", "eta": "1 day, 8:39:32.191095", "date": "5/3/2024", "time": "8:48:7", "status": "RUNNING", "verbosity": "INFO", "message": "Training loop in progress", "kpi": {"val_mAP": "0.27218357507410024", "val_mAP50": "0.5874690160026202", "val_loss": 11.73005093900236, "train_loss": 23.377797393730408}}
{"date": "5/3/2024", "time": "11:46:42", "status": "RUNNING", "verbosity": "INFO", "message": "Train and Val metrics generated.", "kpi": {"val_mAP": "0.44921385974558714", "val_mAP50": "0.7339936641902469", "val_loss": 10.163458914901689, "train_loss": 16.141359177027173}}
{"epoch": 2, "max_epoch": 12, "time_per_epoch": "2:58:34.603788", "eta": "1 day, 5:45:46.037884", "date": "5/3/2024", "time": "11:46:42", "status": "RUNNING", "verbosity": "INFO", "message": "Training loop in progress", "kpi": {"val_mAP": "0.44921385974558714", "val_mAP50": "0.7339936641902469", "val_loss": 10.163458914901689, "train_loss": 16.141359177027173}}
{"date": "5/3/2024", "time": "14:18:38", "status": "FAILURE", "verbosity": "INFO", "kpi": {"val_mAP": "0.44921385974558714", "val_mAP50": "0.7339936641902469", "val_loss": 10.163458914901689, "train_loss": 16.141359177027173}}

Robert_Hoang · May 3, 2024, 4:04pm

Maybe the bug is in TAO DINO itself. Based on the logs, training fails because of out_bbox (passed as boxes1) at:

assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
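
One scenario that would match a failure on the predicted boxes, offered only as a hypothesis since the thread never confirms it: if any predicted coordinate becomes NaN, every comparison against it is false, so the check fails regardless of the label format. The pure-Python sketch below mirrors the condition row by row; `passes_corner_check` is a hypothetical helper, not TAO code.

```python
import math


def passes_corner_check(boxes):
    """Mirror the assertion: every [x0, y0, x1, y1] row needs x1 >= x0 and y1 >= y0."""
    return all(x1 >= x0 and y1 >= y0 for x0, y0, x1, y1 in boxes)


good = [[0.1, 0.1, 0.4, 0.5]]
with_nan = [[0.1, 0.1, 0.4, 0.5], [math.nan, math.nan, math.nan, math.nan]]

print(passes_corner_check(good))      # True
print(passes_corner_check(with_nan))  # False, because NaN >= NaN is False
```

Printing the offending boxes, as suggested in the next reply, would show whether the failing rows contain NaN or merely swapped corners.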

Morganh · May 3, 2024, 4:19pm

Did you see the error when you ran the default notebook?
If not, please debug your own dataset as follows.

Inside the Docker container:
$ mv /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py.bak
$ vim /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py (copy the content from tao_pytorch_backend/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py at v5.2.0_github · NVIDIA/tao_pytorch_backend · GitHub)

You can then print the bbox values at line 80, for example.
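
A minimal sketch of the kind of diagnostic that could be placed just before the assertion in the editable copy of box_ops.py. The helper `describe_bad_boxes` is hypothetical and works on plain lists; with a tensor you would first convert it, for example with `boxes1.tolist()`.

```python
def describe_bad_boxes(boxes):
    """Return one message per [x0, y0, x1, y1] row that would fail the check."""
    messages = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        # `not (a >= b)` is also True when either value is NaN,
        # so NaN rows are reported as well as rows with swapped corners.
        if not (x1 >= x0 and y1 >= y0):
            messages.append(f"row {i}: {[x0, y0, x1, y1]}")
    return messages


# Hypothetical batch: the second row has its corners swapped.
batch = [[0, 0, 1, 1], [1, 1, 0, 0]]
for message in describe_bad_boxes(batch):
    print(message)
```

Reporting the row index alongside the values makes it possible to tell a single bad prediction from a whole batch that has gone wrong.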

yingliu · May 28, 2024, 7:52am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

system · June 11, 2024, 7:52am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
