Train.yaml doesn't exist!

Hi

"$ tao model dino run /bin/bash” this command was run.

But now I have a different problem. I ran this command: "dino train -e /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/specs/train.yaml results_dir=/home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/retail_object_detection/results". And the file paths are correct.

The result is as below:

Then I tried it in Jupyter.

Please check the settings inside the tao_mounts.json file.
The path of train.yaml should be a path inside the docker.

An easy way to check is to open a terminal.
$ tao model dino run /bin/bash
Then find the yaml file.
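
For example, you can verify from inside the container that the spec file and the .pth file are visible at the "destination" paths defined in tao_mounts.json. A sketch (the paths below are only placeholders for whatever your mounts map to):

$ tao model dino run /bin/bash
# inside the container, list the files at the destination paths from tao_mounts.json
ls /workspace/tao-experiments/specs/train.yaml
ls /workspace/tao-experiments/models/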

My tao_mounts.json in Jupyter is like this:

But when I work locally:

When I run this:

print("For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.")
print("For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.")
# If you face out of memory issue, you may reduce the batch size in the spec file by passing dataset.batch_size=2
!tao model dino train \
          -e $SPECS_DIR/train.yaml \
          results_dir=$RESULTS_DIR/

I get this:

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2024-05-09 10:18:05,853 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-05-09 10:18:05,951 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt
2024-05-09 10:18:05,966 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
sys:1: UserWarning: 
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /retail_object_detection/results/train
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Error executing job with overrides: ['results_dir=/retail_object_detection/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 222, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 204, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 66, in run_experiment
    checkpoint = load_pretrained_weights(pretrained_path)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/misc.py", line 81, in load_pretrained_weights
    temp = torch.load(pretrained_path,
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 996, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/retail_object_detection/models/retail_object_detection_vtrainable_binary_v2.1.2/retail_object_detection_binary_v2.1.2.pth'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL

What's next?
  Try Docker Debug for seamless, persistent debugging tools in any container or image → docker debug e2b0396e8deb579a57dadb7a5ee2821c5898145bdd462984f00ece7216b170a6
  Learn more at https://docs.docker.com/go/debug-cli/
2024-05-09 10:18:28,524 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

But my .pth file's path is:

And my train.yaml file is:

train:
  freeze: ['backbone', 'transformer.encoder']
  pretrained_model_path: /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/retail_object_detection/models/retail_object_detection_vtrainable_binary_v2.1.2/retail_object_detection_binary_v2.1.2.pth
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 1e-6
    lr: 1e-5
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/data/train
      json_file: /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/data/annotations/train.json
  val_data_sources:
    - image_dir: /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/data/test
      json_file: /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/data/annotations/test.json
  num_classes: 2
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

The path inside the docker is defined in “destination”.
I suggest you set the same “source” and “destination” to make your work easier.
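
For illustration, here is a minimal sketch of ~/.tao_mounts.json (the launcher's default location) that maps your retail_object_detection directory to the same path inside the container. This is only an example; merge it with whatever Envs or DockerOptions your existing file already contains:

cat <<'EOF' > ~/.tao_mounts.json
{
    "Mounts": [
        {
            "source": "/home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection",
            "destination": "/home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection"
        }
    ]
}
EOF

With a mapping like this, the host paths already used in train.yaml (pretrained_model_path and the dataset paths) resolve to the same locations inside the container.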

I’m currently getting this error:

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/retail_object_detection/results/train exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type             | Params
----------------------------------------------------
0 | model          | DINOModel        | 48.3 M
1 | matcher        | HungarianMatcher | 0     
2 | criterion      | SetCriterion     | 0     
3 | box_processors | PostProcess      | 0     
----------------------------------------------------
12.2 M    Trainable params
36.1 M    Non-trainable params
48.3 M    Total params
193.014   Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|                       | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:459: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Error executing job with overrides: ['results_dir=/home/doruk/doruk/getting_started_v5.3.0/notebooks/tao_launcher_starter_kit/retail_object_detection/retail_object_detection/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 256, in validation_step
    loss_dict = self.criterion(outputs, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py", line 173, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 86, in forward
    cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)
  File "/usr/local/lib/python3.10/dist-packages/torch/functional.py", line 1330, in cdist
    return _VF.cdist(x1, x2, p, None)  # type: ignore[attr-defined]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 222, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 204, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 188, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt or None)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
    trainer._teardown()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1161, in _teardown
    self.strategy.teardown()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 496, in teardown
    self.lightning_module.cpu()
  File "/usr/local/lib/python3.10/dist-packages/lightning_lite/utilities/device_dtype_mixin.py", line 78, in cpu
    return super().cpu()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 959, in cpu
    return self._apply(lambda t: t.cpu())
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 801, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 801, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 801, in _apply
    module._apply(fn)
  [Previous line repeated 5 more times]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 824, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 959, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [33,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [37,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [41,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [45,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [49,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [53,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [57,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [6,0,0], thread: [61,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [65,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [69,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [73,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [77,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [81,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [85,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [89,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [8,0,0], thread: [93,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [5,0,0], thread: [97,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

The error continues like this.

Please refer to DINO training gives error about insufficient shared memory (shm) - #16 by Morganh
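
Also, to pin down the device-side assert, you can follow the hint in the log and rerun the same command inside the container with synchronous CUDA launches to get an exact stack trace. A sketch (the paths below are placeholders for your in-container paths):

$ tao model dino run /bin/bash
# inside the container, rerun training with blocking CUDA launches and full Hydra errors
CUDA_LAUNCH_BLOCKING=1 HYDRA_FULL_ERROR=1 dino train \
    -e /your/path/inside/container/specs/train.yaml \
    results_dir=/your/path/inside/container/results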

Is the loss value normal?

You can monitor whether the loss decreases.

Also, you can use a pretrained model that was trained with the DINO network.

First, I tried ‘retail_object_detection_binary_v2.1.2.pth’, but the loss started at almost 80 and was still around 20-25 after 20 epochs. Then I tried ‘retail_object_detection_binary_v2.1.1.pth’; the loss started at 17 and dropped to 5 after 8 epochs. But training is very slow: after 7-8 hours, only about 5 epochs were done.

May I know where you downloaded retail_object_detection_binary_v2.1.2.pth and retail_object_detection_binary_v2.1.1.pth? Can you share the link?

That pretrained model was not trained with the DINO network. For the DINO network, I suggest using the ones trained with DINO. See DINO | NVIDIA NGC,
TAO Pretrained DINO with Foundational Model Backbone | NVIDIA NGC,
Pre-trained DINO ImageNet weights | NVIDIA NGC, and Pre-trained DINO NvImageNet weights | NVIDIA NGC.
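
If it helps, those models can also be listed and pulled with the NGC CLI (assuming it is installed and configured; take the exact model names and versions from the NGC pages linked above):

# list DINO pretrained models hosted under nvidia/tao (the pattern is an example)
ngc registry model list "nvidia/tao/pretrained_dino*"
# download a specific version into ./models (replace the placeholders)
ngc registry model download-version "nvidia/tao/<model_name>:<version>" --dest ./models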