Please provide the following information when requesting support.
• Hardware (RTX 2070 - 8GB)
• Network Type (DINO)
• TLT Version (5.0.0-pyt)
I am trying to train DINO with a fan_small backbone on my custom dataset of 7 object classes. The dataset is serialized as COCO-format JSON. Here is my train.yaml file:
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  precision: fp16
  distributed_strategy: ddp_sharded
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
  num_epochs: 12
  activation_checkpoint: True
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00003-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00003-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00003-of-00004.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00003-of-00004.json
  num_classes: 7
  batch_size: 2
  workers: 8
  augmentation:
    fixed_padding: True
  dataset_type: serialized
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 5
  dropout_ratio: 0.0
  dim_feedforward: 2048
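For reference, I launch training through the TAO launcher roughly as follows (the spec-file path here is only illustrative of where train.yaml sits in my mounted workspace; the results_dir override matches what appears in the log below):

tao model dino train -e /workspace/tao-experiments/specs/train.yaml results_dir=/results/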
When I run training, the run fails with the following output:
For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-10-03 01:01:59,719 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-10-03 01:01:59,790 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-10-03 01:01:59,832 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning:
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen core.hydra.hydra_runner>:107: UserWarning:
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results/train
Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
<frozen core.loggers.api_logging>:245: UserWarning: Log file already exists at /results/train/status.json
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/train/lightning_logs
Serializing 5138412 elements to byte tensors and concatenating them all ...
Serialized dataset takes 2883.27 MiB
Serializing 1706016 elements to byte tensors and concatenating them all ...
Serialized dataset takes 827.40 MiB
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /results/train exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Default process group has not been initialized, please make sure to call init_process_group.
Error executing job with overrides: ['results_dir=/results/']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py>", line 3, in <module>
  File "<frozen cv.dino.scripts.train>", line 209, in <module>
  File "<frozen core.hydra.hydra_runner>", line 107, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.dino.scripts.train>", line 205, in main
  File "<frozen cv.dino.scripts.train>", line 194, in main
  File "<frozen cv.dino.scripts.train>", line 172, in run_experiment
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    self.strategy.setup(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/single_device.py", line 74, in setup
    super().setup(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/strategy.py", line 154, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/strategy.py", line 142, in setup_optimizers
    self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 180, in _init_optimizers_and_lr_schedulers
    optim_conf = model.trainer._call_lightning_module_hook("configure_optimizers", pl_module=model)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "<frozen cv.dino.model.pl_dino_model>", line 136, in configure_optimizers
  File "/usr/local/lib/python3.8/dist-packages/fairscale/optim/oss.py", line 156, in __init__
    self.world_size = dist.get_world_size(self.group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1092, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 527, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 658, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Execution status: FAIL
2023-10-03 01:11:46,095 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.
I have already tried tuning some parameters to optimize resource allocation as described in the documentation, but the error persists. Any help would be appreciated; my current guess, based on the traceback, is sketched below.
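Reading the traceback, the failure happens in configure_optimizers: with distributed_strategy: ddp_sharded, the fairscale OSS optimizer wrapper calls dist.get_world_size(), which requires torch.distributed to be initialized, and on my single-GPU run no process group is ever created. As an untested guess, I would change only the strategy in the train section and leave everything else as above:

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  precision: fp16
  distributed_strategy: ddp   # guess: plain DDP instead of ddp_sharded for a single-GPU run

Is that the right fix, or is there a supported way to keep ddp_sharded (for its memory savings) on one GPU?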