DINO Training failed :: Default process group has not been initialized

Please provide the following information when requesting support.

• Hardware (RTX 2070 - 8GB)
• Network Type (DINO)
• TLT Version (5.0.0-pyt)
I am trying to train DINO with the fan_small backbone on my custom 7-class object detection dataset. The dataset is serialized as COCO-format JSON. Here is my train.yaml file:

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  precision: fp16
  distributed_strategy: ddp_sharded
  optim:
    lr: 0.0002
    lr_backbone: 0.00002
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [11]
    lr_decay: 0.1
  num_epochs: 12
  activation_checkpoint: True

dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location8
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00003-of-00004.json

    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD2
      json_file: /workspace/tao-experiments/sharded/LocationOD2/LocationOD2-shard-00003-of-00004.json
      
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/LocationOD3
      json_file: /workspace/tao-experiments/sharded/LocationOD3/LocationOD3-shard-00003-of-00004.json

  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/Location5
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00003-of-00004.json

      
  num_classes: 7
  batch_size: 2
  workers: 8
  augmentation:
    fixed_padding: True
  dataset_type: serialized
    
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 5
  dropout_ratio: 0.0
  dim_feedforward: 2048

Now when I run training, the output shows this error:

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-10-03 01:01:59,719 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-10-03 01:01:59,790 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-10-03 01:01:59,832 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning: 
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen core.hydra.hydra_runner>:107: UserWarning: 
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results/train
Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
<frozen core.loggers.api_logging>:245: UserWarning: Log file already exists at /results/train/status.json
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/train/lightning_logs
Serializing 5138412 elements to byte tensors and concatenating them all ...
Serialized dataset takes 2883.27 MiB
Serializing 1706016 elements to byte tensors and concatenating them all ...
Serialized dataset takes 827.40 MiB
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /results/train exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Default process group has not been initialized, please make sure to call init_process_group.
Error executing job with overrides: ['results_dir=/results/']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py>", line 3, in <module>
  File "<frozen cv.dino.scripts.train>", line 209, in <module>
  File "<frozen core.hydra.hydra_runner>", line 107, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "<frozen cv.dino.scripts.train>", line 205, in main
  File "<frozen cv.dino.scripts.train>", line 194, in main
  File "<frozen cv.dino.scripts.train>", line 172, in run_experiment
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    self.strategy.setup(self)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/single_device.py", line 74, in setup
    super().setup(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/strategy.py", line 154, in setup
    self.setup_optimizers(trainer)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/strategy.py", line 142, in setup_optimizers
    self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/optimizer.py", line 180, in _init_optimizers_and_lr_schedulers
    optim_conf = model.trainer._call_lightning_module_hook("configure_optimizers", pl_module=model)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "<frozen cv.dino.model.pl_dino_model>", line 136, in configure_optimizers
  File "/usr/local/lib/python3.8/dist-packages/fairscale/optim/oss.py", line 156, in __init__
    self.world_size = dist.get_world_size(self.group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1092, in get_world_size
    return _get_group_size(group)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 527, in _get_group_size
    default_pg = _get_default_group()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 658, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Execution status: FAIL
2023-10-03 01:11:46,095 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

I have tried tuning some parameters to optimize resource allocation as described in the documentation. Any help with this error?
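For context on where this fails: the traceback bottoms out in fairscale's OSS optimizer. With distributed_strategy: ddp_sharded, the Lightning module wraps its optimizer in fairscale's sharded OSS class, and that class queries the default torch.distributed process group in its constructor. On a single-GPU run, Lightning selects the single-device strategy and never calls init_process_group, so the query raises exactly this RuntimeError. A minimal sketch of the same call chain (plain PyTorch plus fairscale, not TAO's actual code):

# Minimal sketch, assuming fairscale is installed. OSS.__init__ runs
#   self.world_size = dist.get_world_size(self.group)
# (see the traceback above), which requires an initialized default
# process group -- something a single-GPU Lightning run never sets up.
import torch
from fairscale.optim.oss import OSS

model = torch.nn.Linear(4, 2)
try:
    optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=1e-3)
except RuntimeError as err:
    print(err)  # "Default process group has not been initialized ..."

If that is the cause here, switching distributed_strategy to ddp (or dropping the key) for single-GPU training should avoid the sharded-optimizer path; notably, the stripped-down config later in this thread omits distributed_strategy and gets past this point.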

Can you share the full command? Also, have you ever run the official notebook successfully?

I am using the official notebook. The training command is:

# If you face out of memory issue, you may reduce the batch size in the spec file by passing dataset.batch_size=2
!tao model dino train \
          -e $SPECS_DIR/train.yaml \
          results_dir=$RESULTS_DIR/

When I run the training with less data and stick closer to the original training config file (train_backbone: False to save memory, batch_size: 1):

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location8/Location8-shard-00003-of-00004.json

  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00000-of-00004.json
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00001-of-00004.json
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00002-of-00004.json
    - image_dir: /workspace/tao-experiments/data/
      json_file: /workspace/tao-experiments/sharded/Location5/Location5-shard-00003-of-00004.json
  num_classes: 7
  batch_size: 1
  workers: 8
  augmentation:
    fixed_padding: False
  dataset_type: serialized
model:
  backbone: fan_tiny
  train_backbone: False
  pretrained_backbone_path: /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_hybrid_tiny_nvimagenet/fan_hybrid_tiny_nvimagenetv2.pth.tar
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

the training starts and runs for a number of steps, then fails with this error:

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2023-10-03 13:30:15,768 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-10-03 13:30:15,834 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
2023-10-03 13:30:15,874 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
INFO: generated new fontManager
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.8/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys:1: UserWarning: 
'train-original.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen core.hydra.hydra_runner>:107: UserWarning: 
'train-original.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results/train
Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_hybrid_tiny_nvimagenet/fan_hybrid_tiny_nvimagenetv2.pth.tar
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias'])
<frozen core.loggers.api_logging>:245: UserWarning: Log file already exists at /results/train/status.json
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /results/train/lightning_logs
Serializing 1496664 elements to byte tensors and concatenating them all ...
Serialized dataset takes 1135.31 MiB
Serializing 1706016 elements to byte tensors and concatenating them all ...
Serialized dataset takes 811.13 MiB
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /results/train exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Adjusting learning rate of group 0 to 2.0000e-04.
Adjusting learning rate of group 1 to 2.0000e-05.

  | Name           | Type             | Params
----------------------------------------------------
0 | model          | DINOModel        | 28.3 M
1 | matcher        | HungarianMatcher | 0     
2 | criterion      | SetCriterion     | 0     
3 | box_processors | PostProcess      | 0     
----------------------------------------------------
19.7 M    Trainable params
8.6 M     Non-trainable params
28.3 M    Total params
113.075   Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|                       | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/usr/local/lib/python3.8/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3435.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Sanity Checking DataLoader 0: 100%|███████████████| 2/2 [00:03<00:00,  1.99s/it]
 Validation mAP : 0.0


 Validation mAP50 : 0.0

Training: 0it [00:00, ?it/s]Starting Training Loop.                             
Epoch 0:   0%|                                      | 0/3202680 [00:00<?, ?it/s]<frozen cv.dino.model.criterion>:201: UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
Epoch 0:   0%|    | 109/3202680 [00:41<337:26:59,  2.64it/s, loss=59.7, v_num=0]/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [7,0,0], thread: [21,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
(the same IndexKernel.cu index-out-of-bounds assertion repeats for many more blocks and threads)
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Error executing job with overrides: ['results_dir=/results/']
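For what it's worth, those IndexKernel.cu assertions are the generic CUDA signature of an out-of-range index, typically a target label that is >= the size of the class dimension. A minimal plain-PyTorch sketch of the same failure mode (an illustration, not TAO's criterion code; requires a CUDA device):

# Hypothetical repro: building a one-hot target via scatter_, as
# DETR-style criteria do, with a label equal to num_classes (7) fires
# the same device-side assert from IndexKernel.cu.
import torch

num_classes = 7
onehot = torch.zeros(4, num_classes, device="cuda")
labels = torch.tensor([0, 3, 6, 7], device="cuda")  # 7 is out of range
onehot.scatter_(1, labels.unsqueeze(1), 1.0)  # device-side assert fires here
torch.cuda.synchronize()  # the asynchronous error surfaces on sync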

Could this be caused by my converted JSON annotations? I generated the JSON with:
!tao model dino convert -e $SPECS_DIR/data2TFrec.yaml
which gives:

Cumulative object statistics
Class map. 
Label in GT: Label in Json file 
b'motorcycle': motorcycle
b'threewheeler': threewheeler
b'car': car
b'nonmotorized': nonmotorized
b'bus': bus
b'bicycle': bicycle
b'truck': truck
Class map. 
Label in GT: Label in Json file 
b'motorcycle': motorcycle
b'threewheeler': threewheeler
b'car': car
b'nonmotorized': nonmotorized
b'bus': bus
b'bicycle': bicycle
b'truck': truck
For the dataset_config in the experiment_spec, please use labels in the Json file, while writing the classmap.

For the dataset_config in the experiment_spec, please use labels in the Json file, while writing the classmap.

Json generation complete.
Json generation complete.
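
One way to test the annotation hypothesis: scan the converted JSONs for category ids that num_classes: 7 cannot accommodate. A hedged sanity check (not part of TAO; the paths match the spec above, but whether ids should be 0-based or 1-based is a convention to confirm against the TAO DINO docs):

# List the category ids actually present in each converted COCO JSON and
# flag annotations whose category_id falls outside [0, NUM_CLASSES).
# If the correct convention turns out to be 1-based, shift the range
# check accordingly -- the printout shows the observed range either way.
import glob
import json

NUM_CLASSES = 7  # dataset.num_classes from the spec

for path in sorted(glob.glob("/workspace/tao-experiments/sharded/*/*.json")):
    with open(path) as f:
        coco = json.load(f)
    cat_ids = sorted({c["id"] for c in coco.get("categories", [])})
    bad = sorted({a["category_id"] for a in coco.get("annotations", [])
                  if not 0 <= a["category_id"] < NUM_CLASSES})
    print(path, "category ids:", cat_ids, "out-of-range:", bad)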

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Please refer to the solution in DINO training gives error about insufficient shared memory (shm) - #14 by paul.doucet.
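Since that topic points at shared memory, a quick generic check from inside the container (not a TAO command) shows whether /dev/shm is the likely culprit; Docker's default of 64 MiB is often too small for 8 dataloader workers:

# Report the container's /dev/shm allotment; a small value here matches
# the failure mode described in the linked topic.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**30:.2f} GiB total, {free / 2**30:.2f} GiB free")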

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.