Mask2Former_inst model training crashed after 1 epoch

Dear @Morganh ,

We are trying to train a Mask2Former_inst model; after 1 epoch, the training automatically crashes.

Below is the configuration.

results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
    img_dir: "/data/raw-data/train"
    batch_size: 8
    num_workers: 2
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 2
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [640]
    train_max_size: 640
    train_crop_size: [640, 640]
    test_min_size: 640
    test_max_size: 640
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 1
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1

Training Section:

print("For multi-GPU, set NUM_TRAIN_GPUS based on your machine.")
os.environ["NUM_TRAIN_GPUS"] = "1"
os.environ["HYDRA_FULL_ERROR"] = "1"
!tao model mask2former train -e $SPECS_DIR/spec_inst1.yaml \
           train.num_gpus=$NUM_TRAIN_GPUS \
           results_dir=$RESULTS_DIR

Training logs:

/usr/local/lib/python3.6/pty.py:84: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUSH) at 0x782256094648>
  pid, fd = os.forkpty()
For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 12:04:17,530 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 12:04:17,581 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 12:04:17,603 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 12:04:17,603 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 06:34:21,081 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=5.39s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.687   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]loading annotations into memory...Done (t=0.88s)
creating index...
index created!

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  2.10it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide  iou = total_area_intersect / total_area_union

                                                                           
loading annotations into memory...
Done (t=5.51s)
creating index...
index created!

Epoch 0: 100%|██████████| 6250/6250 [1:25:22<00:00,  1.22it/s, v_num=1, train_loss=6.460, lr=0.0001]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/7927 [00:00<17:36,  7.50it/s]
Validation DataLoader 0:   0%|          | 2/7927 [00:00<16:02,  8.23it/s]
.
.
.

Validation DataLoader 0: 100%|██████████| 7927/7927 [12:51<00:00, 10.28it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide  iou = total_area_intersect / total_area_union


                                                                            
Epoch 0: 100%|██████████| 6250/6250 [1:38:14<00:00,  1.06it/s, v_num=1, train_loss=6.460, lr=0.0001, val_loss=11.20, mIoU=1.000, all_acc=1.000][2025-01-13 08:15:15,069 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-13 08:15:15,082 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-13 08:15:15,085 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-RTX-A4000'], 'success': False, 'time_lapsed': 6053} to https://api.tao.ngc.nvidia.com.
[2025-01-13 08:15:16,813 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-13 08:15:16,814 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-13 08:15:16,814 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-13 13:45:20,751 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Where are we making a mistake? Please help.

Thanks.

May I know which dGPU you are running? Could you please set a lower batch_size and retry? Thanks.

I am using A4000.

I tried with a lower batch size too, but the same issue occurs. Sometimes it crashes before completing 1 epoch.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0 Off |                  Off |
| 41%   47C    P8              15W / 140W |  15034MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2588      G   /usr/lib/xorg/Xorg                          115MiB |
|    0   N/A  N/A      2808      G   /usr/bin/gnome-shell                         72MiB |
|    0   N/A  N/A     21805      G   ...erProcess --variations-seed-version       50MiB |
|    0   N/A  N/A     24616      G   ...seed-version=20250109-180111.451000       41MiB |
+---------------------------------------------------------------------------------------+

Before running training, can you run $ nvidia-smi? Is it 15034MiB?

Actually, I have started another round of training with a change in the augmentation section (train_max_size: 2048 to train_max_size: 640); that is why it is showing 15GB of memory used. Otherwise it does not take up memory.

Please double check the memory is 0 before training.
$ nvidia-smi

Then trigger training with a size of 480 in the augmentation section and check the memory usage.
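
If it helps to track this over time rather than eyeballing repeated nvidia-smi runs, a small polling script can log the used GPU memory while training is running. This is only a minimal sketch (not part of TAO); it assumes nvidia-smi is on the PATH and uses its standard --query-gpu flags.

# poll_gpu_mem.py - log per-GPU used memory every few seconds while training runs.
# Minimal sketch; assumes `nvidia-smi` is available on the host PATH.
import subprocess
import time

def gpu_memory_used_mib():
    """Return a list with the used memory of each GPU, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), gpu_memory_used_mib(), "MiB")
        time.sleep(5)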

About 15-20 seconds after training stops, it releases the memory.

(base) smarg@smarg-HP-Z1-G9-Tower-Desktop-PC:~/Documents/TAO/Model-Training/BOX-SEGMENTATION_V1.1_MASK2FORMER/data/maskrcnn$ nvidia-smi
Mon Jan 13 14:56:51 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0 Off |                  Off |
| 70%   75C    P2              43W / 140W |  15562MiB / 16376MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2588      G   /usr/lib/xorg/Xorg                          127MiB |
|    0   N/A  N/A      2808      G   /usr/bin/gnome-shell                         72MiB |
|    0   N/A  N/A     21805      G   ...erProcess --variations-seed-version       47MiB |
|    0   N/A  N/A     24616      G   ...seed-version=20250109-180111.451000       39MiB |
|    0   N/A  N/A     26351      C   python                                    15244MiB |
+---------------------------------------------------------------------------------------+
(base) smarg@smarg-HP-Z1-G9-Tower-Desktop-PC:~/Documents/TAO/Model-Training/BOX-SEGMENTATION_V1.1_MASK2FORMER/data/maskrcnn$ nvidia-smi
Mon Jan 13 14:57:39 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0 Off |                  Off |
| 47%   56C    P8              15W / 140W |    394MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2588      G   /usr/lib/xorg/Xorg                          154MiB |
|    0   N/A  N/A      2808      G   /usr/bin/gnome-shell                        132MiB |
|    0   N/A  N/A     21805      G   ...erProcess --variations-seed-version       52MiB |
|    0   N/A  N/A     24616      G   ...seed-version=20250109-180111.451000       30MiB |
+---------------------------------------------------------------------------------------+

Could you please try a lower size?

I have reduced the training batch size to 4. Below is the training config.

results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
    img_dir: "/data/raw-data/train"
    batch_size: 4
    num_workers: 2
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 2
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [640]
    train_max_size: 640
    train_crop_size: [640, 640]
    test_min_size: 640
    test_max_size: 640
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 2
  validation_interval: 2
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1

Training logs:

For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 15:00:15,338 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 15:00:15,395 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 15:00:15,436 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 15:00:15,436 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 09:30:20,219 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=6.10s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.687   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]loading annotations into memory...Done (t=0.96s)
creating index...
index created!

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:01<00:00,  1.70it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide  iou = total_area_intersect / total_area_union

                                                                           
loading annotations into memory...
Done (t=5.71s)
creating index...
index created!

Epoch 0:   4%|▎         | 439/12500 [03:15<1:29:40,  2.24it/s, v_num=1, train_loss=29.80, lr=0.0001]

Memory usage:

Mon Jan 13 15:02:44 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0 Off |                  Off |
| 67%   87C    P2             133W / 140W |   9278MiB / 16376MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2588      G   /usr/lib/xorg/Xorg                          127MiB |
|    0   N/A  N/A      2808      G   /usr/bin/gnome-shell                        129MiB |
|    0   N/A  N/A     21805      G   ...erProcess --variations-seed-version       50MiB |
|    0   N/A  N/A     24616      G   ...seed-version=20250109-180111.451000       55MiB |
|    0   N/A  N/A     28592      C   python                                     8886MiB |

Please suggest which other config params we can change.

Thanks.

In my dataset there is only a single class (1 class, i.e. carton), and my image size is 3x640x640.

Should I change this in the model section, or should it stay as it is?

model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80

Please suggest.
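
As a side note on the num_classes question above: one way to confirm what the head should be configured with is to count the categories in the training annotation file. A minimal sketch, reusing the instance_json path from the spec (only an example; adjust if your path differs):

# Count the categories in the COCO annotation file to sanity-check
# sem_seg_head.num_classes against the dataset and the label_map.
# Minimal sketch; the path below is taken from the spec above.
import json

ann_path = "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
with open(ann_path) as f:
    coco = json.load(f)

categories = coco.get("categories", [])
print("Number of categories:", len(categories))
for cat in categories:
    print(cat["id"], cat["name"])
# If only one 'carton' category is printed, num_classes: 80 does not match the data;
# setting sem_seg_head.num_classes to 1 (with a matching label_map) would be consistent.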

Hi @Morganh

This time I again got the issue before epoch 1 completed.

Below are the training logs.

For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 15:32:21,641 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 15:32:21,699 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 15:32:21,724 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 15:32:21,724 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 10:02:25,245 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=5.68s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.605   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]loading annotations into memory...Done (t=0.90s)
creating index...
index created!

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  2.05it/s]                                                                           
loading annotations into memory...
Done (t=5.68s)
creating index...
index created!

Epoch 0:  77%|███████▋  | 9626/12500 [1:11:30<21:21,  2.24it/s, v_num=1, train_loss=10.10, lr=0.0001][2025-01-13 11:14:20,104 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-13 11:14:20,116 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-13 11:14:20,121 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-RTX-A4000'], 'success': False, 'time_lapsed': 4314} to https://api.tao.ngc.nvidia.com.
[2025-01-13 11:14:21,909 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-13 11:14:21,910 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-13 11:14:21,910 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-13 16:44:25,887 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Please retry with the training batch_size changed from 4 to 1, and num_workers from 2 to 1.

Set the 640 above to a smaller value, for example 320.

Set shm_size to 16G. See below.


Please refer to TAO Launcher - NVIDIA Docs.

Dear @Morganh

I have moved my setup to a 2080 Ti GPU machine and started training with the config you suggested.

For shm_size:

# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tao_configs = {
   "Mounts":[
         # Mapping the Local project directory
        {
            "source": os.environ["LOCAL_PROJECT_DIR"],
            "destination": "/workspace/tao-experiments"
        },
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results_inst"
       }
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         },
        # "user": "{}:{}".format(os.getuid(), os.getgid()),
        "network": "host"
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tao_configs, mfile, indent=4)

For the batch size:

results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/train.json"
    img_dir: "/data/raw-data/train"
    batch_size: 1
    num_workers: 1
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/val.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 1
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [320]
    train_max_size: 320
    train_crop_size: [320, 320]
    test_min_size: 320
    test_max_size: 320
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 2
  validation_interval: 2
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 1
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1

Training cell

print("For multi-GPU, set NUM_TRAIN_GPUS based on your machine.")
os.environ["NUM_TRAIN_GPUS"] = "1"
!tao model mask2former train -e $SPECS_DIR/spec_inst.yaml \
           train.num_gpus=$NUM_TRAIN_GPUS \
           results_dir=$RESULTS_DIR

Training logs:

For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-14 10:08:30,552 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-14 10:08:30,630 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-14 10:08:30,766 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-14 10:08:30,766 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-14 04:38:39,420 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=8.47s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.605   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]loading annotations into memory...Done (t=1.26s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00,  0.89it/s]                                                                           
loading annotations into memory...
Done (t=7.14s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.

Epoch 0:   8%|▊         | 4088/50000 [16:45<3:08:08,  4.07it/s, v_num=1, train_loss=29.20, lr=0.0001]

I will update if it crashes again.

Hi @Morganh

I have changed the configuration but am still getting the issue.

For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-14 10:08:30,552 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-14 10:08:30,630 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-14 10:08:30,766 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-14 10:08:30,766 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-14 04:38:39,420 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=8.47s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.605   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]loading annotations into memory...Done (t=1.26s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:02<00:00,  0.89it/s]                                                                           
loading annotations into memory...
Done (t=7.14s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.

Epoch 0:  33%|███▎      | 16274/50000 [1:10:17<2:25:40,  3.86it/s, v_num=1, train_loss=37.80, lr=0.0001][2025-01-14 05:49:29,861 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-14 05:49:29,883 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-14 05:49:29,888 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-GeForce-RTX-2080-Ti'], 'success': False, 'time_lapsed': 4249} to https://api.tao.ngc.nvidia.com.
[2025-01-14 05:49:31,642 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-14 05:49:31,643 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-14 05:49:31,643 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-14 11:19:38,304 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Please suggest what other options I can try.

Could you run your latest config on the A4000? The A4000 has more GPU memory.

Also, you can use the validation dataset as the training dataset. This can help you narrow down the issue with less data.

Okay.

Dear @Morganh

What I have observed: over time the CPU RAM usage goes up and then training stops.

Please check the attached memory screenshot.

Why is it unable to free CPU RAM? It is increasing over time.

Dear @Morganh

We have tried the A4000 with the suggested config, but we still see the same issue. On the A4000 we also see the CPU RAM increasing over time.

We found memory growth with each step, and the training crashed. Please suggest what the problem could be. The issue occurs on both machines, the 2080 Ti and the A4000.
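
To put numbers on the growth, the resident memory of the training process and its dataloader workers can be logged from the host at a fixed interval. A minimal sketch using psutil (it is an extra dependency, so it may need `pip install psutil`); the PID is whatever ps or nvidia-smi reports for the python training process:

# log_rss.py - periodically log the resident memory (RSS) of a training process
# plus its children (e.g. dataloader workers). Minimal sketch; pass the PID of the
# python training process as the first argument.
import sys
import time

import psutil

def total_rss_mib(pid):
    """Sum the RSS of the process and all of its children, in MiB."""
    proc = psutil.Process(pid)
    total = 0
    for p in [proc] + proc.children(recursive=True):
        try:
            total += p.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and querying
    return total / (1024 * 1024)

if __name__ == "__main__":
    pid = int(sys.argv[1])
    while psutil.pid_exists(pid):
        print(time.strftime("%H:%M:%S"), f"{total_rss_mib(pid):.0f} MiB")
        time.sleep(30)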

To narrow down the CPU memory increase, could you please try to run in the terminal instead of the Jupyter notebook?
You can run docker pull nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt via the terminal.

For the Jupyter notebook, maybe there is a limitation on the memory. Please refer to https://stackoverflow.com/questions/57948003/how-to-increase-jupyter-notebook-memory-limit to increase the Jupyter notebook memory limit.
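
For reference, the change described in that Stack Overflow link boils down to raising the notebook server's buffer limit in its config file. A sketch, assuming the default config location (~/.jupyter/jupyter_notebook_config.py, generated with `jupyter notebook --generate-config` if it does not exist); note this only raises the notebook buffer and does not change the container's memory, so the terminal test above is still the more direct check:

# ~/.jupyter/jupyter_notebook_config.py
# `c` is injected by Jupyter when this config file is loaded.
# Raise the notebook buffer size limit; the value is in bytes (here ~10 GB).
c.NotebookApp.max_buffer_size = 10 * 1024 * 1024 * 1024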

Also, you can increase the SWAP memory.

Yes, I have run the same from the terminal.

First I run the container using the command below and update the dataset paths with respect to the container location.

docker run --runtime=nvidia -it --rm --shm-size=16g -v /home/smarg/Documents/PritamDocsData/TAO/:/home/data nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash

Then I run training using the command below.

mask2former train -e ./spec_inst.yaml train.num_gpus=1 results_dir=./results
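
Since the launcher output only ends with "Execution status: FAIL", it may also be worth inspecting the status file that the earlier logs point at (/results_inst/train/status.json, or ./results/train/status.json for this terminal run) for the last recorded message before the crash. A minimal sketch that parses the file defensively, since the exact format (plain text vs. one JSON entry per line) is an assumption here:

# tail_status.py - print the last few entries of the TAO status log.
# Minimal sketch; the path and the one-JSON-entry-per-line format are assumptions,
# so unparseable lines are printed as raw text.
import json

STATUS_FILE = "./results/train/status.json"  # adjust to your results_dir

with open(STATUS_FILE) as f:
    lines = [line.strip() for line in f if line.strip()]

for line in lines[-10:]:
    try:
        print(json.loads(line))
    except json.JSONDecodeError:
        print(line)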

Below are screenshots of RAM usage at different steps of a single epoch.

Epoch:0 - 0% training

Epoch:0 - 13% training

Epoch:0 - 24% training

Epoch:0 - 40% training

Epoch:0 - 54% training

Epoch:0 - 90% training

Epoch:0 - 96% training

Epoch:0 - Validation

Epoch:0 - Training Stopped/Failed

Please suggest where the gaps could be.

Thanks…