OCDNet TAO Model Zoo

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : RTX 4070 Ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : OCDNet
• TAO version (toolkit_version): 5.5.0

I’m following the notebook to train the OCDNet model. However, I’m seeing the error below:

tao model ocdnet train -e /specs/train_ocdnet_vit.yaml results_dir=/results/train_1 model.pretrained_model_path=/results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth
2024-10-01 18:51:54,823 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-01 18:51:54,886 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-01 18:51:54,895 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-10-01 13:21:57,508 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'train_ocdnet_vit.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'train_ocdnet_vit.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results/train/status.json
  rank_zero_warn(
Seed set to 1234
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type   | Params
-------------------------------------
0 | criterion | DBLoss | 0     
1 | model     | Model  | 12.9 M
-------------------------------------
12.9 M    Trainable params
0         Non-trainable params
12.9 M    Total params
51.518    Total estimated model params size (MB)
Train results will be saved at: /results/train
loading pretrained model from /results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth

Epoch 0:   0%|          | 0/63 [00:00<?, ?it/s]
Error executing job with overrides: ['results_dir=/results/train_1', 'model.pretrained_model_path=/results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/scripts/train.py", line 109, in main
    run_experiment(cfg)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/scripts/train.py", line 97, in run_experiment
    trainer.fit(ocd_model, dm, ckpt_path=resume_ckpt)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 138, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 242, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 191, in run
    self._optimizer_step(batch_idx, closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 269, in _optimizer_step
    call._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1303, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 152, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 239, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 148, in step
    loss = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 319, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/pl_ocd_model.py", line 179, in training_step
    preds = self.model(batch['img'])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/model.py", line 120, in forward
    backbone_out = self.backbone(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/fan.py", line 424, in forward
    x = self.forward_features(x)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/fan.py", line 386, in forward_features
    x, (Hp, Wp), out_list = self.patch_embed(x, return_feat=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/fan.py", line 156, in forward
    x, out_list = self.backbone.forward_features(x, return_feat=return_feat)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/convnext_utils.py", line 322, in forward_features
    x = self.stages[i](x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/convnext_utils.py", line 210, in forward
    x = self.blocks(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/convnext_utils.py", line 155, in forward
    x = self.mlp(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/backbone/convnext_utils.py", line 100, in forward
    x = self.act(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 696, in forward
    return F.gelu(input, approximate=self.approximate)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.12 GiB. GPU 

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Epoch 0:   0%|          | 0/63 [00:06<?, ?it/s]
WARNING: Logging before flag parsing goes to stderr.
W1001 13:22:13.156101 140350781510080 entrypoint.py:293] Execution status: FAIL
2024-10-01 18:52:13,762 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

The error is CUDA out of memory. I’m using an RTX 4070, and I’m monitoring the GPU with watch -n 2 nvidia-smi. Even after reducing batch_size to 1, the error persists and the GPU process gets killed. No other training job or process is running on the GPU. Any help is highly appreciated.
Here’s the status:

{"date": "10/1/2024", "time": "13:41:9", "status": "STARTED", "verbosity": "INFO", "message": "Starting OCDNet train."}
{"date": "10/1/2024", "time": "13:41:16", "status": "FAILURE", "verbosity": "INFO", "message": "CUDA out of memory. Tried to allocate 3.12 GiB. GPU "}

Below is the YAML spec file (uploaded as a .txt):
train_ocdnet_vit.txt (2.4 KB)

@Morganh

Which dataset are you using?
Also, can you try a lower num_workers?
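
For reference, a minimal sketch of where batch_size and num_workers are set in the training spec, assuming the same layout as the notebook's spec file (the values here are only examples to try, not verified settings):

dataset:
  train_dataset:
      loader:
        batch_size: 1      # smallest batch size, to reduce GPU memory use
        pin_memory: true
        num_workers: 0     # example: try 0 or 1 data-loading workers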

@Morganh thanks for your response. I used the dataset specified in the notebook. I will change num_workers and report back.

I tried the following:

model:
  load_pruned_graph: False
  pruned_graph_path: '/results/prune/pruned_0.1.pth'
  pretrained_model_path: '/results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth'
  backbone: fan_tiny_8_p4_hybrid
  enlarge_feature_map_size: True
  activation_checkpoint: True

train:
  results_dir: /results/train
  num_epochs: 80
  num_gpus: 1
  #resume_training_checkpoint_path: '/results/train/resume.pth'
  checkpoint_interval: 1
  validation_interval: 1
  is_dry_run: False
  precision: fp32
  model_ema: False
  model_ema_decay: 0.999
  trainer:
    clip_grad_norm: 5.0

  optimizer:
    type: Adam
    args:
      lr: 0.001

  lr_scheduler:
    type: WarmupPolyLR
    args:
      warmup_epoch: 3

  post_processing:
    type: SegDetectorRepresenter
    args:
      thresh: 0.3
      box_thresh: 0.55
      max_candidates: 1000
      unclip_ratio: 1.5

  metric:
    type: QuadMetric
    args:
      is_output_polygon: false

dataset:
  train_dataset:
      data_path: ['/data/ocdnet_vit/train']
      args:
        pre_processes:
          - type: IaaAugment
            args:
              - {'type':Fliplr, 'args':{'p':0.5}}
              - {'type': Affine, 'args':{'rotate':[-45,45]}}
              - {'type':Sometimes,'args':{'p':0.2, 'then_list':{'type': GaussianBlur, 'args':{'sigma':[1.5,2.5]}}}}
              - {'type':Resize,'args':{'size':[0.5,3]}}
          - type: EastRandomCropData
            args:
              size: [640,640]
              max_tries: 50
              keep_ratio: true
          - type: MakeBorderMap
            args:
              shrink_ratio: 0.4
              thresh_min: 0.3
              thresh_max: 0.7
          - type: MakeShrinkMap
            args:
              shrink_ratio: 0.4
              min_text_size: 8

        img_mode: BGR
        filter_keys: [img_path,img_name,text_polys,texts,ignore_tags,shape]
        ignore_tags: ['*', '###']
      loader:
        batch_size: 1
        pin_memory: true
        num_workers: 1

  validate_dataset:
      data_path: ['/data/ocdnet_vit/test']
      args:
        pre_processes:
          - type: Resize2D
            args:
              short_size:
                - 1280
                - 736
              resize_text_polys: true
        img_mode: BGR
        filter_keys: []
        ignore_tags: ['*', '###']
      loader:
        batch_size: 1
        pin_memory: false
        num_workers: 1

I’m using the `ICDAR2015` dataset from the OCDNet-ViT tutorial.

However, the problem still persists. Any help is highly appreciated. @Morganh

Can you share the output of $ nvidia-smi and $ docker ps?

Also, your GPU has only 12 GB of memory. You can try the following (see the sketch after the list):

  • Disable (remove) the {'type':Sometimes,'args':{'p':0.2, 'then_list':{'type': GaussianBlur, 'args':{'sigma':[1.5,2.5]}}}} entry in pre_processes.
  • Or lower the size: [640,640] setting in EastRandomCropData.
  • Or use a non-ViT backbone.
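
For example, a minimal sketch of the relevant spec sections with those changes applied. The layout mirrors the config you posted; the smaller crop size and the non-ViT backbone name are examples to double-check against the OCDNet documentation, not verified values:

model:
  backbone: deformable_resnet18        # example non-ViT backbone; confirm the exact name in the OCDNet docs

dataset:
  train_dataset:
      args:
        pre_processes:
          - type: IaaAugment
            args:
              - {'type':Fliplr, 'args':{'p':0.5}}
              - {'type': Affine, 'args':{'rotate':[-45,45]}}
              # Sometimes/GaussianBlur entry removed to save memory
              - {'type':Resize,'args':{'size':[0.5,3]}}
          - type: EastRandomCropData
            args:
              size: [512,512]          # example: smaller crop than [640,640]
              max_tries: 50
              keep_ratio: true

Note that if you switch to a different backbone, the pretrained_model_path would also need a matching checkpoint.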


@Morganh What are the minimum GPU requirements for the ViT backbone?
docker ps is empty right now, but it shows the container details once I run training through the TAO launcher.

Suggest using more than 12 GB of GPU memory.