Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) : RTX 4070 Ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : OCDNet
• tao version: toolkit_version: 5.5.0
I’m following the notebook to train the OCDNet model. However, I’m seeing the error below:
tao model ocdnet train -e /specs/train_ocdnet_vit.yaml results_dir=/results/train_1 model.pretrained_model_path=/results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth
2024-10-01 18:51:54,823 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-10-01 18:51:54,886 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-10-01 18:51:54,895 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-10-01 13:21:57,508 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning:
'train_ocdnet_vit.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'train_ocdnet_vit.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results/train/status.json
rank_zero_warn(
Seed set to 1234
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------
0 | criterion | DBLoss | 0
1 | model | Model | 12.9 M
-------------------------------------
12.9 M Trainable params
0 Non-trainable params
12.9 M Total params
51.518 Total estimated model params size (MB)
Train results will be saved at: /results/train
loading pretrained model from /results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth
Epoch 0: 0%| | 0/63 [00:00<?, ?it/s]
Error executing job with overrides: ['results_dir=/results/train_1', 'model.pretrained_model_path=/results/pretrained_ocdnet/ocdnet_vtrainable_ocdnet_vit_v1.0/ocdnet_fan_tiny_2x_icdar.pth']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/scripts/train.py", line 109, in main
run_experiment(cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/scripts/train.py", line 97, in run_experiment
trainer.fit(ocd_model, dm, ckpt_path=resume_ckpt)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1032, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
self.advance()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 138, in run
self.advance(data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 242, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 191, in run
self._optimizer_step(batch_idx, closure)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 269, in _optimizer_step
call._call_lightning_module_hook(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1303, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 152, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 239, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/adam.py", line 148, in step
loss = closure()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
closure_result = closure()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
self._result = self.closure(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
step_output = self._step_fn()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 319, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 391, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/pl_ocd_model.py", line 179, in training_step
preds = self.model(batch['img'])
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/model.py", line 120, in forward
backbone_out = self.backbone(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/fan.py", line 424, in forward
x = self.forward_features(x)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/fan.py", line 386, in forward_features
x, (Hp, Wp), out_list = self.patch_embed(x, return_feat=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/fan.py", line 156, in forward
x, out_list = self.backbone.forward_features(x, return_feat=return_feat)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/convnext_utils.py", line 322, in forward_features
x = self.stages[i](x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/convnext_utils.py", line 210, in forward
x = self.blocks(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/ocdnet/model/backbone/convnext_utils.py", line 155, in forward
x = self.mlp(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/backbone/convnext_utils.py", line 100, in forward
x = self.act(x)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py", line 696, in forward
return F.gelu(input, approximate=self.approximate)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.12 GiB. GPU
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0: 0%| | 0/63 [00:06<?, ?it/s]
WARNING: Logging before flag parsing goes to stderr.
W1001 13:22:13.156101 140350781510080 entrypoint.py:293] Execution status: FAIL
2024-10-01 18:52:13,762 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
The error is CUDA out of memory. I’m using an RTX 4070 Ti (12 GB) and I’m monitoring GPU utilization with watch -n 2 nvidia-smi. Even after reducing batch_size to 1 the error persists and the training process gets killed. No other training or process is running on the GPU. Any help is highly appreciated.
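For reference, here is a quick sanity check (a minimal sketch, run inside the tao-toolkit container) to confirm how much memory PyTorch actually sees before the job allocates anything; an RTX 4070 Ti should report roughly 12 GiB total:

```python
import torch

# Confirm the visible device and its free/total memory before training starts.
print(torch.cuda.get_device_name(0))
free_b, total_b = torch.cuda.mem_get_info(0)  # returns (free, total) in bytes
print(f"free:  {free_b / 1024**3:.2f} GiB")
print(f"total: {total_b / 1024**3:.2f} GiB")
```

If free is already well below total before training starts, something outside this job is holding memory.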
Here’s the status:
{"date": "10/1/2024", "time": "13:41:9", "status": "STARTED", "verbosity": "INFO", "message": "Starting OCDNet train."}
{"date": "10/1/2024", "time": "13:41:16", "status": "FAILURE", "verbosity": "INFO", "message": "CUDA out of memory. Tried to allocate 3.12 GiB. GPU "}
Here’s the YAML spec file (which I uploaded as a .txt attachment):
train_ocdnet_vit.txt (2.4 KB)
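In case it is useful, the attached spec can also be scanned for the settings that typically drive GPU memory use (batch size, crop/input size, workers). This is only a rough helper; the key names it looks for are assumptions about how the OCDNet spec is laid out, not values confirmed from the file:

```python
import yaml  # PyYAML

SPEC_PATH = "train_ocdnet_vit.txt"  # local copy of the attached spec

# Key names that usually affect GPU memory; adjust if the spec uses different ones.
MEMORY_KEYS = {"batch_size", "size", "short_size", "num_workers"}

def walk(node, path=""):
    """Print every key whose name suggests it influences memory use."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in MEMORY_KEYS:
                print(f"{child} = {value}")
            walk(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            walk(item, f"{path}[{i}]")

with open(SPEC_PATH) as f:
    walk(yaml.safe_load(f))
```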