TAO Toolkit 5.2 (5.2.0.1-pyt1.14.0:Segformer) - OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) Dual A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) 5.2.0.1-pyt1.14.0:
• Training spec file(If have, please share here) See below
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.) See below

Hi,

For the first time I'm having issues during Segformer training. It appears that when training reaches an iteration where the validation_interval is triggered, the container fails with:

OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'

The container is started with:

!tao model segformer train \
    -e $SPECS_DIR/train.yaml \
    -r $RESULTS_DIR \
    -g $NUM_GPUS
The container-to-host filesystem mapping is valid, as I am getting *.pth files as training progresses. Here is the training spec:

train:
  exp_config:
    manual_seed: 49
  checkpoint_interval: 50
  logging_interval: 50
  max_iters: 220
  resume_training_checkpoint_path: null
  validate: True
  validation_interval: 220
  trainer:
    find_unused_parameters: True
    sf_optim:
      lr: 0.00006
model:
  input_height: 800
  input_width: 800
  pretrained_model_path: null
  backbone:
    type: "mit_b5"
dataset:
  data_root: /tlt-pytorch
  input_type: "rgb"
  img_norm_cfg:
    mean:
      - 127.5
      - 127.5
      - 127.5
    std:
      - 127.5
      - 127.5
      - 127.5
    to_rgb: True
  train_dataset:
    img_dir:
      - /data/training/images
    ann_dir:
      - /data/training/masks
    pipeline:
      augmentation_config:
        random_crop:
          crop_size:
            - 700
            - 700
          cat_max_ratio: 0.75
        resize:
          ratio_range:
            - 0.5
            - 2.0
        random_flip:
          prob: 0.5
  val_dataset:
    img_dir: /data/val/images
    ann_dir: /data/val/masks
  palette:
    - seg_class: background
      rgb:
        - 0
        - 0
        - 0
      label_id: 0
      mapping_class: background
    - seg_class: window
      rgb:
        - 255
        - 255
        - 255
      label_id: 1
      mapping_class: foreground
  repeat_data_times: 500
  batch_size: 6
  workers_per_gpu: 24

Here is the detailed log output:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 86/85, 12.9 task/s, elapsed: 7s, ETA: 0s
Error executing job with overrides: ['train.num_gpus=2', 'results_dir=/results']

An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
    assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py>", line 3, in
  File "", line 176, in
  File "", line 107, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "", line 172, in main
  File "", line 162, in main
  File "", line 130, in run_experiment
  File "", line 198, in train_segmentor
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 70, in train
    self.call_hook('after_train_iter')
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "", line 114, in after_train_iter
  File "", line 159, in multi_gpu_test
  File "", line 202, in collect_results_cpu
  File "/usr/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 341) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in
    sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py FAILED

I’ve checked the images and masks; the container reports the correct number for each.

To me it seems like it should be simple and I'm missing something obvious. I notice that when validation is triggered, the .eval_hook directory reported in the error is present on the host.
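
From the traceback, the failure happens inside collect_results_cpu, called from multi_gpu_test during the validation hook, and the rmtree call that fails targets that .eval_hook directory. My understanding of the usual mmcv-style pattern is that each rank dumps its partial validation results into that temporary directory, a barrier is reached, and rank 0 then gathers the parts and deletes the directory; if anything writes into the directory after rmtree has listed its contents (a late writer from another rank, or filesystem placeholder files such as NFS .nfsXXXX entries), the final os.rmdir fails with exactly this "Directory not empty" error. A rough sketch of that pattern, for clarity only (my reconstruction, not the actual TAO/mmcv source; names are assumed from the traceback):

import os
import pickle
import shutil

import torch.distributed as dist


def collect_results_cpu_sketch(result_part, size, tmpdir):
    # Rough sketch of the multi-GPU result-gathering pattern visible in the
    # traceback (collect_results_cpu). NOT the real TAO/mmcv implementation.
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Every rank writes its partial results into the shared tmp directory
    # (this is what creates /results/train/.eval_hook during validation).
    os.makedirs(tmpdir, exist_ok=True)
    with open(os.path.join(tmpdir, f'part_{rank}.pkl'), 'wb') as f:
        pickle.dump(result_part, f)

    # All ranks should reach this point before rank 0 starts reading/deleting.
    dist.barrier()

    if rank != 0:
        return None

    # Rank 0 loads every part, merges them in interleaved order, trims the
    # padding added by the sampler, then removes the tmp directory.
    part_list = []
    for i in range(world_size):
        with open(os.path.join(tmpdir, f'part_{i}.pkl'), 'rb') as f:
            part_list.append(pickle.load(f))
    ordered_results = [r for group in zip(*part_list) for r in group][:size]

    # If a file appears in tmpdir between the directory listing and the final
    # os.rmdir (late writer, stale NFS handle, etc.), shutil.rmtree raises
    # OSError: [Errno 39] Directory not empty - the error in the log above.
    shutil.rmtree(tmpdir)
    return ordered_results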

Cheers

Please double-check whether that directory exists and is not empty:
! tao model segformer run ls /results/train/

If it does, please remove it:
! tao model segformer run rm -rf /results/train

And then retry training.

Hi Morgan,

I ran that and got:

ls: cannot access '/results/train/': No such file or directory

Which makes sense, as the mounts are released when the container stops? And is the /results/train/ location inside the container derived from the spec file?

Thank you

Also, while the job is running, a .eval_hook directory appears on the host file system every time there is a validation run. My current workaround is to schedule the only validation at the end (when max_iters is reached); it still fails there, but at least I have the final .pth model to review. I'm concerned there's something else going on, though.

Thank you.

The $RESULTS_DIR is defined in your notebook. If this path is not available inside the docker, please double-check the cells above.
You can also change to another folder to save the results.
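
For example, you can verify which host folder the launcher maps to /results by reading the TAO mounts file. A minimal check, assuming the default location ~/.tao_mounts.json (adjust the path if your notebook writes it elsewhere):

import json
import os

# The TAO launcher reads its host-to-container path mappings from this file.
# Location assumed to be the default; your notebook may use another path.
mounts_file = os.path.expanduser("~/.tao_mounts.json")
with open(mounts_file) as f:
    mounts = json.load(f)

for m in mounts.get("Mounts", []):
    # The host "source" folder must exist and be writable; the container
    # "destination" is what -r $RESULTS_DIR and the spec file point to.
    print(f'{m["source"]}  ->  {m["destination"]}')

If /results does not appear as a destination here, the results are only being written inside the container.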

Hi @Morganh, understood. I tried that, and it works for the first validation run but then fails in the manner I described above. There's definitely something odd about this model. I've run another dataset and got very good results, but when there's a validation run it fails as well (the first one completed but the second one failed).

Cheers

To narrow down, could you please try with 1 gpu only?

Hi @Morganh, apologies for the delay in responding. I have tried this setting and hit the same problem. I have tried different datasets and hit the same problem. Has anyone internally had this issue?

Also, a clarification question: in the spec file snippet below you will see that I have deliberately set validation_interval above max_iters so that the problem described above does not occur. The question is, if you do this and have validate set to true, does training still use the validation dataset?

train:
  exp_config:
    manual_seed: 49
  checkpoint_interval: 2000
  logging_interval: 2000
  max_iters: 50000
  resume_training_checkpoint_path: null
  validate: true
  validation_interval: 51000
  trainer:
    find_unused_parameters: true
    sf_optim:
      lr: 0.006

Cheers