FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/mask_rcnn/specs/maskrc

Please create the output directory first: $ mkdir experiment_dir_unpruned

Step 1:
When I try to train from the terminal, it fails with an error saying that epoch-0.tlt cannot be saved.
Step 2:
So I ran the training in Jupyter instead. There epoch-0.tlt was saved properly, but it then threw the previous error.
Step 3:
So I ran it in the terminal again. This time training completed and saved epoch-01. Since we are testing this data for the first time, we ran only 2 epochs.
Step 4:
In the evaluation stage I get the error below.

I have already created the directory

When you run evaluation, can you use the absolute path to the .tlt model?

Also, please follow the tip below to modify the ~/.tao_mounts.json file and try again.

Thank you. I will try this and let you know.

  • I tried keeping the line "DockerOptions": { "user": "{}:{}".format(os.getuid(), os.getgid()) } in the mount file .tao_mounts.json and training the model, but I get the error below.

  • The second time, I removed the DockerOptions entry from the mount file and tried to train the model, but I still get the same error.

  • I have created the directory experiment_dir_unpruned, but the error is still that epoch-0.tlt is not saved. Please check the screenshot below.
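For reference, the DockerOptions line quoted above is a Python expression, so it has to be evaluated before it ends up in the JSON file. A minimal sketch of one way to do that is below. The helper name with_docker_user is hypothetical, and the script only prints the resulting JSON instead of overwriting ~/.tao_mounts.json, so please review the output before saving it.

```python
import json
import os
from pathlib import Path


def with_docker_user(mounts: dict, uid: int, gid: int) -> dict:
    """Return a copy of the mounts config with DockerOptions.user set to "uid:gid".

    Hypothetical helper for illustration; it is not part of the TAO launcher.
    """
    updated = dict(mounts)
    options = dict(updated.get("DockerOptions", {}))
    options["user"] = "{}:{}".format(uid, gid)
    updated["DockerOptions"] = options
    return updated


# Read the existing mounts file if there is one, otherwise start from an empty config.
mounts_path = Path.home() / ".tao_mounts.json"
current = json.loads(mounts_path.read_text()) if mounts_path.exists() else {}

# Print the proposed config; nothing is written to disk here.
print(json.dumps(with_docker_user(current, os.getuid(), os.getgid()), indent=4))
```

Running the container as the host user is meant to avoid files in the results directory being owned by root, which is one possible reason a checkpoint cannot be written.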

Thank you very much for your kind help. I have found the solution.
Previously I trained with very few images: 13 training images and 13 validation images.
Now I have trained with 3000 images, and it works perfectly.

Great. Thanks for the info. Glad to know it is working now.

Sorry, I am back again. When I train with num_examples_per_epoch: 50 and total_steps: 100, I get the error below at training cycle 6.

Could you share the training spec file and the full training log?

maskrcnn_train_resnet50 (3).txt (2.0 KB)
For multi-GPU, change --gpus based on your machine.
2023-08-17 04:17:14,370 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-08-17 04:17:14,435 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-08-17 04:17:14,532 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-08-17 04:17:17.285797: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-08-17 04:17:17,486 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2023-08-17 04:17:22.086190: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2023-08-17 04:17:22,562 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:22,681 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:22,697 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:23,532 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-g4mwwlhi because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2023-08-17 04:17:23,953 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2023-08-17 04:17:25.616453: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2023-08-17 04:17:25.723723: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:28,443 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:28,480 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:28,484 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
[INFO] Starting MaskRCNN training.
INFO:tensorflow:Using config: {'model_dir': '/tmp/tmpw5fhf5r', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 4
gpu_options {
allow_growth: true
force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: TWO
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb71890ec70>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Loading pretrained model…
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:254: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:257: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:258: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : Building model graph…
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 542.1 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

INFO:tensorflow:Done calling model_fn.
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for epoch 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt.
ERROR:tensorflow:Model diverged with loss = NaN.
[INFO] NaN loss during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 321, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 313, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 300, in main
    run_executer(RUN_CONFIG, train_input_fn, eval_input_fn)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 106, in run_executer
    executer.train_and_eval(train_input_fn=train_input_fn, eval_input_fn=eval_input_fn)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py", line 412, in train_and_eval
    train_estimator.train(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1193, in _train_model_default
    return self._train_with_estimator_spec(estimator_spec, worker_hooks,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 750, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1421, in run
    hook.after_run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
Execution status: FAIL
2023-08-17 04:19:20,344 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Please make sure that:

  • the id values under categories in the annotation file start from 1;
  • in the annotations dict, category_id starts from 1 instead of 0.
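The two checks above can be automated with a short script. This is only a sketch for a COCO-style annotation dict; the function name find_category_id_problems is hypothetical, and the extra check for category_id values that are not defined under categories is an additional sanity check, not something stated in the TAO docs.

```python
def find_category_id_problems(coco: dict) -> list:
    """Return human-readable problems with category ids in a COCO-style dict."""
    problems = []

    # Check 1: ids under "categories" should start from 1, so none may be below 1.
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    if category_ids and min(category_ids) < 1:
        problems.append(
            "categories contain id {}; ids should start from 1".format(min(category_ids))
        )

    # Check 2: every annotation's category_id should be >= 1.
    # Extra check (assumption): it should also refer to a defined category.
    for ann in coco.get("annotations", []):
        cid = ann.get("category_id")
        if cid is None or cid < 1:
            problems.append("annotation {} has category_id {}".format(ann.get("id"), cid))
        elif cid not in category_ids:
            problems.append(
                "annotation {} uses undefined category_id {}".format(ann.get("id"), cid)
            )
    return problems


# Tiny in-memory example; for a real file, load it with json.load() first.
sample = {
    "categories": [{"id": 0, "name": "defect"}],
    "annotations": [{"id": 10, "image_id": 1, "category_id": 0}],
}
for problem in find_category_id_problems(sample):
    print(problem)
```

An empty list means both conditions hold, so the NaN loss would have to come from somewhere else.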

Please refer to "Data Annotation Format - NVIDIA Docs" and to the forum topics "The first class is always not detected in inference - #25 by Morganh" and "Model diverged with loss = nan."

OK. In my data file, both the id in categories and the category_id in the annotations dict already start from 1.

Please set total_steps: 250000 and retry.
Refer to https://github.com/NVIDIA/tao_tutorials/blob/cdbafd28fec9da67fbfc4db9288ec0805076ce29/notebooks/tao_launcher_starter_kit/mask_rcnn/specs/maskrcnn_train_resnet50.txt

My data size:
train - 3483 images
val - 104 images
GPUs - 1

What is the minimum number of steps we can use? With a single GPU, that many steps is not practical for us.

Can you run the official notebook successfully? Please use the default dataset and config file.

We first ran the COCO dataset with the same config file, changing only
total_steps: 1
num_examples_per_epoch: 1

It ran perfectly fine.

As an example, you can try setting total_steps: 720

But please modify learning_rate_steps accordingly.
For example, learning_rate_steps: "[200, 360, 480]"

For 1 GPU, please set a lower batch size.
train_batch_size: 1
eval_batch_size: 1
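To see how the example values relate to each other, the arithmetic can be sketched as below. This is not TAO's own logic: the function names are hypothetical, the validation rules (milestones strictly increasing and below total_steps) are a sanity check chosen for this sketch, and the epoch estimate assumes each step consumes train_batch_size images per GPU.

```python
def milestone_fractions(total_steps, learning_rate_steps):
    """Express each learning-rate milestone as a fraction of total_steps."""
    pairs = zip(learning_rate_steps, learning_rate_steps[1:])
    if any(later <= earlier for earlier, later in pairs):
        raise ValueError("learning_rate_steps must be strictly increasing")
    if learning_rate_steps and learning_rate_steps[-1] >= total_steps:
        raise ValueError("the last milestone must be below total_steps")
    return [step / total_steps for step in learning_rate_steps]


def approx_epochs(total_steps, train_batch_size, num_gpus, num_images):
    """Rough number of passes over the training set (assumption, see lead-in)."""
    return total_steps * train_batch_size * num_gpus / num_images


# Values taken from this thread: 720 steps, milestones [200, 360, 480],
# batch size 1, 1 GPU, 3483 training images.
fractions = milestone_fractions(720, [200, 360, 480])
print([round(f, 3) for f in fractions])
print(round(approx_epochs(720, 1, 1, 3483), 3))
```

With these numbers the milestones fall at roughly 28%, 50% and 67% of the run, and 720 steps at batch size 1 cover only a fraction of one pass over 3483 images, so this setting is a quick smoke test, not a full training run.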

Could you please tell me how total_steps and learning_rate_steps are related? Is it an approximation, or something else?