FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/getting_started_v5.0.0/notebooks/tao_launcher_starter_kit/mask_rcnn/specs/maskrc

Morganh · August 16, 2023, 6:47am

Please create the output directory first: $ mkdir experiment_dir_unpruned

ssanthosh2286 · August 16, 2023, 6:52am

Step 1:
When I try to train from the terminal, it throws an error saying that epoch-0.tlt is not being saved.
Step 2:
So I ran it in Jupyter; epoch-0.tlt was saved properly, but then the previous error was thrown.
Step 3:
So I ran it again in the terminal, and training completed and saved epoch-01. Since we are only testing this data for now, we are running just 2 epochs.
Step 4:
In the evaluation stage, I get the error below.

ssanthosh2286 · August 16, 2023, 6:52am

I have already created the directory

Morganh · August 16, 2023, 6:59am

When you run evaluation, can you use the absolute path for the .tlt model?

Morganh · August 16, 2023, 7:03am

Also, please follow the tip below to modify the ~/.tao_mounts.json file and try again.

ssanthosh2286 · August 16, 2023, 7:05am

Thank you. I will try this and let you know.

ssanthosh2286 · August 17, 2023, 12:21am

I tried keeping the line "DockerOptions": { "user": "{}:{}".format(os.getuid(), os.getgid())} in the mount file .tao_mounts.json, and when I tried to train the model I got the error below.
The second time, I removed the Docker options from the mount file and tried to train the model, but I still got the same error.
I have created the directory experiment_dir_unpruned, but the error is that epoch-0.tlt is not saved. Please check the screenshot below.

[Screenshot 2023-08-17 101415]
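One thing worth checking with the DockerOptions line: JSON cannot evaluate Python, so if the `"{}:{}".format(os.getuid(), os.getgid())` expression ends up in the file as literal text, the launcher will not see a usable uid:gid. Below is a minimal sketch, assuming the mounts file is generated from Python; the helper name `add_user_docker_option` is hypothetical and not part of TAO.

```python
import json
import os
from pathlib import Path


def add_user_docker_option(mounts_path: Path) -> dict:
    """Set DockerOptions.user to the current uid:gid in a mounts file.

    Hypothetical helper for illustration; it keeps any existing keys
    (for example "Mounts") and only adds or overwrites the "user" entry.
    """
    config = {}
    if mounts_path.exists():
        config = json.loads(mounts_path.read_text())

    # The value written to JSON must be a finished string such as
    # "1000:1000", so the format call is evaluated here, in Python.
    user = "{}:{}".format(os.getuid(), os.getgid())
    config.setdefault("DockerOptions", {})["user"] = user

    mounts_path.write_text(json.dumps(config, indent=4))
    return config


# Example call (commented out so nothing is overwritten by accident):
# add_user_docker_option(Path.home() / ".tao_mounts.json")
```

After running it, the file should contain a plain string value for "user" rather than a Python expression.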

ssanthosh2286 · August 17, 2023, 3:23am

Thank you very much for your kind help. I have found the solution.
Previously I trained with very few images: 13 for training and 13 for validation.
Now I have trained with 3000 images, and it is working perfectly.

Morganh · August 17, 2023, 3:27am

Great. Thanks for the info. Glad to know it is working now.

ssanthosh2286 · August 17, 2023, 3:49am

Sorry, I am back again. When I train with num_examples_per_epoch: 50 and total_steps: 100, I get the error below at training cycle 6.

Morganh · August 17, 2023, 4:38am

Could you share training spec file and full training log?

ssanthosh2286 · August 17, 2023, 4:47am

maskrcnn_train_resnet50 (3).txt (2.0 KB)
For multi-GPU, change --gpus based on your machine.
2023-08-17 04:17:14,370 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-08-17 04:17:14,435 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-08-17 04:17:14,532 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-08-17 04:17:17.285797: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-08-17 04:17:17,486 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2023-08-17 04:17:22.086190: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2023-08-17 04:17:22,562 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:22,681 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:22,697 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:23,532 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-g4mwwlhi because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2023-08-17 04:17:23,953 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2023-08-17 04:17:25.616453: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2023-08-17 04:17:25.723723: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:28,443 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:28,480 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-08-17 04:17:28,484 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet50.txt
[INFO] Starting MaskRCNN training.
INFO:tensorflow:Using config: {'model_dir': '/tmp/tmpw5fhf5r', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 4
gpu_options {
allow_growth: true
force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: TWO
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb71890ec70>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Loading pretrained model…
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:254: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:257: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py:258: The name tf.keras.backend.get_session is deprecated. Please use tf.compat.v1.keras.backend.get_session instead.

[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : Building model graph…
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 542.1 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Done calling model_fn.
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [l5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [post_hoc_d5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-class/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn-box/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc6/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fc7/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [class-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l0/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l1/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask-conv-l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_fcn_logits/bias]
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (265 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for epoch 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt.
ERROR:tensorflow:Model diverged with loss = NaN.
[INFO] NaN loss during training.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 321, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 313, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 300, in main
    run_executer(RUN_CONFIG, train_input_fn, eval_input_fn)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 106, in run_executer
    executer.train_and_eval(train_input_fn=train_input_fn, eval_input_fn=eval_input_fn)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py", line 412, in train_and_eval
    train_estimator.train(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1193, in _train_model_default
    return self._train_with_estimator_spec(estimator_spec, worker_hooks,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 750, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1421, in run
    hook.after_run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
Execution status: FAIL
2023-08-17 04:19:20,344 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Morganh · August 17, 2023, 5:04am

Make sure that:

The id under categories in the annotation file starts from 1.
In the annotations dict, the category_id starts from 1 instead of 0.

Please refer to "Data Annotation Format - NVIDIA Docs", "The first class is always not detected in inference - #25 by Morganh", and "Model diverged with loss = nan."
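The two conditions above can be checked mechanically before training. This is a minimal sketch, assuming a COCO-style annotation file with top-level "categories" and "annotations" lists; the function name `check_category_ids` and the file name in the usage comment are hypothetical.

```python
import json


def check_category_ids(coco: dict) -> list:
    """Return a list of human-readable problems with category ids.

    Checks that every id under "categories" is >= 1 and that every
    annotation's "category_id" is >= 1 and refers to a known category.
    """
    problems = []
    category_ids = {cat["id"] for cat in coco.get("categories", [])}

    for cid in sorted(category_ids):
        if cid < 1:
            problems.append(f"category id {cid} is below 1")

    for ann in coco.get("annotations", []):
        cid = ann["category_id"]
        if cid < 1:
            problems.append(f"annotation {ann.get('id')} has category_id {cid}")
        elif cid not in category_ids:
            problems.append(f"annotation {ann.get('id')} uses unknown category_id {cid}")

    return problems


# Example call on a real file (path is a placeholder):
# with open("annotations.json") as f:
#     for problem in check_category_ids(json.load(f)):
#         print(problem)
```

An empty list means both conditions hold; it does not rule out other causes of a NaN loss.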

ssanthosh2286 · August 17, 2023, 5:08am

OK. In my data file, both the id in categories and the category_id in the annotations dict already start from 1.

Morganh · August 17, 2023, 5:11am

Please set total_steps: 250000 and retry.
Refer to https://github.com/NVIDIA/tao_tutorials/blob/cdbafd28fec9da67fbfc4db9288ec0805076ce29/notebooks/tao_launcher_starter_kit/mask_rcnn/specs/maskrcnn_train_resnet50.txt

ssanthosh2286 · August 17, 2023, 5:28am

My data size:
train - 3483 images
val - 104 images
GPUs - 1

What is the minimum step_size we can go for? With a single GPU, that is not feasible for us.
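For a rough sense of scale, the step count can be related to passes over the data with simple arithmetic. This is a sketch, assuming each training step consumes one batch per GPU; the batch size used in the usage note is a hypothetical value, and both function names are made up for illustration.

```python
import math


def steps_for_epochs(num_images: int, epochs: float,
                     batch_size_per_gpu: int, num_gpus: int = 1) -> int:
    """Steps needed to pass over num_images the given number of times."""
    images_per_step = batch_size_per_gpu * num_gpus
    return math.ceil(epochs * num_images / images_per_step)


def epochs_for_steps(total_steps: int, num_images: int,
                     batch_size_per_gpu: int, num_gpus: int = 1) -> float:
    """Number of passes over the data that total_steps corresponds to."""
    return total_steps * batch_size_per_gpu * num_gpus / num_images
```

For example, `steps_for_epochs(3483, 1, batch_size_per_gpu=1)` gives the number of steps for one pass over the 3483 training images on a single GPU, and `epochs_for_steps(250000, 3483, 1)` shows how many passes 250000 steps would amount to.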

Morganh · August 17, 2023, 5:41am

Can you run official notebook successfully? Please use the default dataset and config file.

ssanthosh2286 · August 17, 2023, 5:47am

We first ran the COCO dataset with the same config file, but with only
total_steps: 1
num_examples_per_epoch: 1

It worked perfectly fine.

Morganh · August 17, 2023, 5:49am

As an example, you can try setting total_steps: 720

But please modify learning_rate_steps accordingly.
For example, learning_rate_steps: "[200, 360, 480]"

For 1 GPU, please set a lower batch size.
train_batch_size: 1
eval_batch_size: 1
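One way to keep learning_rate_steps consistent when total_steps changes is to scale the boundaries proportionally. This is only a heuristic sketch, not something the thread or the TAO documentation prescribes; it uses the 720 / [200, 360, 480] example above as the reference schedule, and the function names are hypothetical.

```python
def validate_schedule(total_steps: int, lr_steps: list) -> None:
    """Raise ValueError if the decay boundaries do not fit the run."""
    if any(b <= a for a, b in zip(lr_steps, lr_steps[1:])):
        raise ValueError("learning_rate_steps must be strictly increasing")
    if lr_steps and lr_steps[-1] >= total_steps:
        raise ValueError("last learning_rate_steps value must be below total_steps")


def scale_lr_steps(ref_total: int, ref_steps: list, new_total: int) -> list:
    """Scale decay boundaries by new_total / ref_total (assumed heuristic)."""
    scaled = [max(1, round(step * new_total / ref_total)) for step in ref_steps]
    validate_schedule(new_total, scaled)
    return scaled


def as_spec_line(lr_steps: list) -> str:
    """Format the boundaries the way they appear in the spec file."""
    return f'learning_rate_steps: "{lr_steps}"'
```

Calling `as_spec_line(scale_lr_steps(720, [200, 360, 480], 1440))` produces a line that can be pasted into the spec next to `total_steps: 1440`. The validation step catches the case where a boundary lands at or beyond total_steps, in which case that decay would never take effect.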

ssanthosh2286 · August 17, 2023, 5:57am

Could you please tell me how total_steps and learning_rate_steps are related? Is it an approximation, or something else?
