Hardware:
PC with NVIDIA RTX 2080 Ti (Driver 535.274.02, CUDA 12.2)
Network Type:
Mask R-CNN (TAO Toolkit 5.0.0, TensorFlow 1)
Container Version:
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Command used to launch the container:
docker run --gpus all -it --rm \
--shm-size=32g \
-e TAO_DISABLE_TELEMETRY=1 \
-v ~/Desktop/Rami_FYP/ML_Models/Rami_Data/merged_data/mask_rcnn_workflow_TAO:/workspace/fyp \
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Training command:
mask_rcnn train \
-e /workspace/fyp/specs/maskrcnn_train.prototxt \
-d /workspace/fyp/experiments/maskrcnn_fyp \
-k rami123 \
--gpus 1
Problem Description:
Training starts normally: the model graph is built, and both training and validation complete for the first epoch. When the next epoch starts, the run fails with the following output:
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : Building model graph...
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO : [Training Compute Statistics] 516.6 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4brfspr/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (309 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success...
[MaskRCNN] INFO : Saving checkpoints for epoch 0 into /workspace/fyp/experiments/maskrcnn_fyp/model.epoch-0.tlt.
[MaskRCNN] INFO : Global step 10 (epoch 1/80): total loss: 5.20752 (rpn score loss: 0.61201 rpn box loss: 0.02986 fast_rcnn class loss: 0.06535 fast_rcnn box loss: 0.35009) learning rate: 0.00012
[MaskRCNN] INFO : Global step 20 (epoch 1/80): total loss: 4.63715 (rpn score loss: 0.54459 rpn box loss: 0.04540 fast_rcnn class loss: 0.11666 fast_rcnn box loss: 0.45967) learning rate: 0.00013
[MaskRCNN] INFO : Global step 30 (epoch 1/80): total loss: 4.23867 (rpn score loss: 0.60260 rpn box loss: 0.03222 fast_rcnn class loss: 0.33050 fast_rcnn box loss: 0.08511) learning rate: 0.00015
[MaskRCNN] INFO : Global step 40 (epoch 1/80): total loss: 3.98523 (rpn score loss: 0.38653 rpn box loss: 0.02637 fast_rcnn class loss: 0.11777 fast_rcnn box loss: 0.44554) learning rate: 0.00017
[MaskRCNN] INFO : Global step 50 (epoch 1/80): total loss: 3.70947 (rpn score loss: 0.30903 rpn box loss: 0.02205 fast_rcnn class loss: 0.14553 fast_rcnn box loss: 0.34914) learning rate: 0.00019
[MaskRCNN] INFO : Global step 60 (epoch 1/80): total loss: 3.72080 (rpn score loss: 0.24032 rpn box loss: 0.02516 fast_rcnn class loss: 0.12771 fast_rcnn box loss: 0.47397) learning rate: 0.00021
[MaskRCNN] INFO : Global step 70 (epoch 1/80): total loss: 3.86796 (rpn score loss: 0.17574 rpn box loss: 0.03412 fast_rcnn class loss: 0.13395 fast_rcnn box loss: 0.55011) learning rate: 0.00022
[MaskRCNN] INFO : Global step 80 (epoch 1/80): total loss: 3.82501 (rpn score loss: 0.38071 rpn box loss: 0.03132 fast_rcnn class loss: 0.18896 fast_rcnn box loss: 0.29981) learning rate: 0.00024
[MaskRCNN] INFO : Global step 90 (epoch 1/80): total loss: 4.02203 (rpn score loss: 0.69470 rpn box loss: 0.03337 fast_rcnn class loss: 0.25799 fast_rcnn box loss: 0.17383) learning rate: 0.00026
[MaskRCNN] INFO : Global step 100 (epoch 1/80): total loss: 3.47784 (rpn score loss: 0.16611 rpn box loss: 0.01019 fast_rcnn class loss: 0.14563 fast_rcnn box loss: 0.35927) learning rate: 0.00028
[MaskRCNN] INFO : Global step 110 (epoch 1/80): total loss: 3.51706 (rpn score loss: 0.13525 rpn box loss: 0.04794 fast_rcnn class loss: 0.09923 fast_rcnn box loss: 0.41204) learning rate: 0.00030
[MaskRCNN] INFO : Global step 120 (epoch 1/80): total loss: 3.96364 (rpn score loss: 0.14839 rpn box loss: 0.01674 fast_rcnn class loss: 0.16103 fast_rcnn box loss: 0.65012) learning rate: 0.00031
[MaskRCNN] INFO : Global step 130 (epoch 1/80): total loss: 3.77484 (rpn score loss: 0.13604 rpn box loss: 0.01103 fast_rcnn class loss: 0.19947 fast_rcnn box loss: 0.55928) learning rate: 0.00033
[MaskRCNN] INFO : Global step 140 (epoch 1/80): total loss: 3.86374 (rpn score loss: 0.65349 rpn box loss: 0.04654 fast_rcnn class loss: 0.20532 fast_rcnn box loss: 0.09596) learning rate: 0.00035
[MaskRCNN] INFO : Global step 150 (epoch 1/80): total loss: 3.59504 (rpn score loss: 0.12751 rpn box loss: 0.02448 fast_rcnn class loss: 0.10541 fast_rcnn box loss: 0.49047) learning rate: 0.00037
[MaskRCNN] INFO : Global step 160 (epoch 1/80): total loss: 3.24085 (rpn score loss: 0.19776 rpn box loss: 0.03465 fast_rcnn class loss: 0.11691 fast_rcnn box loss: 0.10234) learning rate: 0.00039
[MaskRCNN] INFO : Global step 170 (epoch 1/80): total loss: 3.53636 (rpn score loss: 0.13899 rpn box loss: 0.01613 fast_rcnn class loss: 0.14474 fast_rcnn box loss: 0.41501) learning rate: 0.00040
[MaskRCNN] INFO : Global step 180 (epoch 1/80): total loss: 3.48006 (rpn score loss: 0.45270 rpn box loss: 0.02109 fast_rcnn class loss: 0.09376 fast_rcnn box loss: 0.13046) learning rate: 0.00042
[MaskRCNN] INFO : Global step 190 (epoch 1/80): total loss: 3.79810 (rpn score loss: 0.18788 rpn box loss: 0.03119 fast_rcnn class loss: 0.14459 fast_rcnn box loss: 0.52509) learning rate: 0.00044
[MaskRCNN] INFO : Global step 200 (epoch 1/80): total loss: 3.89601 (rpn score loss: 0.14355 rpn box loss: 0.04137 fast_rcnn class loss: 0.16103 fast_rcnn box loss: 0.62923) learning rate: 0.00046
[MaskRCNN] INFO : Global step 210 (epoch 1/80): total loss: 3.88345 (rpn score loss: 0.33756 rpn box loss: 0.03372 fast_rcnn class loss: 0.15200 fast_rcnn box loss: 0.42600) learning rate: 0.00048
[MaskRCNN] INFO : Global step 220 (epoch 1/80): total loss: 3.82357 (rpn score loss: 0.11313 rpn box loss: 0.07056 fast_rcnn class loss: 0.14843 fast_rcnn box loss: 0.54826) learning rate: 0.00049
[MaskRCNN] INFO : Global step 230 (epoch 1/80): total loss: 3.29482 (rpn score loss: 0.26319 rpn box loss: 0.02980 fast_rcnn class loss: 0.08147 fast_rcnn box loss: 0.15555) learning rate: 0.00051
[MaskRCNN] INFO : Global step 240 (epoch 1/80): total loss: 3.39277 (rpn score loss: 0.16259 rpn box loss: 0.04577 fast_rcnn class loss: 0.07898 fast_rcnn box loss: 0.28445) learning rate: 0.00053
[MaskRCNN] INFO : Global step 250 (epoch 1/80): total loss: 3.74806 (rpn score loss: 0.10812 rpn box loss: 0.01249 fast_rcnn class loss: 0.13864 fast_rcnn box loss: 0.62587) learning rate: 0.00055
[MaskRCNN] INFO : Global step 260 (epoch 1/80): total loss: 3.59578 (rpn score loss: 0.08890 rpn box loss: 0.02242 fast_rcnn class loss: 0.11853 fast_rcnn box loss: 0.51944) learning rate: 0.00057
[MaskRCNN] INFO : Global step 270 (epoch 1/80): total loss: 3.38003 (rpn score loss: 0.14326 rpn box loss: 0.03004 fast_rcnn class loss: 0.12309 fast_rcnn box loss: 0.29980) learning rate: 0.00058
[MaskRCNN] INFO : Global step 280 (epoch 1/80): total loss: 3.63258 (rpn score loss: 0.10672 rpn box loss: 0.01853 fast_rcnn class loss: 0.16141 fast_rcnn box loss: 0.51917) learning rate: 0.00060
[MaskRCNN] INFO : Global step 290 (epoch 1/80): total loss: 3.50878 (rpn score loss: 0.07796 rpn box loss: 0.00937 fast_rcnn class loss: 0.10412 fast_rcnn box loss: 0.49216) learning rate: 0.00062
[MaskRCNN] INFO : Global step 300 (epoch 1/80): total loss: 3.68461 (rpn score loss: 0.07386 rpn box loss: 0.02691 fast_rcnn class loss: 0.06788 fast_rcnn box loss: 0.65287) learning rate: 0.00064
[MaskRCNN] INFO : Global step 310 (epoch 1/80): total loss: 3.79577 (rpn score loss: 0.12678 rpn box loss: 0.03136 fast_rcnn class loss: 0.17135 fast_rcnn box loss: 0.62608) learning rate: 0.00066
[MaskRCNN] INFO : Global step 320 (epoch 1/80): total loss: 3.49812 (rpn score loss: 0.30382 rpn box loss: 0.01665 fast_rcnn class loss: 0.16422 fast_rcnn box loss: 0.23499) learning rate: 0.00067
[MaskRCNN] INFO : Global step 330 (epoch 1/80): total loss: 3.39509 (rpn score loss: 0.14926 rpn box loss: 0.01127 fast_rcnn class loss: 0.18745 fast_rcnn box loss: 0.29657) learning rate: 0.00069
[MaskRCNN] INFO : Global step 340 (epoch 1/80): total loss: 3.69726 (rpn score loss: 0.07150 rpn box loss: 0.04261 fast_rcnn class loss: 0.10565 fast_rcnn box loss: 0.59651) learning rate: 0.00071
[MaskRCNN] INFO : Global step 350 (epoch 1/80): total loss: 3.66712 (rpn score loss: 0.25832 rpn box loss: 0.03146 fast_rcnn class loss: 0.20199 fast_rcnn box loss: 0.29746) learning rate: 0.00073
[MaskRCNN] INFO : Global step 360 (epoch 1/80): total loss: 3.60247 (rpn score loss: 0.22289 rpn box loss: 0.03379 fast_rcnn class loss: 0.17936 fast_rcnn box loss: 0.38084) learning rate: 0.00075
[MaskRCNN] INFO : Global step 370 (epoch 1/80): total loss: 4.08682 (rpn score loss: 0.65837 rpn box loss: 0.04571 fast_rcnn class loss: 0.26523 fast_rcnn box loss: 0.18315) learning rate: 0.00076
[MaskRCNN] INFO : Global step 380 (epoch 1/80): total loss: 3.67545 (rpn score loss: 0.09191 rpn box loss: 0.01238 fast_rcnn class loss: 0.10415 fast_rcnn box loss: 0.59665) learning rate: 0.00078
[MaskRCNN] INFO : Global step 390 (epoch 1/80): total loss: 3.40259 (rpn score loss: 0.15155 rpn box loss: 0.01665 fast_rcnn class loss: 0.11172 fast_rcnn box loss: 0.35199) learning rate: 0.00080
[MaskRCNN] INFO : Global step 400 (epoch 1/80): total loss: 3.74400 (rpn score loss: 0.38868 rpn box loss: 0.03371 fast_rcnn class loss: 0.32566 fast_rcnn box loss: 0.17396) learning rate: 0.00082
[MaskRCNN] INFO : Global step 410 (epoch 1/80): total loss: 3.49370 (rpn score loss: 0.24986 rpn box loss: 0.02272 fast_rcnn class loss: 0.23084 fast_rcnn box loss: 0.22342) learning rate: 0.00084
[MaskRCNN] INFO : Global step 420 (epoch 1/80): total loss: 3.65861 (rpn score loss: 0.25015 rpn box loss: 0.02878 fast_rcnn class loss: 0.20944 fast_rcnn box loss: 0.23010) learning rate: 0.00085
[MaskRCNN] INFO : Global step 430 (epoch 1/80): total loss: 3.89399 (rpn score loss: 0.41117 rpn box loss: 0.03442 fast_rcnn class loss: 0.39298 fast_rcnn box loss: 0.20511) learning rate: 0.00087
[MaskRCNN] INFO : Global step 440 (epoch 1/80): total loss: 3.63664 (rpn score loss: 0.24047 rpn box loss: 0.02712 fast_rcnn class loss: 0.24869 fast_rcnn box loss: 0.32587) learning rate: 0.00089
[MaskRCNN] INFO : Global step 450 (epoch 1/80): total loss: 3.79023 (rpn score loss: 0.39219 rpn box loss: 0.02427 fast_rcnn class loss: 0.43519 fast_rcnn box loss: 0.10614) learning rate: 0.00091
[MaskRCNN] INFO : Global step 460 (epoch 1/80): total loss: 3.58539 (rpn score loss: 0.23256 rpn box loss: 0.04284 fast_rcnn class loss: 0.23157 fast_rcnn box loss: 0.29416) learning rate: 0.00093
[MaskRCNN] INFO : Global step 470 (epoch 1/80): total loss: 3.92394 (rpn score loss: 0.39608 rpn box loss: 0.02867 fast_rcnn class loss: 0.39164 fast_rcnn box loss: 0.32709) learning rate: 0.00094
[MaskRCNN] INFO : Global step 480 (epoch 1/80): total loss: 3.60576 (rpn score loss: 0.16243 rpn box loss: 0.02292 fast_rcnn class loss: 0.20319 fast_rcnn box loss: 0.43584) learning rate: 0.00096
[MaskRCNN] INFO : Global step 490 (epoch 1/80): total loss: 3.61445 (rpn score loss: 0.04854 rpn box loss: 0.00853 fast_rcnn class loss: 0.11651 fast_rcnn box loss: 0.61977) learning rate: 0.00098
[MaskRCNN] INFO : Global step 500 (epoch 1/80): total loss: 3.56281 (rpn score loss: 0.10191 rpn box loss: 0.01263 fast_rcnn class loss: 0.12622 fast_rcnn box loss: 0.46735) learning rate: 0.00100
[MaskRCNN] INFO : Global step 510 (epoch 1/80): total loss: 3.32369 (rpn score loss: 0.05528 rpn box loss: 0.01359 fast_rcnn class loss: 0.09569 fast_rcnn box loss: 0.38604) learning rate: 0.00100
[MaskRCNN] INFO : Global step 520 (epoch 1/80): total loss: 3.45664 (rpn score loss: 0.07883 rpn box loss: 0.04745 fast_rcnn class loss: 0.10325 fast_rcnn box loss: 0.43433) learning rate: 0.00100
[MaskRCNN] INFO : Global step 530 (epoch 1/80): total loss: 3.53231 (rpn score loss: 0.06332 rpn box loss: 0.01996 fast_rcnn class loss: 0.17204 fast_rcnn box loss: 0.45074) learning rate: 0.00100
[MaskRCNN] INFO : Global step 540 (epoch 1/80): total loss: 3.44095 (rpn score loss: 0.07283 rpn box loss: 0.01560 fast_rcnn class loss: 0.14119 fast_rcnn box loss: 0.42358) learning rate: 0.00100
[MaskRCNN] INFO : Global step 550 (epoch 1/80): total loss: 3.83361 (rpn score loss: 0.20690 rpn box loss: 0.03243 fast_rcnn class loss: 0.34327 fast_rcnn box loss: 0.45627) learning rate: 0.00100
[MaskRCNN] INFO : Global step 560 (epoch 1/80): total loss: 3.31315 (rpn score loss: 0.13814 rpn box loss: 0.01340 fast_rcnn class loss: 0.15641 fast_rcnn box loss: 0.22113) learning rate: 0.00100
[MaskRCNN] INFO : Global step 570 (epoch 1/80): total loss: 3.63777 (rpn score loss: 0.15626 rpn box loss: 0.03240 fast_rcnn class loss: 0.20851 fast_rcnn box loss: 0.34752) learning rate: 0.00100
[MaskRCNN] INFO : Global step 580 (epoch 1/80): total loss: 3.50107 (rpn score loss: 0.08407 rpn box loss: 0.04541 fast_rcnn class loss: 0.11819 fast_rcnn box loss: 0.45455) learning rate: 0.00100
[MaskRCNN] INFO : Global step 590 (epoch 1/80): total loss: 3.61385 (rpn score loss: 0.08391 rpn box loss: 0.02316 fast_rcnn class loss: 0.12032 fast_rcnn box loss: 0.56855) learning rate: 0.00100
[MaskRCNN] INFO : Global step 600 (epoch 1/80): total loss: 4.20732 (rpn score loss: 0.64709 rpn box loss: 0.05505 fast_rcnn class loss: 0.45570 fast_rcnn box loss: 0.20300) learning rate: 0.00100
[MaskRCNN] INFO : Global step 610 (epoch 1/80): total loss: 3.24434 (rpn score loss: 0.07851 rpn box loss: 0.00819 fast_rcnn class loss: 0.08293 fast_rcnn box loss: 0.38932) learning rate: 0.00100
[MaskRCNN] INFO : Global step 620 (epoch 1/80): total loss: 3.54244 (rpn score loss: 0.03246 rpn box loss: 0.01119 fast_rcnn class loss: 0.15867 fast_rcnn box loss: 0.55238) learning rate: 0.00100
[MaskRCNN] INFO : Global step 630 (epoch 1/80): total loss: 3.70553 (rpn score loss: 0.25289 rpn box loss: 0.03375 fast_rcnn class loss: 0.28609 fast_rcnn box loss: 0.26534) learning rate: 0.00100
[MaskRCNN] INFO : Global step 640 (epoch 1/80): total loss: 3.46454 (rpn score loss: 0.04449 rpn box loss: 0.01826 fast_rcnn class loss: 0.11437 fast_rcnn box loss: 0.56433) learning rate: 0.00100
[MaskRCNN] INFO : Global step 650 (epoch 1/80): total loss: 3.44091 (rpn score loss: 0.06358 rpn box loss: 0.03270 fast_rcnn class loss: 0.13000 fast_rcnn box loss: 0.47271) learning rate: 0.00100
[MaskRCNN] INFO : Global step 660 (epoch 1/80): total loss: 3.45321 (rpn score loss: 0.03651 rpn box loss: 0.01053 fast_rcnn class loss: 0.11414 fast_rcnn box loss: 0.47083) learning rate: 0.00100
[MaskRCNN] INFO : Global step 670 (epoch 1/80): total loss: 3.39485 (rpn score loss: 0.05236 rpn box loss: 0.02159 fast_rcnn class loss: 0.11029 fast_rcnn box loss: 0.44708) learning rate: 0.00100
[MaskRCNN] INFO : Global step 680 (epoch 1/80): total loss: 3.55119 (rpn score loss: 0.18477 rpn box loss: 0.03453 fast_rcnn class loss: 0.22440 fast_rcnn box loss: 0.38985) learning rate: 0.00100
[MaskRCNN] INFO : Global step 690 (epoch 1/80): total loss: 3.46719 (rpn score loss: 0.17302 rpn box loss: 0.01830 fast_rcnn class loss: 0.22159 fast_rcnn box loss: 0.35161) learning rate: 0.00100
[MaskRCNN] INFO : Global step 700 (epoch 1/80): total loss: 3.68496 (rpn score loss: 0.25162 rpn box loss: 0.02782 fast_rcnn class loss: 0.31603 fast_rcnn box loss: 0.35100) learning rate: 0.00100
[MaskRCNN] INFO : Global step 710 (epoch 1/80): total loss: 3.59235 (rpn score loss: 0.09769 rpn box loss: 0.03143 fast_rcnn class loss: 0.21235 fast_rcnn box loss: 0.50969) learning rate: 0.00100
[MaskRCNN] INFO : Global step 720 (epoch 1/80): total loss: 3.68763 (rpn score loss: 0.21301 rpn box loss: 0.03312 fast_rcnn class loss: 0.28394 fast_rcnn box loss: 0.42362) learning rate: 0.00100
[MaskRCNN] INFO : Global step 730 (epoch 1/80): total loss: 3.48760 (rpn score loss: 0.08802 rpn box loss: 0.02152 fast_rcnn class loss: 0.19061 fast_rcnn box loss: 0.48414) learning rate: 0.00100
[MaskRCNN] INFO : Global step 740 (epoch 1/80): total loss: 3.44055 (rpn score loss: 0.15659 rpn box loss: 0.02653 fast_rcnn class loss: 0.17202 fast_rcnn box loss: 0.37699) learning rate: 0.00100
[MaskRCNN] INFO : Global step 750 (epoch 1/80): total loss: 3.30072 (rpn score loss: 0.05789 rpn box loss: 0.01760 fast_rcnn class loss: 0.11668 fast_rcnn box loss: 0.39960) learning rate: 0.00100
[MaskRCNN] INFO : Global step 760 (epoch 1/80): total loss: 3.46550 (rpn score loss: 0.09144 rpn box loss: 0.01609 fast_rcnn class loss: 0.17086 fast_rcnn box loss: 0.45357) learning rate: 0.00100
[MaskRCNN] INFO : Global step 770 (epoch 1/80): total loss: 3.62472 (rpn score loss: 0.09353 rpn box loss: 0.06809 fast_rcnn class loss: 0.18071 fast_rcnn box loss: 0.46898) learning rate: 0.00100
[MaskRCNN] INFO : Global step 780 (epoch 1/80): total loss: 3.61292 (rpn score loss: 0.30926 rpn box loss: 0.03466 fast_rcnn class loss: 0.32483 fast_rcnn box loss: 0.16482) learning rate: 0.00100
[MaskRCNN] INFO : Global step 790 (epoch 1/80): total loss: 3.40686 (rpn score loss: 0.08852 rpn box loss: 0.02564 fast_rcnn class loss: 0.17241 fast_rcnn box loss: 0.41358) learning rate: 0.00100
[MaskRCNN] INFO : Global step 800 (epoch 1/80): total loss: 3.38520 (rpn score loss: 0.04533 rpn box loss: 0.02320 fast_rcnn class loss: 0.13714 fast_rcnn box loss: 0.40360) learning rate: 0.00100
[MaskRCNN] INFO : Global step 810 (epoch 1/80): total loss: 3.65710 (rpn score loss: 0.41762 rpn box loss: 0.02824 fast_rcnn class loss: 0.35193 fast_rcnn box loss: 0.13672) learning rate: 0.00100
[MaskRCNN] INFO : Global step 820 (epoch 1/80): total loss: 4.03538 (rpn score loss: 0.40264 rpn box loss: 0.02181 fast_rcnn class loss: 0.57341 fast_rcnn box loss: 0.29650) learning rate: 0.00100
[MaskRCNN] INFO : Global step 830 (epoch 1/80): total loss: 3.27442 (rpn score loss: 0.06124 rpn box loss: 0.01841 fast_rcnn class loss: 0.13731 fast_rcnn box loss: 0.42416) learning rate: 0.00100
[MaskRCNN] INFO : Global step 840 (epoch 1/80): total loss: 3.31723 (rpn score loss: 0.04174 rpn box loss: 0.01203 fast_rcnn class loss: 0.19007 fast_rcnn box loss: 0.40364) learning rate: 0.00100
[MaskRCNN] INFO : Global step 850 (epoch 1/80): total loss: 3.51194 (rpn score loss: 0.15601 rpn box loss: 0.02985 fast_rcnn class loss: 0.24143 fast_rcnn box loss: 0.37560) learning rate: 0.00100
[MaskRCNN] INFO : Global step 860 (epoch 1/80): total loss: 3.31428 (rpn score loss: 0.06729 rpn box loss: 0.03045 fast_rcnn class loss: 0.15430 fast_rcnn box loss: 0.40920) learning rate: 0.00100
[MaskRCNN] INFO : Global step 870 (epoch 1/80): total loss: 3.15262 (rpn score loss: 0.03966 rpn box loss: 0.01307 fast_rcnn class loss: 0.13025 fast_rcnn box loss: 0.34433) learning rate: 0.00100
[MaskRCNN] INFO : Global step 880 (epoch 1/80): total loss: 3.67152 (rpn score loss: 0.34548 rpn box loss: 0.02869 fast_rcnn class loss: 0.34608 fast_rcnn box loss: 0.13637) learning rate: 0.00100
[MaskRCNN] INFO : Global step 890 (epoch 1/80): total loss: 3.51194 (rpn score loss: 0.12095 rpn box loss: 0.02022 fast_rcnn class loss: 0.25010 fast_rcnn box loss: 0.40911) learning rate: 0.00100
[MaskRCNN] INFO : Global step 900 (epoch 1/80): total loss: 3.71499 (rpn score loss: 0.19847 rpn box loss: 0.02536 fast_rcnn class loss: 0.36255 fast_rcnn box loss: 0.36140) learning rate: 0.00100
[MaskRCNN] INFO : Global step 910 (epoch 1/80): total loss: 3.55288 (rpn score loss: 0.17814 rpn box loss: 0.02503 fast_rcnn class loss: 0.33894 fast_rcnn box loss: 0.31949) learning rate: 0.00100
[MaskRCNN] INFO : Global step 920 (epoch 1/80): total loss: 3.76785 (rpn score loss: 0.07082 rpn box loss: 0.03941 fast_rcnn class loss: 0.32042 fast_rcnn box loss: 0.50948) learning rate: 0.00100
[MaskRCNN] INFO : Global step 930 (epoch 1/80): total loss: 3.25494 (rpn score loss: 0.02962 rpn box loss: 0.00551 fast_rcnn class loss: 0.18392 fast_rcnn box loss: 0.38825) learning rate: 0.00100
[MaskRCNN] INFO : Global step 940 (epoch 1/80): total loss: 3.40572 (rpn score loss: 0.11224 rpn box loss: 0.04393 fast_rcnn class loss: 0.26890 fast_rcnn box loss: 0.36776) learning rate: 0.00100
[MaskRCNN] INFO : Global step 950 (epoch 1/80): total loss: 3.28233 (rpn score loss: 0.07506 rpn box loss: 0.01128 fast_rcnn class loss: 0.20046 fast_rcnn box loss: 0.37531) learning rate: 0.00100
[MaskRCNN] INFO : Global step 960 (epoch 1/80): total loss: 3.19567 (rpn score loss: 0.05232 rpn box loss: 0.02162 fast_rcnn class loss: 0.13239 fast_rcnn box loss: 0.35097) learning rate: 0.00100
[MaskRCNN] INFO : Global step 970 (epoch 1/80): total loss: 3.49493 (rpn score loss: 0.05453 rpn box loss: 0.01360 fast_rcnn class loss: 0.31856 fast_rcnn box loss: 0.42678) learning rate: 0.00100
[MaskRCNN] INFO : Global step 980 (epoch 1/80): total loss: 3.46868 (rpn score loss: 0.03224 rpn box loss: 0.03284 fast_rcnn class loss: 0.21213 fast_rcnn box loss: 0.54326) learning rate: 0.00100
[MaskRCNN] INFO : Global step 990 (epoch 1/80): total loss: 3.32783 (rpn score loss: 0.08340 rpn box loss: 0.02152 fast_rcnn class loss: 0.23354 fast_rcnn box loss: 0.32734) learning rate: 0.00100
[MaskRCNN] INFO : Global step 1000 (epoch 1/80): total loss: 3.45508 (rpn score loss: 0.17086 rpn box loss: 0.04418 fast_rcnn class loss: 0.19795 fast_rcnn box loss: 0.35976) learning rate: 0.00100
[MaskRCNN] INFO : Global step 1010 (epoch 1/80): total loss: 3.11894 (rpn score loss: 0.04871 rpn box loss: 0.00996 fast_rcnn class loss: 0.17435 fast_rcnn box loss: 0.31044) learning rate: 0.00100
[MaskRCNN] INFO : Global step 1020 (epoch 1/80): total loss: 3.16048 (rpn score loss: 0.09671 rpn box loss: 0.01771 fast_rcnn class loss: 0.14560 fast_rcnn box loss: 0.27713) learning rate: 0.00100
[INFO] None
[MaskRCNN] INFO : Epoch 1/80: loss: 3.11241 learning rate: 0.00100 Time taken: 0:08:11.191832 ETA: 10:46:44.154752
[MaskRCNN] INFO : Saving checkpoints for epoch 1 into /workspace/fyp/experiments/maskrcnn_fyp/model.epoch-1.tlt.
INFO:tensorflow:Loss for final step: 3.3392217.
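For anyone who wants to plot or compare the per-step losses from the training cycle above, here is a minimal Python sketch that parses the `Global step ...` lines. The regular expressions are written against the line format shown in this log only, and the names `parse_step_line` and `unlisted` are my own, not part of TAO.

```python
import re

# Pattern for the "Global step N (epoch i/n): total loss: ..." lines as they
# appear in this log. This is an assumption based on the output above, not a
# documented TAO log format.
STEP_LINE = re.compile(
    r"Global step (\d+) \(epoch (\d+)/(\d+)\): total loss: ([\d.]+) "
    r"\((.*?)\) learning rate: ([\d.eE+-]+)"
)
# Each itemized component looks like "<name> loss: <value>".
COMPONENT = re.compile(r"([a-z_ ]+?) loss: ([\d.]+)")


def parse_step_line(line):
    """Return a dict for one 'Global step' line, or None if it does not match."""
    m = STEP_LINE.search(line)
    if m is None:
        return None
    step, epoch, total_epochs, total, parts, lr = m.groups()
    components = {
        name.strip(): float(value) for name, value in COMPONENT.findall(parts)
    }
    total = float(total)
    return {
        "step": int(step),
        "epoch": int(epoch),
        "total_epochs": int(total_epochs),
        "total_loss": total,
        "components": components,
        # Share of the total that is not itemized on the log line.
        "unlisted": total - sum(components.values()),
        "learning_rate": float(lr),
    }


if __name__ == "__main__":
    sample = (
        "[MaskRCNN] INFO : Global step 10 (epoch 1/80): total loss: 5.20752 "
        "(rpn score loss: 0.61201 rpn box loss: 0.02986 fast_rcnn class loss: "
        "0.06535 fast_rcnn box loss: 0.35009) learning rate: 0.00012"
    )
    print(parse_step_line(sample))
```

Note that the four itemized components do not add up to the total loss. At step 10 they sum to about 1.06 against a total of 5.21, so most of the total comes from terms that this log line does not print. Which terms those are is not visible in the log itself.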
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start evaluation cycle 01
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : [eval] AMP is activated - Experiment Feature
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpd4brfspr', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
allow_growth: true
force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: TWO
auto_mixed_precision: ON
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc679e89310>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Loading weights from /workspace/fyp/experiments/maskrcnn_fyp/model.epoch-1.tlt
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
[MaskRCNN] INFO : [*] Limiting the amount of sample to: 84
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : Building model graph...
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO : [Inference Compute Statistics] 504.3 GFLOPS/image
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4brfspr/model.ckpt-1024
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[MaskRCNN] INFO : Running inference on batch 001/042... - Step Time: 7.9444s - Throughput: 0.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 002/042... - Step Time: 0.0685s - Throughput: 29.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 003/042... - Step Time: 0.0664s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO : Running inference on batch 004/042... - Step Time: 0.0662s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 005/042... - Step Time: 0.0659s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 006/042... - Step Time: 0.0665s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO : Running inference on batch 007/042... - Step Time: 0.0660s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 008/042... - Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 009/042... - Step Time: 0.0659s - Throughput: 30.4 imgs/s
[MaskRCNN] INFO : Running inference on batch 010/042... - Step Time: 0.0655s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 011/042... - Step Time: 0.0670s - Throughput: 29.8 imgs/s
[MaskRCNN] INFO : Running inference on batch 012/042... - Step Time: 0.0660s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 013/042... - Step Time: 0.0669s - Throughput: 29.9 imgs/s
[MaskRCNN] INFO : Running inference on batch 014/042... - Step Time: 0.0657s - Throughput: 30.4 imgs/s
[MaskRCNN] INFO : Running inference on batch 015/042... - Step Time: 0.0660s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 016/042... - Step Time: 0.0658s - Throughput: 30.4 imgs/s
[MaskRCNN] INFO : Running inference on batch 017/042... - Step Time: 0.0662s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 018/042... - Step Time: 0.0661s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 019/042... - Step Time: 0.0678s - Throughput: 29.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 020/042... - Step Time: 0.0765s - Throughput: 26.1 imgs/s
[MaskRCNN] INFO : Running inference on batch 021/042... - Step Time: 0.0755s - Throughput: 26.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 022/042... - Step Time: 0.0747s - Throughput: 26.8 imgs/s
[MaskRCNN] INFO : Running inference on batch 023/042... - Step Time: 0.1209s - Throughput: 16.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 024/042... - Step Time: 0.0722s - Throughput: 27.7 imgs/s
[MaskRCNN] INFO : Running inference on batch 025/042... - Step Time: 0.0696s - Throughput: 28.7 imgs/s
[MaskRCNN] INFO : Running inference on batch 026/042... - Step Time: 0.0859s - Throughput: 23.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 027/042... - Step Time: 0.0728s - Throughput: 27.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 028/042... - Step Time: 0.0706s - Throughput: 28.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 029/042... - Step Time: 0.0763s - Throughput: 26.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 030/042... - Step Time: 0.0704s - Throughput: 28.4 imgs/s
[MaskRCNN] INFO : Running inference on batch 031/042... - Step Time: 0.0661s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 032/042... - Step Time: 0.0644s - Throughput: 31.0 imgs/s
[MaskRCNN] INFO : Running inference on batch 033/042... - Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 034/042... - Step Time: 0.0664s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO : Running inference on batch 035/042... - Step Time: 0.0664s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO : Running inference on batch 036/042... - Step Time: 0.0661s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 037/042... - Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 038/042... - Step Time: 0.0655s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 039/042... - Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO : Running inference on batch 040/042... - Step Time: 0.0656s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO : Running inference on batch 041/042... - Step Time: 0.0661s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO : Running inference on batch 042/042... - Step Time: 0.0656s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO : Loading and preparing results...
[MaskRCNN] INFO : 0/8400
[MaskRCNN] INFO : 1000/8400
[MaskRCNN] INFO : 2000/8400
[MaskRCNN] INFO : 3000/8400
[MaskRCNN] INFO : 4000/8400
[MaskRCNN] INFO : 5000/8400
[MaskRCNN] INFO : 6000/8400
[MaskRCNN] INFO : 7000/8400
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=1.95s).
Accumulating evaluation results...
DONE (t=0.04s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.057
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.172
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.020
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.034
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.115
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.023
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.084
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.157
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.117
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.266
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=1.96s).
Accumulating evaluation results...
DONE (t=0.04s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.059
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.165
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.019
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.039
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.116
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.025
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.080
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.140
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.003
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.106
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.232
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Evaluation Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Average throughput: -1.0 samples/sec
[MaskRCNN] INFO : Total processed steps: 42
[MaskRCNN] INFO : Total processing time: 0.0h 19m 52s
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : AP: 0.056989938
[MaskRCNN] INFO : AP50: 0.172287852
[MaskRCNN] INFO : AP75: 0.019901801
[MaskRCNN] INFO : APl: 0.115367487
[MaskRCNN] INFO : APm: 0.033995900
[MaskRCNN] INFO : APs: 0.000000000
[MaskRCNN] INFO : ARl: 0.266374260
[MaskRCNN] INFO : ARm: 0.116838045
[MaskRCNN] INFO : ARmax1: 0.023229707
[MaskRCNN] INFO : ARmax10: 0.084196888
[MaskRCNN] INFO : ARmax100: 0.157167524
[MaskRCNN] INFO : ARs: 0.000000000
[MaskRCNN] INFO : mask_AP: 0.059233427
[MaskRCNN] INFO : mask_AP50: 0.165124863
[MaskRCNN] INFO : mask_AP75: 0.019104565
[MaskRCNN] INFO : mask_APl: 0.115515165
[MaskRCNN] INFO : mask_APm: 0.038957115
[MaskRCNN] INFO : mask_APs: 0.000495050
[MaskRCNN] INFO : mask_ARl: 0.231871352
[MaskRCNN] INFO : mask_ARm: 0.105784059
[MaskRCNN] INFO : mask_ARmax1: 0.024784110
[MaskRCNN] INFO : mask_ARmax10: 0.079533681
[MaskRCNN] INFO : mask_ARmax100: 0.139637306
[MaskRCNN] INFO : mask_ARs: 0.002631579
[INFO] Evaluation metrics generated.
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 02
[MaskRCNN] INFO : =================================
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : Building model graph...
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO : [Training Compute Statistics] 516.6 GFLOPS/image
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4brfspr/model.ckpt-1024
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Invalid URL '': No scheme supplied. Perhaps you meant https://?
Execution status: FAIL
root@fe2d81786c8a:/workspace#
Sometimes it also shows:
SSLCertVerificationError: certificate verify failed (telemetry.metropolis.nvidia.com)
Execution status: FAIL
These errors interrupt training immediately.
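The excerpt above shows no Python traceback before `Execution status: FAIL`, so capturing the complete output together with the exit status of the training command may help narrow this down. A minimal sketch, assuming a POSIX shell inside the container; `run_and_log` and the log path in the usage comment are hypothetical names for illustration, not part of TAO:

```shell
# Run a command, write its combined stdout/stderr to a log file,
# print the last lines of that log, and return the command's own exit status.
run_and_log() {
  logfile="$1"
  shift
  status=0
  "$@" > "$logfile" 2>&1 || status=$?
  tail -n 40 "$logfile"
  return "$status"
}

# Usage (hypothetical log path; the training command is the one from this post):
#   run_and_log /workspace/fyp/train.log mask_rcnn train \
#     -e /workspace/fyp/specs/maskrcnn_train.prototxt \
#     -d /workspace/fyp/experiments/maskrcnn_fyp \
#     -k rami123 --gpus 1
#   echo "exit status: $?"
```

The output is redirected to a file rather than piped, so the status returned is that of the command itself; progress can be followed with `tail -f` on the log from a second shell.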
I already set this when launching the container:
TAO_DISABLE_TELEMETRY=1
However, the TAO TF1 container still tries to contact the telemetry server and fails.
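A quick way to confirm that the variable is actually visible to the shell that runs `mask_rcnn train`, as a minimal sketch. The helper name `check_telemetry_flag` is hypothetical and not part of TAO, and this only checks the environment; it says nothing about whether the TF1 container honors the variable:

```shell
# Report whether TAO_DISABLE_TELEMETRY is set to 1 in the current shell.
# Returns 0 when it is, 1 otherwise.
check_telemetry_flag() {
  if [ "${TAO_DISABLE_TELEMETRY:-}" = "1" ]; then
    echo "TAO_DISABLE_TELEMETRY is set to 1"
    return 0
  fi
  echo "TAO_DISABLE_TELEMETRY is not set to 1 (current value: '${TAO_DISABLE_TELEMETRY:-}')"
  return 1
}

# Usage (inside the container, before starting training):
#   check_telemetry_flag
```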
❓ Questions
- Is this a known issue with TAO Toolkit 5.0.0-TF1?
- Is there an officially supported way to fully disable telemetry in the TF1 container?
- Why does TAO still fail even when telemetry is disabled?
Additional Information
The folder structure is correct and exists inside the container:
/workspace/fyp/specs/maskrcnn_train.prototxt
/workspace/fyp/experiments/maskrcnn_fyp
The same project worked before but stopped working once the telemetry error started.
GPU is available and recognized (nvidia-smi works inside container).
Thanks for any help. This is blocking my training.