TAO Toolkit 5.0.0-TF1 MaskRCNN fails with telemetry error even when disabled (“Execution status: FAIL”)

Hardware:
PC with NVIDIA RTX 2080 Ti (Driver 535.274.02, CUDA 12.2)

Network Type:
Mask R-CNN (TAO Toolkit 5.0.0 — TensorFlow1)

Container Version:

nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

Command used to launch the container:

docker run --gpus all -it --rm \
   --shm-size=32g \
   -e TAO_DISABLE_TELEMETRY=1 \
   -v ~/Desktop/Rami_FYP/ML_Models/Rami_Data/merged_data/mask_rcnn_workflow_TAO:/workspace/fyp \
   nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

Training command:

mask_rcnn train \
  -e /workspace/fyp/specs/maskrcnn_train.prototxt \
  -d /workspace/fyp/experiments/maskrcnn_fyp \
  -k rami123 \
  --gpus 1
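
To rule out the variable simply not reaching the container, a quick check can be run inside the container before training. This is only a minimal sketch: the helper name `telemetry_flag` is made up for illustration, and it confirms only that the value passed with `-e` is visible to Python. Whether the toolkit actually reads `TAO_DISABLE_TELEMETRY` is an assumption on my side and not something this check can prove.

```python
import os


def telemetry_flag(env=None):
    """Return the value of TAO_DISABLE_TELEMETRY, or None if it is unset.

    `env` defaults to the real process environment; a plain dict can be
    passed instead, which makes the helper easy to try out.
    """
    env = os.environ if env is None else env
    return env.get("TAO_DISABLE_TELEMETRY")


if __name__ == "__main__":
    value = telemetry_flag()
    if value is None:
        print("TAO_DISABLE_TELEMETRY is not set in this environment")
    else:
        print(f"TAO_DISABLE_TELEMETRY={value}")
```

If this reports the variable as unset, the problem is in how the container was launched rather than in the training run itself.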


Problem Description

Training starts normally: it loads the model graph and completes the training and evaluation cycles for the first epoch. When the next epoch starts, it fails with this message:

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 01
[MaskRCNN] INFO    : =================================
    
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 516.6 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4brfspr/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (309 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for epoch 0 into /workspace/fyp/experiments/maskrcnn_fyp/model.epoch-0.tlt.
[MaskRCNN] INFO    : Global step 10 (epoch 1/80): total loss: 5.20752 (rpn score loss: 0.61201 rpn box loss: 0.02986 fast_rcnn class loss: 0.06535 fast_rcnn box loss: 0.35009) learning rate: 0.00012
[MaskRCNN] INFO    : Global step 20 (epoch 1/80): total loss: 4.63715 (rpn score loss: 0.54459 rpn box loss: 0.04540 fast_rcnn class loss: 0.11666 fast_rcnn box loss: 0.45967) learning rate: 0.00013
[MaskRCNN] INFO    : Global step 30 (epoch 1/80): total loss: 4.23867 (rpn score loss: 0.60260 rpn box loss: 0.03222 fast_rcnn class loss: 0.33050 fast_rcnn box loss: 0.08511) learning rate: 0.00015
[MaskRCNN] INFO    : Global step 40 (epoch 1/80): total loss: 3.98523 (rpn score loss: 0.38653 rpn box loss: 0.02637 fast_rcnn class loss: 0.11777 fast_rcnn box loss: 0.44554) learning rate: 0.00017
[MaskRCNN] INFO    : Global step 50 (epoch 1/80): total loss: 3.70947 (rpn score loss: 0.30903 rpn box loss: 0.02205 fast_rcnn class loss: 0.14553 fast_rcnn box loss: 0.34914) learning rate: 0.00019
[MaskRCNN] INFO    : Global step 60 (epoch 1/80): total loss: 3.72080 (rpn score loss: 0.24032 rpn box loss: 0.02516 fast_rcnn class loss: 0.12771 fast_rcnn box loss: 0.47397) learning rate: 0.00021
[MaskRCNN] INFO    : Global step 70 (epoch 1/80): total loss: 3.86796 (rpn score loss: 0.17574 rpn box loss: 0.03412 fast_rcnn class loss: 0.13395 fast_rcnn box loss: 0.55011) learning rate: 0.00022
[MaskRCNN] INFO    : Global step 80 (epoch 1/80): total loss: 3.82501 (rpn score loss: 0.38071 rpn box loss: 0.03132 fast_rcnn class loss: 0.18896 fast_rcnn box loss: 0.29981) learning rate: 0.00024
[MaskRCNN] INFO    : Global step 90 (epoch 1/80): total loss: 4.02203 (rpn score loss: 0.69470 rpn box loss: 0.03337 fast_rcnn class loss: 0.25799 fast_rcnn box loss: 0.17383) learning rate: 0.00026
[MaskRCNN] INFO    : Global step 100 (epoch 1/80): total loss: 3.47784 (rpn score loss: 0.16611 rpn box loss: 0.01019 fast_rcnn class loss: 0.14563 fast_rcnn box loss: 0.35927) learning rate: 0.00028
[MaskRCNN] INFO    : Global step 110 (epoch 1/80): total loss: 3.51706 (rpn score loss: 0.13525 rpn box loss: 0.04794 fast_rcnn class loss: 0.09923 fast_rcnn box loss: 0.41204) learning rate: 0.00030
[MaskRCNN] INFO    : Global step 120 (epoch 1/80): total loss: 3.96364 (rpn score loss: 0.14839 rpn box loss: 0.01674 fast_rcnn class loss: 0.16103 fast_rcnn box loss: 0.65012) learning rate: 0.00031
[MaskRCNN] INFO    : Global step 130 (epoch 1/80): total loss: 3.77484 (rpn score loss: 0.13604 rpn box loss: 0.01103 fast_rcnn class loss: 0.19947 fast_rcnn box loss: 0.55928) learning rate: 0.00033
[MaskRCNN] INFO    : Global step 140 (epoch 1/80): total loss: 3.86374 (rpn score loss: 0.65349 rpn box loss: 0.04654 fast_rcnn class loss: 0.20532 fast_rcnn box loss: 0.09596) learning rate: 0.00035
[MaskRCNN] INFO    : Global step 150 (epoch 1/80): total loss: 3.59504 (rpn score loss: 0.12751 rpn box loss: 0.02448 fast_rcnn class loss: 0.10541 fast_rcnn box loss: 0.49047) learning rate: 0.00037
[MaskRCNN] INFO    : Global step 160 (epoch 1/80): total loss: 3.24085 (rpn score loss: 0.19776 rpn box loss: 0.03465 fast_rcnn class loss: 0.11691 fast_rcnn box loss: 0.10234) learning rate: 0.00039
[MaskRCNN] INFO    : Global step 170 (epoch 1/80): total loss: 3.53636 (rpn score loss: 0.13899 rpn box loss: 0.01613 fast_rcnn class loss: 0.14474 fast_rcnn box loss: 0.41501) learning rate: 0.00040
[MaskRCNN] INFO    : Global step 180 (epoch 1/80): total loss: 3.48006 (rpn score loss: 0.45270 rpn box loss: 0.02109 fast_rcnn class loss: 0.09376 fast_rcnn box loss: 0.13046) learning rate: 0.00042
[MaskRCNN] INFO    : Global step 190 (epoch 1/80): total loss: 3.79810 (rpn score loss: 0.18788 rpn box loss: 0.03119 fast_rcnn class loss: 0.14459 fast_rcnn box loss: 0.52509) learning rate: 0.00044
[MaskRCNN] INFO    : Global step 200 (epoch 1/80): total loss: 3.89601 (rpn score loss: 0.14355 rpn box loss: 0.04137 fast_rcnn class loss: 0.16103 fast_rcnn box loss: 0.62923) learning rate: 0.00046
[MaskRCNN] INFO    : Global step 210 (epoch 1/80): total loss: 3.88345 (rpn score loss: 0.33756 rpn box loss: 0.03372 fast_rcnn class loss: 0.15200 fast_rcnn box loss: 0.42600) learning rate: 0.00048
[MaskRCNN] INFO    : Global step 220 (epoch 1/80): total loss: 3.82357 (rpn score loss: 0.11313 rpn box loss: 0.07056 fast_rcnn class loss: 0.14843 fast_rcnn box loss: 0.54826) learning rate: 0.00049
[MaskRCNN] INFO    : Global step 230 (epoch 1/80): total loss: 3.29482 (rpn score loss: 0.26319 rpn box loss: 0.02980 fast_rcnn class loss: 0.08147 fast_rcnn box loss: 0.15555) learning rate: 0.00051
[MaskRCNN] INFO    : Global step 240 (epoch 1/80): total loss: 3.39277 (rpn score loss: 0.16259 rpn box loss: 0.04577 fast_rcnn class loss: 0.07898 fast_rcnn box loss: 0.28445) learning rate: 0.00053
[MaskRCNN] INFO    : Global step 250 (epoch 1/80): total loss: 3.74806 (rpn score loss: 0.10812 rpn box loss: 0.01249 fast_rcnn class loss: 0.13864 fast_rcnn box loss: 0.62587) learning rate: 0.00055
[MaskRCNN] INFO    : Global step 260 (epoch 1/80): total loss: 3.59578 (rpn score loss: 0.08890 rpn box loss: 0.02242 fast_rcnn class loss: 0.11853 fast_rcnn box loss: 0.51944) learning rate: 0.00057
[MaskRCNN] INFO    : Global step 270 (epoch 1/80): total loss: 3.38003 (rpn score loss: 0.14326 rpn box loss: 0.03004 fast_rcnn class loss: 0.12309 fast_rcnn box loss: 0.29980) learning rate: 0.00058
[MaskRCNN] INFO    : Global step 280 (epoch 1/80): total loss: 3.63258 (rpn score loss: 0.10672 rpn box loss: 0.01853 fast_rcnn class loss: 0.16141 fast_rcnn box loss: 0.51917) learning rate: 0.00060
[MaskRCNN] INFO    : Global step 290 (epoch 1/80): total loss: 3.50878 (rpn score loss: 0.07796 rpn box loss: 0.00937 fast_rcnn class loss: 0.10412 fast_rcnn box loss: 0.49216) learning rate: 0.00062
[MaskRCNN] INFO    : Global step 300 (epoch 1/80): total loss: 3.68461 (rpn score loss: 0.07386 rpn box loss: 0.02691 fast_rcnn class loss: 0.06788 fast_rcnn box loss: 0.65287) learning rate: 0.00064
[MaskRCNN] INFO    : Global step 310 (epoch 1/80): total loss: 3.79577 (rpn score loss: 0.12678 rpn box loss: 0.03136 fast_rcnn class loss: 0.17135 fast_rcnn box loss: 0.62608) learning rate: 0.00066
[MaskRCNN] INFO    : Global step 320 (epoch 1/80): total loss: 3.49812 (rpn score loss: 0.30382 rpn box loss: 0.01665 fast_rcnn class loss: 0.16422 fast_rcnn box loss: 0.23499) learning rate: 0.00067
[MaskRCNN] INFO    : Global step 330 (epoch 1/80): total loss: 3.39509 (rpn score loss: 0.14926 rpn box loss: 0.01127 fast_rcnn class loss: 0.18745 fast_rcnn box loss: 0.29657) learning rate: 0.00069
[MaskRCNN] INFO    : Global step 340 (epoch 1/80): total loss: 3.69726 (rpn score loss: 0.07150 rpn box loss: 0.04261 fast_rcnn class loss: 0.10565 fast_rcnn box loss: 0.59651) learning rate: 0.00071
[MaskRCNN] INFO    : Global step 350 (epoch 1/80): total loss: 3.66712 (rpn score loss: 0.25832 rpn box loss: 0.03146 fast_rcnn class loss: 0.20199 fast_rcnn box loss: 0.29746) learning rate: 0.00073
[MaskRCNN] INFO    : Global step 360 (epoch 1/80): total loss: 3.60247 (rpn score loss: 0.22289 rpn box loss: 0.03379 fast_rcnn class loss: 0.17936 fast_rcnn box loss: 0.38084) learning rate: 0.00075
[MaskRCNN] INFO    : Global step 370 (epoch 1/80): total loss: 4.08682 (rpn score loss: 0.65837 rpn box loss: 0.04571 fast_rcnn class loss: 0.26523 fast_rcnn box loss: 0.18315) learning rate: 0.00076
[MaskRCNN] INFO    : Global step 380 (epoch 1/80): total loss: 3.67545 (rpn score loss: 0.09191 rpn box loss: 0.01238 fast_rcnn class loss: 0.10415 fast_rcnn box loss: 0.59665) learning rate: 0.00078
[MaskRCNN] INFO    : Global step 390 (epoch 1/80): total loss: 3.40259 (rpn score loss: 0.15155 rpn box loss: 0.01665 fast_rcnn class loss: 0.11172 fast_rcnn box loss: 0.35199) learning rate: 0.00080
[MaskRCNN] INFO    : Global step 400 (epoch 1/80): total loss: 3.74400 (rpn score loss: 0.38868 rpn box loss: 0.03371 fast_rcnn class loss: 0.32566 fast_rcnn box loss: 0.17396) learning rate: 0.00082
[MaskRCNN] INFO    : Global step 410 (epoch 1/80): total loss: 3.49370 (rpn score loss: 0.24986 rpn box loss: 0.02272 fast_rcnn class loss: 0.23084 fast_rcnn box loss: 0.22342) learning rate: 0.00084
[MaskRCNN] INFO    : Global step 420 (epoch 1/80): total loss: 3.65861 (rpn score loss: 0.25015 rpn box loss: 0.02878 fast_rcnn class loss: 0.20944 fast_rcnn box loss: 0.23010) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 430 (epoch 1/80): total loss: 3.89399 (rpn score loss: 0.41117 rpn box loss: 0.03442 fast_rcnn class loss: 0.39298 fast_rcnn box loss: 0.20511) learning rate: 0.00087
[MaskRCNN] INFO    : Global step 440 (epoch 1/80): total loss: 3.63664 (rpn score loss: 0.24047 rpn box loss: 0.02712 fast_rcnn class loss: 0.24869 fast_rcnn box loss: 0.32587) learning rate: 0.00089
[MaskRCNN] INFO    : Global step 450 (epoch 1/80): total loss: 3.79023 (rpn score loss: 0.39219 rpn box loss: 0.02427 fast_rcnn class loss: 0.43519 fast_rcnn box loss: 0.10614) learning rate: 0.00091
[MaskRCNN] INFO    : Global step 460 (epoch 1/80): total loss: 3.58539 (rpn score loss: 0.23256 rpn box loss: 0.04284 fast_rcnn class loss: 0.23157 fast_rcnn box loss: 0.29416) learning rate: 0.00093
[MaskRCNN] INFO    : Global step 470 (epoch 1/80): total loss: 3.92394 (rpn score loss: 0.39608 rpn box loss: 0.02867 fast_rcnn class loss: 0.39164 fast_rcnn box loss: 0.32709) learning rate: 0.00094
[MaskRCNN] INFO    : Global step 480 (epoch 1/80): total loss: 3.60576 (rpn score loss: 0.16243 rpn box loss: 0.02292 fast_rcnn class loss: 0.20319 fast_rcnn box loss: 0.43584) learning rate: 0.00096
[MaskRCNN] INFO    : Global step 490 (epoch 1/80): total loss: 3.61445 (rpn score loss: 0.04854 rpn box loss: 0.00853 fast_rcnn class loss: 0.11651 fast_rcnn box loss: 0.61977) learning rate: 0.00098
[MaskRCNN] INFO    : Global step 500 (epoch 1/80): total loss: 3.56281 (rpn score loss: 0.10191 rpn box loss: 0.01263 fast_rcnn class loss: 0.12622 fast_rcnn box loss: 0.46735) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 510 (epoch 1/80): total loss: 3.32369 (rpn score loss: 0.05528 rpn box loss: 0.01359 fast_rcnn class loss: 0.09569 fast_rcnn box loss: 0.38604) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 520 (epoch 1/80): total loss: 3.45664 (rpn score loss: 0.07883 rpn box loss: 0.04745 fast_rcnn class loss: 0.10325 fast_rcnn box loss: 0.43433) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 530 (epoch 1/80): total loss: 3.53231 (rpn score loss: 0.06332 rpn box loss: 0.01996 fast_rcnn class loss: 0.17204 fast_rcnn box loss: 0.45074) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 540 (epoch 1/80): total loss: 3.44095 (rpn score loss: 0.07283 rpn box loss: 0.01560 fast_rcnn class loss: 0.14119 fast_rcnn box loss: 0.42358) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 550 (epoch 1/80): total loss: 3.83361 (rpn score loss: 0.20690 rpn box loss: 0.03243 fast_rcnn class loss: 0.34327 fast_rcnn box loss: 0.45627) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 560 (epoch 1/80): total loss: 3.31315 (rpn score loss: 0.13814 rpn box loss: 0.01340 fast_rcnn class loss: 0.15641 fast_rcnn box loss: 0.22113) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 570 (epoch 1/80): total loss: 3.63777 (rpn score loss: 0.15626 rpn box loss: 0.03240 fast_rcnn class loss: 0.20851 fast_rcnn box loss: 0.34752) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 580 (epoch 1/80): total loss: 3.50107 (rpn score loss: 0.08407 rpn box loss: 0.04541 fast_rcnn class loss: 0.11819 fast_rcnn box loss: 0.45455) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 590 (epoch 1/80): total loss: 3.61385 (rpn score loss: 0.08391 rpn box loss: 0.02316 fast_rcnn class loss: 0.12032 fast_rcnn box loss: 0.56855) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 600 (epoch 1/80): total loss: 4.20732 (rpn score loss: 0.64709 rpn box loss: 0.05505 fast_rcnn class loss: 0.45570 fast_rcnn box loss: 0.20300) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 610 (epoch 1/80): total loss: 3.24434 (rpn score loss: 0.07851 rpn box loss: 0.00819 fast_rcnn class loss: 0.08293 fast_rcnn box loss: 0.38932) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 620 (epoch 1/80): total loss: 3.54244 (rpn score loss: 0.03246 rpn box loss: 0.01119 fast_rcnn class loss: 0.15867 fast_rcnn box loss: 0.55238) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 630 (epoch 1/80): total loss: 3.70553 (rpn score loss: 0.25289 rpn box loss: 0.03375 fast_rcnn class loss: 0.28609 fast_rcnn box loss: 0.26534) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 640 (epoch 1/80): total loss: 3.46454 (rpn score loss: 0.04449 rpn box loss: 0.01826 fast_rcnn class loss: 0.11437 fast_rcnn box loss: 0.56433) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 650 (epoch 1/80): total loss: 3.44091 (rpn score loss: 0.06358 rpn box loss: 0.03270 fast_rcnn class loss: 0.13000 fast_rcnn box loss: 0.47271) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 660 (epoch 1/80): total loss: 3.45321 (rpn score loss: 0.03651 rpn box loss: 0.01053 fast_rcnn class loss: 0.11414 fast_rcnn box loss: 0.47083) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 670 (epoch 1/80): total loss: 3.39485 (rpn score loss: 0.05236 rpn box loss: 0.02159 fast_rcnn class loss: 0.11029 fast_rcnn box loss: 0.44708) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 680 (epoch 1/80): total loss: 3.55119 (rpn score loss: 0.18477 rpn box loss: 0.03453 fast_rcnn class loss: 0.22440 fast_rcnn box loss: 0.38985) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 690 (epoch 1/80): total loss: 3.46719 (rpn score loss: 0.17302 rpn box loss: 0.01830 fast_rcnn class loss: 0.22159 fast_rcnn box loss: 0.35161) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 700 (epoch 1/80): total loss: 3.68496 (rpn score loss: 0.25162 rpn box loss: 0.02782 fast_rcnn class loss: 0.31603 fast_rcnn box loss: 0.35100) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 710 (epoch 1/80): total loss: 3.59235 (rpn score loss: 0.09769 rpn box loss: 0.03143 fast_rcnn class loss: 0.21235 fast_rcnn box loss: 0.50969) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 720 (epoch 1/80): total loss: 3.68763 (rpn score loss: 0.21301 rpn box loss: 0.03312 fast_rcnn class loss: 0.28394 fast_rcnn box loss: 0.42362) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 730 (epoch 1/80): total loss: 3.48760 (rpn score loss: 0.08802 rpn box loss: 0.02152 fast_rcnn class loss: 0.19061 fast_rcnn box loss: 0.48414) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 740 (epoch 1/80): total loss: 3.44055 (rpn score loss: 0.15659 rpn box loss: 0.02653 fast_rcnn class loss: 0.17202 fast_rcnn box loss: 0.37699) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 750 (epoch 1/80): total loss: 3.30072 (rpn score loss: 0.05789 rpn box loss: 0.01760 fast_rcnn class loss: 0.11668 fast_rcnn box loss: 0.39960) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 760 (epoch 1/80): total loss: 3.46550 (rpn score loss: 0.09144 rpn box loss: 0.01609 fast_rcnn class loss: 0.17086 fast_rcnn box loss: 0.45357) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 770 (epoch 1/80): total loss: 3.62472 (rpn score loss: 0.09353 rpn box loss: 0.06809 fast_rcnn class loss: 0.18071 fast_rcnn box loss: 0.46898) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 780 (epoch 1/80): total loss: 3.61292 (rpn score loss: 0.30926 rpn box loss: 0.03466 fast_rcnn class loss: 0.32483 fast_rcnn box loss: 0.16482) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 790 (epoch 1/80): total loss: 3.40686 (rpn score loss: 0.08852 rpn box loss: 0.02564 fast_rcnn class loss: 0.17241 fast_rcnn box loss: 0.41358) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 800 (epoch 1/80): total loss: 3.38520 (rpn score loss: 0.04533 rpn box loss: 0.02320 fast_rcnn class loss: 0.13714 fast_rcnn box loss: 0.40360) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 810 (epoch 1/80): total loss: 3.65710 (rpn score loss: 0.41762 rpn box loss: 0.02824 fast_rcnn class loss: 0.35193 fast_rcnn box loss: 0.13672) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 820 (epoch 1/80): total loss: 4.03538 (rpn score loss: 0.40264 rpn box loss: 0.02181 fast_rcnn class loss: 0.57341 fast_rcnn box loss: 0.29650) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 830 (epoch 1/80): total loss: 3.27442 (rpn score loss: 0.06124 rpn box loss: 0.01841 fast_rcnn class loss: 0.13731 fast_rcnn box loss: 0.42416) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 840 (epoch 1/80): total loss: 3.31723 (rpn score loss: 0.04174 rpn box loss: 0.01203 fast_rcnn class loss: 0.19007 fast_rcnn box loss: 0.40364) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 850 (epoch 1/80): total loss: 3.51194 (rpn score loss: 0.15601 rpn box loss: 0.02985 fast_rcnn class loss: 0.24143 fast_rcnn box loss: 0.37560) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 860 (epoch 1/80): total loss: 3.31428 (rpn score loss: 0.06729 rpn box loss: 0.03045 fast_rcnn class loss: 0.15430 fast_rcnn box loss: 0.40920) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 870 (epoch 1/80): total loss: 3.15262 (rpn score loss: 0.03966 rpn box loss: 0.01307 fast_rcnn class loss: 0.13025 fast_rcnn box loss: 0.34433) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 880 (epoch 1/80): total loss: 3.67152 (rpn score loss: 0.34548 rpn box loss: 0.02869 fast_rcnn class loss: 0.34608 fast_rcnn box loss: 0.13637) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 890 (epoch 1/80): total loss: 3.51194 (rpn score loss: 0.12095 rpn box loss: 0.02022 fast_rcnn class loss: 0.25010 fast_rcnn box loss: 0.40911) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 900 (epoch 1/80): total loss: 3.71499 (rpn score loss: 0.19847 rpn box loss: 0.02536 fast_rcnn class loss: 0.36255 fast_rcnn box loss: 0.36140) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 910 (epoch 1/80): total loss: 3.55288 (rpn score loss: 0.17814 rpn box loss: 0.02503 fast_rcnn class loss: 0.33894 fast_rcnn box loss: 0.31949) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 920 (epoch 1/80): total loss: 3.76785 (rpn score loss: 0.07082 rpn box loss: 0.03941 fast_rcnn class loss: 0.32042 fast_rcnn box loss: 0.50948) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 930 (epoch 1/80): total loss: 3.25494 (rpn score loss: 0.02962 rpn box loss: 0.00551 fast_rcnn class loss: 0.18392 fast_rcnn box loss: 0.38825) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 940 (epoch 1/80): total loss: 3.40572 (rpn score loss: 0.11224 rpn box loss: 0.04393 fast_rcnn class loss: 0.26890 fast_rcnn box loss: 0.36776) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 950 (epoch 1/80): total loss: 3.28233 (rpn score loss: 0.07506 rpn box loss: 0.01128 fast_rcnn class loss: 0.20046 fast_rcnn box loss: 0.37531) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 960 (epoch 1/80): total loss: 3.19567 (rpn score loss: 0.05232 rpn box loss: 0.02162 fast_rcnn class loss: 0.13239 fast_rcnn box loss: 0.35097) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 970 (epoch 1/80): total loss: 3.49493 (rpn score loss: 0.05453 rpn box loss: 0.01360 fast_rcnn class loss: 0.31856 fast_rcnn box loss: 0.42678) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 980 (epoch 1/80): total loss: 3.46868 (rpn score loss: 0.03224 rpn box loss: 0.03284 fast_rcnn class loss: 0.21213 fast_rcnn box loss: 0.54326) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 990 (epoch 1/80): total loss: 3.32783 (rpn score loss: 0.08340 rpn box loss: 0.02152 fast_rcnn class loss: 0.23354 fast_rcnn box loss: 0.32734) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 1000 (epoch 1/80): total loss: 3.45508 (rpn score loss: 0.17086 rpn box loss: 0.04418 fast_rcnn class loss: 0.19795 fast_rcnn box loss: 0.35976) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 1010 (epoch 1/80): total loss: 3.11894 (rpn score loss: 0.04871 rpn box loss: 0.00996 fast_rcnn class loss: 0.17435 fast_rcnn box loss: 0.31044) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 1020 (epoch 1/80): total loss: 3.16048 (rpn score loss: 0.09671 rpn box loss: 0.01771 fast_rcnn class loss: 0.14560 fast_rcnn box loss: 0.27713) learning rate: 0.00100
[INFO] None
[MaskRCNN] INFO    : Epoch 1/80: loss: 3.11241 learning rate: 0.00100 Time taken: 0:08:11.191832 ETA: 10:46:44.154752
[MaskRCNN] INFO    : Saving checkpoints for epoch 1 into /workspace/fyp/experiments/maskrcnn_fyp/model.epoch-1.tlt.
INFO:tensorflow:Loss for final step: 3.3392217.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :     Start evaluation cycle 01
[MaskRCNN] INFO    : =================================
    
[MaskRCNN] INFO    : [eval] AMP is activated - Experiment Feature
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpd4brfspr', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
    auto_mixed_precision: ON
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc679e89310>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Loading weights from /workspace/fyp/experiments/maskrcnn_fyp/model.epoch-1.tlt
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
[MaskRCNN] INFO    : [*] Limiting the amount of sample to: 84
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO    : [Inference Compute Statistics] 504.3 GFLOPS/image
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4brfspr/model.ckpt-1024
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[MaskRCNN] INFO    : Running inference on batch 001/042... -                Step Time: 7.9444s - Throughput: 0.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 002/042... -                Step Time: 0.0685s - Throughput: 29.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 003/042... -                Step Time: 0.0664s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 004/042... -                Step Time: 0.0662s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 005/042... -                Step Time: 0.0659s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 006/042... -                Step Time: 0.0665s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 007/042... -                Step Time: 0.0660s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 008/042... -                Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 009/042... -                Step Time: 0.0659s - Throughput: 30.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 010/042... -                Step Time: 0.0655s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 011/042... -                Step Time: 0.0670s - Throughput: 29.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 012/042... -                Step Time: 0.0660s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 013/042... -                Step Time: 0.0669s - Throughput: 29.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 014/042... -                Step Time: 0.0657s - Throughput: 30.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 015/042... -                Step Time: 0.0660s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 016/042... -                Step Time: 0.0658s - Throughput: 30.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 017/042... -                Step Time: 0.0662s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 018/042... -                Step Time: 0.0661s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 019/042... -                Step Time: 0.0678s - Throughput: 29.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 020/042... -                Step Time: 0.0765s - Throughput: 26.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 021/042... -                Step Time: 0.0755s - Throughput: 26.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 022/042... -                Step Time: 0.0747s - Throughput: 26.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 023/042... -                Step Time: 0.1209s - Throughput: 16.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 024/042... -                Step Time: 0.0722s - Throughput: 27.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 025/042... -                Step Time: 0.0696s - Throughput: 28.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 026/042... -                Step Time: 0.0859s - Throughput: 23.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 027/042... -                Step Time: 0.0728s - Throughput: 27.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 028/042... -                Step Time: 0.0706s - Throughput: 28.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 029/042... -                Step Time: 0.0763s - Throughput: 26.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 030/042... -                Step Time: 0.0704s - Throughput: 28.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 031/042... -                Step Time: 0.0661s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 032/042... -                Step Time: 0.0644s - Throughput: 31.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 033/042... -                Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 034/042... -                Step Time: 0.0664s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 035/042... -                Step Time: 0.0664s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 036/042... -                Step Time: 0.0661s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 037/042... -                Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 038/042... -                Step Time: 0.0655s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 039/042... -                Step Time: 0.0663s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 040/042... -                Step Time: 0.0656s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 041/042... -                Step Time: 0.0661s - Throughput: 30.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 042/042... -                Step Time: 0.0656s - Throughput: 30.5 imgs/s
[MaskRCNN] INFO    : Loading and preparing results...
[MaskRCNN] INFO    : 0/8400
[MaskRCNN] INFO    : 1000/8400
[MaskRCNN] INFO    : 2000/8400
[MaskRCNN] INFO    : 3000/8400
[MaskRCNN] INFO    : 4000/8400
[MaskRCNN] INFO    : 5000/8400
[MaskRCNN] INFO    : 6000/8400
[MaskRCNN] INFO    : 7000/8400
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=1.95s).
Accumulating evaluation results...
DONE (t=0.04s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.057
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.172
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.020
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.034
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.115
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.023
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.084
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.157
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.117
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.266
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=1.96s).
Accumulating evaluation results...
DONE (t=0.04s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.059
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.165
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.019
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.039
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.116
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.025
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.080
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.140
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.003
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.106
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.232

[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO    :          Evaluation Performance Summary          
[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

[MaskRCNN] INFO    : Average throughput: -1.0         samples/sec
[MaskRCNN] INFO    : Total processed steps:         42
[MaskRCNN] INFO    : Total processing time: 0.0h 19m 52s
[MaskRCNN] INFO    : ==================== Metrics ====================
[MaskRCNN] INFO    : AP: 0.056989938
[MaskRCNN] INFO    : AP50: 0.172287852
[MaskRCNN] INFO    : AP75: 0.019901801
[MaskRCNN] INFO    : APl: 0.115367487
[MaskRCNN] INFO    : APm: 0.033995900
[MaskRCNN] INFO    : APs: 0.000000000
[MaskRCNN] INFO    : ARl: 0.266374260
[MaskRCNN] INFO    : ARm: 0.116838045
[MaskRCNN] INFO    : ARmax1: 0.023229707
[MaskRCNN] INFO    : ARmax10: 0.084196888
[MaskRCNN] INFO    : ARmax100: 0.157167524
[MaskRCNN] INFO    : ARs: 0.000000000
[MaskRCNN] INFO    : mask_AP: 0.059233427
[MaskRCNN] INFO    : mask_AP50: 0.165124863
[MaskRCNN] INFO    : mask_AP75: 0.019104565
[MaskRCNN] INFO    : mask_APl: 0.115515165
[MaskRCNN] INFO    : mask_APm: 0.038957115
[MaskRCNN] INFO    : mask_APs: 0.000495050
[MaskRCNN] INFO    : mask_ARl: 0.231871352
[MaskRCNN] INFO    : mask_ARm: 0.105784059
[MaskRCNN] INFO    : mask_ARmax1: 0.024784110
[MaskRCNN] INFO    : mask_ARmax10: 0.079533681
[MaskRCNN] INFO    : mask_ARmax100: 0.139637306
[MaskRCNN] INFO    : mask_ARs: 0.002631579

[INFO] Evaluation metrics generated.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 02
[MaskRCNN] INFO    : =================================
    
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 516.6 GFLOPS/image
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpd4brfspr/model.ckpt-1024
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Invalid URL '': No scheme supplied. Perhaps you meant https://?
Execution status: FAIL
root@fe2d81786c8a:/workspace# 

Sometimes it also shows:

SSLCertVerificationError: certificate verify failed (telemetry.metropolis.nvidia.com)
Execution status: FAIL

These errors interrupt training immediately.

I already set:

TAO_DISABLE_TELEMETRY=1

but TAO TF1 still tries to contact the telemetry server, and the run fails.


Questions

  1. Is this a known issue with TAO Toolkit 5.0.0-TF1?

  2. Is there an officially supported way to fully disable telemetry in the TF1 container?

  3. Why does TAO still fail even when telemetry is disabled?


Additional Information

The folder structure is correct and exists inside the container:

/workspace/fyp/specs/maskrcnn_train.prototxt
/workspace/fyp/experiments/maskrcnn_fyp

The same project worked before but stopped working once the telemetry error started.

GPU is available and recognized (nvidia-smi works inside container).


Thanks for any help; this is blocking my training.

You can ignore the telemetry warning. The training interruption is most likely caused by something else; I suspect an out-of-memory (OOM) condition. Please try to narrow it down by:

  • setting a lower batch size
  • setting a smaller input width/height
  • using a smaller subset of the dataset
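
To check the out-of-memory hypothesis while changing those settings, GPU memory can be sampled from inside the container during training. The following is only a minimal sketch: the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration, and it assumes `nvidia-smi` is on the PATH (the original post says it works inside the container). Note that TensorFlow 1.x may reserve most of the GPU memory at start-up depending on how the session is configured, in which case the reading stays flat and says little about real usage.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) of every visible GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_gpu_memory(output)


# Hypothetical sample line, only to show the expected CSV shape (values are made up).
sample = "8123, 11264\n"
for index, (used, total) in enumerate(parse_gpu_memory(sample)):
    print(f"GPU {index}: {used}/{total} MiB ({100.0 * used / total:.1f}% used)")
```

Calling `query_gpu_memory()` in a loop from a second shell in the same container, for example once per second, would show whether usage climbs toward the card's limit just before the process exits.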

This is my prototxt file:

seed: 123
use_amp: True
warmup_steps: 500
checkpoint: "/workspace/fyp/ngc_files/pretrained_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[27000,54000,68000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.002]"
total_steps: 81680
num_epochs: 80
train_batch_size: 2
eval_batch_size: 2
num_steps_per_eval: 1021
momentum: 0.9
l2_weight_decay: 0.0001
l1_weight_decay: 0.0
warmup_learning_rate: 0.0001
init_learning_rate: 0.001
num_examples_per_epoch: 2042

data_config {
image_size: "(1024, 1024)"
augment_input_data: False
eval_samples: 84

training_file_pattern: "/workspace/fyp/tfrecords/train/*.tfrecord"
validation_file_pattern: "/workspace/fyp/tfrecords/val/*.tfrecord"
val_json_file: "/workspace/fyp/final_val.json"

num_classes: 2
skip_crowd_during_training: False
max_num_instances: 200

}
maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: False
freeze_blocks: ""
gt_mask_size: 112

# RPN
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.

# Proposal layer
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.

# Detection heads
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"
include_mask: True
mrcnn_resolution: 28

# Training proposals
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7

# Evaluation proposals
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000

# FPN
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8

# Loss weights
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0

}
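
As a quick sanity check on the step counts in this spec, the relationship between `num_examples_per_epoch`, `train_batch_size`, `num_steps_per_eval`, and `total_steps` can be worked out by hand. The sketch below assumes that one training step consumes `train_batch_size` images per GPU and that one evaluation is intended per epoch; the helper name `steps_per_epoch` is made up for illustration and is not part of TAO.

```python
import math


def steps_per_epoch(num_examples_per_epoch, train_batch_size, num_gpus=1):
    """Steps needed to see every example once (assumes batch size is per GPU)."""
    return math.ceil(num_examples_per_epoch / (train_batch_size * num_gpus))


# Values copied from the spec above.
num_examples_per_epoch = 2042
num_epochs = 80

for batch_size in (2, 1):
    per_epoch = steps_per_epoch(num_examples_per_epoch, batch_size)
    total = per_epoch * num_epochs
    print(f"batch_size={batch_size}: {per_epoch} steps/epoch, {total} total steps")
```

With `train_batch_size: 2` this reproduces the spec's `num_steps_per_eval: 1021` and `total_steps: 81680`. With a batch size of 1 the same arithmetic gives 2042 steps per epoch and 163360 total steps, so step-based settings such as `learning_rate_steps` would need to be rescaled as well if the batch size changes.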

Note that I am using a batch size of 2 on the RTX 2080 Ti machine. I have also tried batch_size = 1, but the problem persisted (it trains one epoch, evaluates, starts the next epoch, and fails).

Please try a lower image_size to narrow this down.

Also try a smaller subset of the dataset.

I have set the image size to (640, 640) and batch_size=1. Training now runs for about 12 epochs before failing. Each time it stops I restart it, but the metrics do not seem to improve (I have now reached epoch 34, where it stopped again, and the AP is still 0.16).

Here is the output from the last epoch before it fails:

[MaskRCNN] INFO    : Epoch 33/80: loss: 2.23017 learning rate: 0.00002 Time taken: 0:05:45.959978 ETA: 4:31:00.118971
[MaskRCNN] INFO    : Saving checkpoints for epoch 33 into /workspace/fyp/experiments/experiments_fresh/model.epoch-33.tlt.
INFO:tensorflow:Loss for final step: 2.230174.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :     Start evaluation cycle 33
[MaskRCNN] INFO    : =================================

[MaskRCNN] INFO    : Loading weights from /workspace/fyp/experiments/experiments_fresh/model.epoch-33.tlt
loading annotations into memory…
Done (t=0.02s)
creating index…
index created!
[MaskRCNN] INFO    : [*] Limiting the amount of sample to: 84
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph…
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs…
[MaskRCNN] INFO    : [Inference Compute Statistics] 270.5 GFLOPS/image
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp53k0atok/model.ckpt-67386
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[MaskRCNN] INFO    : Running inference on batch 001/084… -                Step Time: 5.1304s - Throughput: 0.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 002/084… -                Step Time: 0.0331s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 003/084… -                Step Time: 0.0323s - Throughput: 31.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 004/084… -                Step Time: 0.0323s - Throughput: 30.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 005/084… -                Step Time: 0.0294s - Throughput: 34.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 006/084… -                Step Time: 0.0265s - Throughput: 37.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 007/084… -                Step Time: 0.0266s - Throughput: 37.6 imgs/s
[MaskRCNN] INFO    : Running inference on batch 008/084… -                Step Time: 0.0296s - Throughput: 33.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 009/084… -                Step Time: 0.0279s - Throughput: 35.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 010/084… -                Step Time: 0.0264s - Throughput: 37.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 011/084… -                Step Time: 0.0268s - Throughput: 37.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 012/084… -                Step Time: 0.0269s - Throughput: 37.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 013/084… -                Step Time: 0.0265s - Throughput: 37.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 014/084… -                Step Time: 0.0265s - Throughput: 37.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 015/084… -                Step Time: 0.0359s - Throughput: 27.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 016/084… -                Step Time: 0.0274s - Throughput: 36.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 017/084… -                Step Time: 0.0265s - Throughput: 37.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 018/084… -                Step Time: 0.0264s - Throughput: 37.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 019/084… -                Step Time: 0.0273s - Throughput: 36.6 imgs/s
[MaskRCNN] INFO    : Running inference on batch 020/084… -                Step Time: 0.0264s - Throughput: 37.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 021/084… -                Step Time: 0.0267s - Throughput: 37.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 022/084… -                Step Time: 0.0284s - Throughput: 35.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 023/084… -                Step Time: 0.0281s - Throughput: 35.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 024/084… -                Step Time: 0.0274s - Throughput: 36.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 025/084… -                Step Time: 0.0266s - Throughput: 37.6 imgs/s
[MaskRCNN] INFO    : Running inference on batch 026/084… -                Step Time: 0.0264s - Throughput: 37.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 027/084… -                Step Time: 0.0267s - Throughput: 37.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 028/084… -                Step Time: 0.0275s - Throughput: 36.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 029/084… -                Step Time: 0.0275s - Throughput: 36.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 030/084… -                Step Time: 0.0275s - Throughput: 36.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 031/084… -                Step Time: 0.0270s - Throughput: 37.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 032/084… -                Step Time: 0.0266s - Throughput: 37.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 033/084… -                Step Time: 0.0263s - Throughput: 38.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 034/084… -                Step Time: 0.0268s - Throughput: 37.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 035/084… -                Step Time: 0.0265s - Throughput: 37.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 036/084… -                Step Time: 0.0264s - Throughput: 37.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 037/084… -                Step Time: 0.0277s - Throughput: 36.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 038/084… -                Step Time: 0.0274s - Throughput: 36.6 imgs/s
[MaskRCNN] INFO    : Running inference on batch 039/084… -                Step Time: 0.0272s - Throughput: 36.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 040/084… -                Step Time: 0.0265s - Throughput: 37.7 imgs/s
[MaskRCNN] INFO    : Running inference on batch 041/084… -                Step Time: 0.0261s - Throughput: 38.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 042/084… -                Step Time: 0.0265s - Throughput: 37.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 043/084… -                Step Time: 0.0270s - Throughput: 37.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 044/084… -                Step Time: 0.0277s - Throughput: 36.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 045/084… -                Step Time: 0.0283s - Throughput: 35.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 046/084… -                Step Time: 0.0271s - Throughput: 36.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 047/084… -                Step Time: 0.0292s - Throughput: 34.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 048/084… -                Step Time: 0.0319s - Throughput: 31.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 049/084… -                Step Time: 0.0419s - Throughput: 23.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 050/084… -                Step Time: 0.0402s - Throughput: 24.9 imgs/s
[MaskRCNN] INFO    : Running inference on batch 051/084… -                Step Time: 0.0388s - Throughput: 25.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 052/084… -                Step Time: 0.0310s - Throughput: 32.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 053/084… -                Step Time: 0.0447s - Throughput: 22.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 054/084… -                Step Time: 0.0703s - Throughput: 14.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 055/084… -                Step Time: 0.0409s - Throughput: 24.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 056/084… -                Step Time: 0.0332s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 057/084… -                Step Time: 0.0290s - Throughput: 34.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 058/084… -                Step Time: 0.0349s - Throughput: 28.6 imgs/s
[MaskRCNN] INFO    : Running inference on batch 059/084… -                Step Time: 0.0333s - Throughput: 30.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 060/084… -                Step Time: 0.0320s - Throughput: 31.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 061/084… -                Step Time: 0.0301s - Throughput: 33.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 062/084… -                Step Time: 0.0327s - Throughput: 30.6 imgs/s
[MaskRCNN] INFO    : Running inference on batch 063/084… -                Step Time: 0.0426s - Throughput: 23.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 064/084… -                Step Time: 0.0347s - Throughput: 28.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 065/084… -                Step Time: 0.0302s - Throughput: 33.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 066/084… -                Step Time: 0.0331s - Throughput: 30.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 067/084… -                Step Time: 0.0481s - Throughput: 20.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 068/084… -                Step Time: 0.0321s - Throughput: 31.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 069/084… -                Step Time: 0.0308s - Throughput: 32.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 070/084… -                Step Time: 0.0287s - Throughput: 34.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 071/084… -                Step Time: 0.0269s - Throughput: 37.2 imgs/s
[MaskRCNN] INFO    : Running inference on batch 072/084… -                Step Time: 0.0268s - Throughput: 37.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 073/084… -                Step Time: 0.0268s - Throughput: 37.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 074/084… -                Step Time: 0.0261s - Throughput: 38.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 075/084… -                Step Time: 0.0303s - Throughput: 33.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 076/084… -                Step Time: 0.0264s - Throughput: 37.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 077/084… -                Step Time: 0.0279s - Throughput: 35.8 imgs/s
[MaskRCNN] INFO    : Running inference on batch 078/084… -                Step Time: 0.0270s - Throughput: 37.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 079/084… -                Step Time: 0.0268s - Throughput: 37.3 imgs/s
[MaskRCNN] INFO    : Running inference on batch 080/084… -                Step Time: 0.0277s - Throughput: 36.1 imgs/s
[MaskRCNN] INFO    : Running inference on batch 081/084… -                Step Time: 0.0270s - Throughput: 37.0 imgs/s
[MaskRCNN] INFO    : Running inference on batch 082/084… -                Step Time: 0.0267s - Throughput: 37.4 imgs/s
[MaskRCNN] INFO    : Running inference on batch 083/084… -                Step Time: 0.0266s - Throughput: 37.5 imgs/s
[MaskRCNN] INFO    : Running inference on batch 084/084… -                Step Time: 0.0264s - Throughput: 37.9 imgs/s
[MaskRCNN] INFO    : Loading and preparing results…
[MaskRCNN] INFO    : 0/8400
[MaskRCNN] INFO    : 1000/8400
[MaskRCNN] INFO    : 2000/8400
[MaskRCNN] INFO    : 3000/8400
[MaskRCNN] INFO    : 4000/8400
[MaskRCNN] INFO    : 5000/8400
[MaskRCNN] INFO    : 6000/8400
[MaskRCNN] INFO    : 7000/8400
[MaskRCNN] INFO    : 8000/8400
creating index…
index created!
Running per image evaluation…
Evaluate annotation type bbox
DONE (t=2.38s).
Accumulating evaluation results…
DONE (t=0.05s).
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.167
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.392
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.113
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.121
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.278
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.042
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.151
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.274
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.242
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.378
Running per image evaluation…
Evaluate annotation type segm
DONE (t=2.50s).
Accumulating evaluation results…
DONE (t=0.04s).
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.152
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.379
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.100
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.112
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.250
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.038
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.137
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.234
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.206
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.323

[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO    :          Evaluation Performance Summary
[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

[MaskRCNN] INFO    : Average throughput: -1.0         samples/sec
[MaskRCNN] INFO    : Total processed steps:         84
[MaskRCNN] INFO    : Total processing time: 0.0h 47m 17s
[MaskRCNN] INFO    : ==================== Metrics ====================
[MaskRCNN] INFO    : AP: 0.166794404
[MaskRCNN] INFO    : AP50: 0.392264277
[MaskRCNN] INFO    : AP75: 0.113416187
[MaskRCNN] INFO    : APl: 0.278367639
[MaskRCNN] INFO    : APm: 0.120881885
[MaskRCNN] INFO    : APs: 0.000000000
[MaskRCNN] INFO    : ARl: 0.378362566
[MaskRCNN] INFO    : ARm: 0.241902307
[MaskRCNN] INFO    : ARmax1: 0.042055268
[MaskRCNN] INFO    : ARmax10: 0.150949910
[MaskRCNN] INFO    : ARmax100: 0.274265975
[MaskRCNN] INFO    : ARs: 0.000000000
[MaskRCNN] INFO    : mask_AP: 0.151974469
[MaskRCNN] INFO    : mask_AP50: 0.378736854
[MaskRCNN] INFO    : mask_AP75: 0.100144528
[MaskRCNN] INFO    : mask_APl: 0.250128537
[MaskRCNN] INFO    : mask_APm: 0.111917265
[MaskRCNN] INFO    : mask_APs: 0.000000000
[MaskRCNN] INFO    : mask_ARl: 0.322514623
[MaskRCNN] INFO    : mask_ARm: 0.205784068
[MaskRCNN] INFO    : mask_ARmax1: 0.037996545
[MaskRCNN] INFO    : mask_ARmax10: 0.136701211
[MaskRCNN] INFO    : mask_ARmax100: 0.233506039
[MaskRCNN] INFO    : mask_ARs: 0.000000000

[INFO] Evaluation metrics generated.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 34
[MaskRCNN] INFO    : =================================

WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation tf.image.convert_image_dtype will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph…
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS… Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs…
[MaskRCNN] INFO    : [Training Compute Statistics] 282.9 GFLOPS/image
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp53k0atok/model.ckpt-67386
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: HTTPSConnectionPool(host='telemetry.metropolis.nvidia.com', port=443): Max retries exceeded with url: /api/v1/telemetry (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)')))
Execution status: FAIL
root@f5b8666e1207:/workspace#

I am not sure what the baseline is for your training dataset.
Also, the training hyperparameters play an important role in the AP. Input size, number of epochs, and similar settings can all affect the result.

A long time ago, I shared a spec file for training on the COCO dataset; see "Poor metric results after retraining maskrcnn using TLT notebook - #13 by Morganh".
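
One hyperparameter interaction that can be checked directly from the numbers in this thread is the step-based learning-rate schedule. The sketch below assumes a piecewise-constant schedule in which each entry of `learning_rate_decay_levels` multiplies `init_learning_rate` once the matching entry of `learning_rate_steps` is reached, and it ignores warmup; that is an assumption about how the spec is interpreted, and `learning_rate_at` is a hypothetical helper, not a TAO function.

```python
import bisect


def learning_rate_at(step, init_lr, boundaries, decay_levels):
    """Piecewise-constant schedule (warmup ignored).

    Before the first boundary the rate is init_lr; once boundary i is
    reached the rate becomes init_lr * decay_levels[i].
    """
    index = bisect.bisect_right(boundaries, step)
    if index == 0:
        return init_lr
    return init_lr * decay_levels[index - 1]


# Values copied from the spec posted earlier in the thread.
init_lr = 0.001
boundaries = [27000, 54000, 68000]
decay_levels = [0.1, 0.02, 0.002]

# Step of the checkpoint restored in the epoch-33 log (model.ckpt-67386).
print(f"{learning_rate_at(67386, init_lr, boundaries, decay_levels):.5f}")

# Epoch at which each decay boundary is reached for two step counts per epoch.
for per_epoch in (1021, 2042):
    epochs = [f"{boundary / per_epoch:.1f}" for boundary in boundaries]
    print(f"{per_epoch} steps/epoch -> decay at epochs {epochs}")
```

Under these assumptions the rate at step 67386 agrees with the `learning rate: 0.00002` printed in the epoch-33 log, which suggests `learning_rate_steps` was left at its batch-size-2 values. At 2042 steps per epoch the final decay is then reached shortly after epoch 33 of 80, so rescaling the boundaries together with the batch size may be worth trying before judging the AP.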

Deal, thank you so much, I will look for it. Concerning the failure itself, are there any recommendations or solutions to follow?

Please try a smaller subset of the dataset to narrow down the OOM issue. Also try a machine with more GPU memory.