Mask R-CNN hangs during training using custom made tfrecords

mohsen.zardadi · April 13, 2021, 5:59pm

I am trying to train Mask R-CNN using the 'TLT MaskRCNN example use case', and the Jupyter notebook hangs during training.
There is no error message or warning. I downloaded the training and validation instance segmentation tfrecords from CVAT.

I am using this docker container: nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

I have tested the container with the COCO dataset and it works fine, but switching to the new dataset freezes the Jupyter notebook during training.
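One quick check before training on a custom export is whether the tfrecord file is readable and non-empty. The following is a minimal sketch, not part of TLT, that counts records using only the Python standard library. It assumes the standard TFRecord framing (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload). It does not verify the CRCs or decode any features, and the name `count_tfrecord_records` is made up for illustration.

```python
import struct


def count_tfrecord_records(path):
    """Count records by walking the TFRecord framing.

    Payloads are read but not decoded, and CRCs are not verified.
    Raises ValueError if the file ends in the middle of a record.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            # Header: uint64 payload length + uint32 CRC of that length.
            header = f.read(12)
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError(
                    "truncated record header after %d records" % count)
            (length,) = struct.unpack("<Q", header[:8])
            # Body: payload bytes + uint32 CRC of the payload.
            body = f.read(length + 4)
            if len(body) < length + 4:
                raise ValueError(
                    "truncated record payload after %d records" % count)
            count += 1
    return count
```

Calling it on the training tfrecord exported from CVAT should return the number of images you expect. Zero records or a `ValueError` would point at the export rather than at the training code.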

For multi-GPU, change --gpus based on your machine.
2021-04-12 20:36:35.245585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:35.281905: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Horovod successfully initialized …
[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================

[MaskRCNN] INFO : Using Dataset Sharding with Horovod
2021-04-12 20:36:46.176423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-12 20:36:46.214550: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-04-12 20:36:46.224296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-04-12 20:36:46.224364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:46.225829: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:36:46.227101: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-12 20:36:46.227439: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-12 20:36:46.230000: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-12 20:36:46.231154: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-12 20:36:46.235485: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-12 20:36:46.244740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-04-12 20:36:46.255034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-04-12 20:36:46.255105: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:46.256642: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:36:46.258116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-12 20:36:46.258464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-12 20:36:46.260074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-12 20:36:46.261275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-12 20:36:46.264812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-12 20:36:46.267281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_6/
2021-04-12 20:36:50.084480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-04-12 20:36:50.084583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:50.084743: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:36:50.084794: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-12 20:36:50.084846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-12 20:36:50.084881: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-12 20:36:50.084908: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-12 20:36:50.084936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-12 20:36:50.087958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-04-12 20:36:50.087998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:50.201681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-04-12 20:36:50.201912: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:50.202349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:36:50.202396: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-12 20:36:50.202434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-12 20:36:50.202468: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-12 20:36:50.202503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-12 20:36:50.202538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-12 20:36:50.206095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-04-12 20:36:50.206146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:50.587207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-12 20:36:50.587283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-04-12 20:36:50.587292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-04-12 20:36:50.591190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22514 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:0a:00.0, compute capability: 7.5)
2021-04-12 20:36:50.637736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-12 20:36:50.637790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-04-12 20:36:50.637798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-04-12 20:36:50.641388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 19290 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 308.2 GFLOPS/image
2021-04-12 20:36:58.019263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-04-12 20:36:58.019385: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:36:58.019514: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:36:58.019540: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-12 20:36:58.019561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-12 20:36:58.019583: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-12 20:36:58.019603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-12 20:36:58.019624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-12 20:36:58.020916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-04-12 20:36:58.020977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-12 20:36:58.020986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-04-12 20:36:58.020991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-04-12 20:36:58.022033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 19290 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)
2021-04-12 20:37:01.515616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-04-12 20:37:01.515730: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-04-12 20:37:01.515828: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:37:01.515854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-04-12 20:37:01.515873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-04-12 20:37:01.515891: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-04-12 20:37:01.515909: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-04-12 20:37:01.515928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-04-12 20:37:01.516951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-04-12 20:37:01.517002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-12 20:37:01.517010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-04-12 20:37:01.517016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-04-12 20:37:01.518064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22514 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:0a:00.0, compute capability: 7.5)
2021-04-12 20:37:03.052306: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2021-04-12 20:37:06.632883: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
fatal: Not a git repository (or any parent up to mount point /workspace/server)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /workspace/server)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[MaskRCNN] INFO : ============================ GIT REPOSITORY ============================
[MaskRCNN] INFO : BRANCH NAME:
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[MaskRCNN] INFO : ============================ MODEL STATISTICS ===========================
[MaskRCNN] INFO : # Model Weights: 44,023,253
[MaskRCNN] INFO : # Trainable Weights: 43,970,133
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[MaskRCNN] INFO : ============================ TRAINABLE VARIABLES ========================
[MaskRCNN] INFO : [#0001] conv1/kernel:0 => (7, 7, 3, 64)
[MaskRCNN] INFO : [#0002] bn_conv1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0003] bn_conv1/beta:0 => (64,)
[MaskRCNN] INFO : [#0004] block_1a_conv_1/kernel:0 => (1, 1, 64, 64)
[MaskRCNN] INFO : [#0005] block_1a_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0006] block_1a_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0007] block_1a_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0008] block_1a_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0009] block_1a_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0010] block_1a_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0011] block_1a_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0012] block_1a_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0013] block_1a_conv_shortcut/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0014] block_1a_bn_shortcut/gamma:0 => (256,)
[MaskRCNN] INFO : [#0015] block_1a_bn_shortcut/beta:0 => (256,)
[MaskRCNN] INFO : [#0016] block_1b_conv_1/kernel:0 => (1, 1, 256, 64)
[MaskRCNN] INFO : [#0017] block_1b_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0018] block_1b_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0019] block_1b_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0020] block_1b_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0021] block_1b_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0022] block_1b_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0023] block_1b_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0024] block_1b_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0025] block_1c_conv_1/kernel:0 => (1, 1, 256, 64)
[MaskRCNN] INFO : [#0026] block_1c_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0027] block_1c_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0028] block_1c_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0029] block_1c_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0030] block_1c_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0031] block_1c_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0032] block_1c_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0033] block_1c_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0034] block_2a_conv_1/kernel:0 => (1, 1, 256, 128)
[MaskRCNN] INFO : [#0035] block_2a_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0036] block_2a_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0037] block_2a_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0038] block_2a_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0039] block_2a_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0040] block_2a_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0041] block_2a_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0042] block_2a_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0043] block_2a_conv_shortcut/kernel:0 => (1, 1, 256, 512)
[MaskRCNN] INFO : [#0044] block_2a_bn_shortcut/gamma:0 => (512,)
[MaskRCNN] INFO : [#0045] block_2a_bn_shortcut/beta:0 => (512,)
[MaskRCNN] INFO : [#0046] block_2b_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0047] block_2b_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0048] block_2b_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0049] block_2b_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0050] block_2b_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0051] block_2b_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0052] block_2b_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0053] block_2b_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0054] block_2b_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0055] block_2c_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0056] block_2c_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0057] block_2c_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0058] block_2c_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0059] block_2c_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0060] block_2c_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0061] block_2c_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0062] block_2c_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0063] block_2c_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0064] block_2d_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0065] block_2d_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0066] block_2d_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0067] block_2d_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0068] block_2d_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0069] block_2d_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0070] block_2d_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0071] block_2d_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0072] block_2d_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0073] block_3a_conv_1/kernel:0 => (1, 1, 512, 256)
[MaskRCNN] INFO : [#0074] block_3a_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0075] block_3a_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0076] block_3a_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0077] block_3a_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0078] block_3a_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0079] block_3a_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0080] block_3a_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0081] block_3a_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0082] block_3a_conv_shortcut/kernel:0 => (1, 1, 512, 1024)
[MaskRCNN] INFO : [#0083] block_3a_bn_shortcut/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0084] block_3a_bn_shortcut/beta:0 => (1024,)
[MaskRCNN] INFO : [#0085] block_3b_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0086] block_3b_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0087] block_3b_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0088] block_3b_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0089] block_3b_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0090] block_3b_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0091] block_3b_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0092] block_3b_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0093] block_3b_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0094] block_3c_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0095] block_3c_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0096] block_3c_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0097] block_3c_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0098] block_3c_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0099] block_3c_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0100] block_3c_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0101] block_3c_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0102] block_3c_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0103] block_3d_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0104] block_3d_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0105] block_3d_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0106] block_3d_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0107] block_3d_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0108] block_3d_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0109] block_3d_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0110] block_3d_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0111] block_3d_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0112] block_3e_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0113] block_3e_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0114] block_3e_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0115] block_3e_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0116] block_3e_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0117] block_3e_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0118] block_3e_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0119] block_3e_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0120] block_3e_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0121] block_3f_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0122] block_3f_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0123] block_3f_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0124] block_3f_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0125] block_3f_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0126] block_3f_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0127] block_3f_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0128] block_3f_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0129] block_3f_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0130] block_4a_conv_1/kernel:0 => (1, 1, 1024, 512)
[MaskRCNN] INFO : [#0131] block_4a_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0132] block_4a_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0133] block_4a_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0134] block_4a_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0135] block_4a_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0136] block_4a_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0137] block_4a_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0138] block_4a_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0139] block_4a_conv_shortcut/kernel:0 => (1, 1, 1024, 2048)
[MaskRCNN] INFO : [#0140] block_4a_bn_shortcut/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0141] block_4a_bn_shortcut/beta:0 => (2048,)
[MaskRCNN] INFO : [#0142] block_4b_conv_1/kernel:0 => (1, 1, 2048, 512)
[MaskRCNN] INFO : [#0143] block_4b_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0144] block_4b_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0145] block_4b_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0146] block_4b_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0147] block_4b_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0148] block_4b_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0149] block_4b_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0150] block_4b_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0151] block_4c_conv_1/kernel:0 => (1, 1, 2048, 512)
[MaskRCNN] INFO : [#0152] block_4c_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0153] block_4c_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0154] block_4c_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0155] block_4c_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0156] block_4c_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0157] block_4c_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0158] block_4c_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0159] block_4c_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0160] fpn/l2/kernel:0 => (1, 1, 256, 256)
[MaskRCNN] INFO : [#0161] fpn/l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0162] fpn/l3/kernel:0 => (1, 1, 512, 256)
[MaskRCNN] INFO : [#0163] fpn/l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0164] fpn/l4/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0165] fpn/l4/bias:0 => (256,)
[MaskRCNN] INFO : [#0166] fpn/l5/kernel:0 => (1, 1, 2048, 256)
[MaskRCNN] INFO : [#0167] fpn/l5/bias:0 => (256,)
[MaskRCNN] INFO : [#0168] fpn/post_hoc_d2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0169] fpn/post_hoc_d2/bias:0 => (256,)
[MaskRCNN] INFO : [#0170] fpn/post_hoc_d3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0171] fpn/post_hoc_d3/bias:0 => (256,)
[MaskRCNN] INFO : [#0172] fpn/post_hoc_d4/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0173] fpn/post_hoc_d4/bias:0 => (256,)
[MaskRCNN] INFO : [#0174] fpn/post_hoc_d5/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0175] fpn/post_hoc_d5/bias:0 => (256,)
[MaskRCNN] INFO : [#0176] rpn_head/rpn/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0177] rpn_head/rpn/bias:0 => (256,)
[MaskRCNN] INFO : [#0178] rpn_head/rpn-class/kernel:0 => (1, 1, 256, 3)
[MaskRCNN] INFO : [#0179] rpn_head/rpn-class/bias:0 => (3,)
[MaskRCNN] INFO : [#0180] rpn_head/rpn-box/kernel:0 => (1, 1, 256, 12)
[MaskRCNN] INFO : [#0181] rpn_head/rpn-box/bias:0 => (12,)
[MaskRCNN] INFO : [#0182] box_head/fc6/kernel:0 => (12544, 1024)
[MaskRCNN] INFO : [#0183] box_head/fc6/bias:0 => (1024,)
[MaskRCNN] INFO : [#0184] box_head/fc7/kernel:0 => (1024, 1024)
[MaskRCNN] INFO : [#0185] box_head/fc7/bias:0 => (1024,)
[MaskRCNN] INFO : [#0186] box_head/class-predict/kernel:0 => (1024, 1)
[MaskRCNN] INFO : [#0187] box_head/class-predict/bias:0 => (1,)
[MaskRCNN] INFO : [#0188] box_head/box-predict/kernel:0 => (1024, 4)
[MaskRCNN] INFO : [#0189] box_head/box-predict/bias:0 => (4,)
[MaskRCNN] INFO : [#0190] mask_head/mask-conv-l0/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0191] mask_head/mask-conv-l0/bias:0 => (256,)
[MaskRCNN] INFO : [#0192] mask_head/mask-conv-l1/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0193] mask_head/mask-conv-l1/bias:0 => (256,)
[MaskRCNN] INFO : [#0194] mask_head/mask-conv-l2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0195] mask_head/mask-conv-l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0196] mask_head/mask-conv-l3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0197] mask_head/mask-conv-l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0198] mask_head/conv5-mask/kernel:0 => (2, 2, 256, 256)
[MaskRCNN] INFO : [#0199] mask_head/conv5-mask/bias:0 => (256,)
[MaskRCNN] INFO : [#0200] mask_head/mask_fcn_logits/kernel:0 => (1, 1, 256, 1)
[MaskRCNN] INFO : [#0201] mask_head/mask_fcn_logits/bias:0 => (1,)
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (307 Tensors) from: /tmp/tmp3zazdm1d/model.ckpt-0
[MaskRCNN] INFO : Pretrained weights loaded with success…

2021-04-12 20:37:14.778236: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt.
2021-04-12 20:37:29.225699: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-12 20:37:29.779242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

Morganh · April 14, 2021, 3:17am

Can you share your training spec?
Also, have you shared the full training log above? If possible, please share the .ipynb file with us. Thanks.

mohsen.zardadi · April 14, 2021, 4:25am

Here is the training spec:

seed: 1234
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/server/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.01, 0.02, 0.01]"
total_steps: 20000
train_batch_size: 4
eval_batch_size: 4
num_steps_per_eval: 1000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0025
init_learning_rate: 0.005

data_config {
image_size: "(832, 576)"
augment_input_data: True
eval_samples: 50
training_file_pattern: "/workspace/server/tlt-experiments/IRUV/train/train.tfrecord"
validation_file_pattern: "/workspace/server/tlt-experiments/IRUV/val/val.tfrecord"
val_json_file: "/workspace/server/tlt-experiments/IRUV/raw-data/val/IRUV_val_v1.json"

# dataset specific parameters
num_classes: 1
skip_crowd_during_training: True

}

maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112

# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.

# Proposal layer.
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.

# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"

# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28

# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7

# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7

# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8

# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0

}
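One setting in this spec that may be worth cross-checking against the annotation file is `num_classes`. The sketch below is a hypothetical helper, not a TLT utility, that reads a COCO-style JSON such as the `val_json_file` above and reports the value `num_classes` would need under one specific assumption: that index 0 is reserved for background, so the value is the largest category id plus one. The thread does not confirm that TLT counts classes this way, so treat the result as something to compare against the TLT documentation rather than as the answer.

```python
import json


def expected_num_classes(annotation_json_path):
    """Return max(category id) + 1 for a COCO-style annotation file.

    ASSUMPTION (not confirmed in this thread): class index 0 is
    reserved for background, so a dataset whose single category has
    id 1 would need num_classes = 2.
    """
    with open(annotation_json_path) as f:
        coco = json.load(f)
    category_ids = [c["id"] for c in coco.get("categories", [])]
    if not category_ids:
        raise ValueError("no categories found in %s" % annotation_json_path)
    return max(category_ids) + 1
```

If the value returned for the validation JSON differs from the `num_classes` in the spec, that mismatch is a reasonable thing to investigate first.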

mohsen.zardadi · April 14, 2021, 4:42am

log.txt file and Jupyter notebook are available here: MaskRCNN

Morganh · April 14, 2021, 6:15am

I have requested access to it; please approve the request.

Morganh · April 14, 2021, 3:39pm

I can access it now, but the .ipynb file does not contain the run log. Could you please double-check that it is the exact Jupyter notebook you were running?

mohsen.zardadi · April 14, 2021, 4:25pm

I have updated the notebook. Have a look, please.

mohsen.zardadi · April 14, 2021, 4:28pm

I have also tried training from the command line to see if that helps, but the issue persists.

Morganh · April 14, 2021, 4:34pm

Please generate a new result folder and retry.

!mkdir -p $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new

!tlt-train mask_rcnn -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
    -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new \
    -k $KEY \
    --gpus 2

Also, please try --gpus 1 too.

mohsen.zardadi · April 14, 2021, 5:00pm

Using --gpus 1 started the training process, but the validation results do not look normal. I trained the model for 20,000 iterations and shared the new Jupyter notebook with the run log in the same folder. Please look at iteration 20,000 to see the validation results. I also visualized the inference output to check for masks, but the model generated none.

mohsen.zardadi · April 14, 2021, 5:22pm

All the validation results look like this:

[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Evaluation Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

[MaskRCNN] INFO : Average throughput: -1.0 samples/sec
[MaskRCNN] INFO : Total processed steps: 12
[MaskRCNN] INFO : Total processing time: 0.0h 05m 07s
[MaskRCNN] INFO : ==================== Metrics ====================
[MaskRCNN] INFO : AP: -1.000000000
[MaskRCNN] INFO : AP50: -1.000000000
[MaskRCNN] INFO : AP75: -1.000000000
[MaskRCNN] INFO : APl: -1.000000000
[MaskRCNN] INFO : APm: -1.000000000
[MaskRCNN] INFO : APs: -1.000000000
[MaskRCNN] INFO : ARl: -1.000000000
[MaskRCNN] INFO : ARm: -1.000000000
[MaskRCNN] INFO : ARmax1: -1.000000000
[MaskRCNN] INFO : ARmax10: -1.000000000
[MaskRCNN] INFO : ARmax100: -1.000000000
[MaskRCNN] INFO : ARs: -1.000000000
[MaskRCNN] INFO : mask_AP: -1.000000000
[MaskRCNN] INFO : mask_AP50: -1.000000000
[MaskRCNN] INFO : mask_AP75: -1.000000000
[MaskRCNN] INFO : mask_APl: -1.000000000
[MaskRCNN] INFO : mask_APm: -1.000000000
[MaskRCNN] INFO : mask_APs: -1.000000000
[MaskRCNN] INFO : mask_ARl: -1.000000000
[MaskRCNN] INFO : mask_ARm: -1.000000000
[MaskRCNN] INFO : mask_ARmax1: -1.000000000
[MaskRCNN] INFO : mask_ARmax10: -1.000000000
[MaskRCNN] INFO : mask_ARmax100: -1.000000000
[MaskRCNN] INFO : mask_ARs: -1.000000000

Morganh · April 15, 2021, 2:56am

According to your training log, the FastRCNN, mask, and RPN box losses are always 0.

[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors) from: /tmp/tmpms41248t
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned_new/model.step-0.tlt.
2021-04-14 16:43:14.064027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-14 16:43:14.550482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[MaskRCNN] INFO : timestamp: 1618418607.359632
[MaskRCNN] INFO : iteration: 5
DLL 2021-04-14 16:43:27.360372 - iteration : 5
[MaskRCNN] INFO : throughput: 0.7 samples/sec
DLL 2021-04-14 16:43:27.360733 - Iteration: 5 throughput : 0.7481762963747831
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.0000e+00
DLL 2021-04-14 16:43:27.361792 - Iteration: 5 FastRCNN box loss : 0.0000e+00
[MaskRCNN] INFO : FastRCNN class loss: 0.0000e+00
DLL 2021-04-14 16:43:27.361972 - Iteration: 5 FastRCNN class loss : 0.0000e+00
[MaskRCNN] INFO : FastRCNN total loss: 0.0000e+00
DLL 2021-04-14 16:43:27.362140 - Iteration: 5 FastRCNN total loss : 0.0000e+00
[MaskRCNN] INFO : L2 loss: 2.2248
DLL 2021-04-14 16:43:27.362373 - Iteration: 5 L2 loss : 2.2248
[MaskRCNN] INFO : Learning rate: 0.00251
DLL 2021-04-14 16:43:27.362570 - Iteration: 5 Learning rate : 0.00251
[MaskRCNN] INFO : Mask loss: 0.0000e+00
DLL 2021-04-14 16:43:27.362734 - Iteration: 5 Mask loss : 0.0000e+00
[MaskRCNN] INFO : RPN box loss: 0.0000e+00
DLL 2021-04-14 16:43:27.362904 - Iteration: 5 RPN box loss : 0.0000e+00
[MaskRCNN] INFO : RPN score loss: 0.33336
DLL 2021-04-14 16:43:27.363092 - Iteration: 5 RPN score loss : 0.33336
[MaskRCNN] INFO : RPN total loss: 0.33336
DLL 2021-04-14 16:43:27.363275 - Iteration: 5 RPN total loss : 0.33336
[MaskRCNN] INFO : Total loss: 2.55815
DLL 2021-04-14 16:43:27.363456 - Iteration: 5 Total loss : 2.55815

[MaskRCNN] INFO : timestamp: 1618418610.629227
[MaskRCNN] INFO : iteration: 10
DLL 2021-04-14 16:43:30.629827 - iteration : 10
[MaskRCNN] INFO : throughput: 7.9 samples/sec
DLL 2021-04-14 16:43:30.630085 - Iteration: 10 throughput : 7.855778068815487
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.0000e+00
DLL 2021-04-14 16:43:30.630939 - Iteration: 10 FastRCNN box loss : 0.0000e+00
[MaskRCNN] INFO : FastRCNN class loss: 0.0000e+00
DLL 2021-04-14 16:43:30.631105 - Iteration: 10 FastRCNN class loss : 0.0000e+00
[MaskRCNN] INFO : FastRCNN total loss: 0.0000e+00
DLL 2021-04-14 16:43:30.631265 - Iteration: 10 FastRCNN total loss : 0.0000e+00
[MaskRCNN] INFO : L2 loss: 2.22478
DLL 2021-04-14 16:43:30.631478 - Iteration: 10 L2 loss : 2.22478
[MaskRCNN] INFO : Learning rate: 0.00252
DLL 2021-04-14 16:43:30.631667 - Iteration: 10 Learning rate : 0.00252
[MaskRCNN] INFO : Mask loss: 0.0000e+00
DLL 2021-04-14 16:43:30.631826 - Iteration: 10 Mask loss : 0.0000e+00
[MaskRCNN] INFO : RPN box loss: 0.0000e+00
DLL 2021-04-14 16:43:30.631979 - Iteration: 10 RPN box loss : 0.0000e+00
[MaskRCNN] INFO : RPN score loss: 0.0585
DLL 2021-04-14 16:43:30.632160 - Iteration: 10 RPN score loss : 0.0585
[MaskRCNN] INFO : RPN total loss: 0.0585
DLL 2021-04-14 16:43:30.632340 - Iteration: 10 RPN total loss : 0.0585
[MaskRCNN] INFO : Total loss: 2.28327
DLL 2021-04-14 16:43:30.632523 - Iteration: 10 Total loss : 2.28327
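
A log like the one above can also be scanned automatically. The following is a minimal sketch, assuming the `DLL ... - Iteration: N <metric> : <value>` line layout shown in this thread; the function name `always_zero_metrics` is hypothetical and not part of TLT.

```python
import re
from collections import defaultdict

# Matches lines such as:
# DLL 2021-04-14 16:43:27.361792 - Iteration: 5 FastRCNN box loss : 0.0000e+00
DLL_LINE = re.compile(r"^DLL \S+ \S+ - Iteration: (\d+)\s+(.+?)\s*:\s*(\S+)$")


def always_zero_metrics(log_lines):
    """Return the sorted names of metrics whose logged value is 0 at every iteration."""
    values = defaultdict(list)
    for line in log_lines:
        match = DLL_LINE.match(line.strip())
        if not match:
            continue  # not a DLL metric line
        try:
            values[match.group(2)].append(float(match.group(3)))
        except ValueError:
            continue  # value is not numeric
    return sorted(name for name, vals in values.items() if all(v == 0.0 for v in vals))
```

For example, `always_zero_metrics(open("log.txt"))` would list the loss terms that never move away from zero, which here points at the box and mask branches rather than the RPN score or L2 terms.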

Can you describe in detail how you generated the training/val files below?

training_file_pattern: "/workspace/server/tlt-experiments/IRUV/train/train.tfrecord"
validation_file_pattern: "/workspace/server/tlt-experiments/IRUV/val/val.tfrecord"
val_json_file: "/workspace/server/tlt-experiments/IRUV/raw-data/val/IRUV_val_v1.json"
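
Before regenerating anything, it may help to sanity-check the COCO-style JSON that the tfrecords and `val_json_file` are derived from. This is a rough sketch using only the standard library; `summarize_coco` is a hypothetical helper, and it assumes the usual COCO layout with top-level `images`, `annotations`, and `categories` lists.

```python
import json


def summarize_coco(path):
    """Count images, annotations, and annotations that carry a non-empty segmentation."""
    with open(path) as f:
        coco = json.load(f)
    images = coco.get("images", [])
    annotations = coco.get("annotations", [])
    annotated_ids = {a["image_id"] for a in annotations}
    return {
        "images": len(images),
        "annotations": len(annotations),
        # An empty or missing "segmentation" means there is no mask to train on.
        "with_segmentation": sum(1 for a in annotations if a.get("segmentation")),
        "images_without_annotations": sum(1 for i in images if i["id"] not in annotated_ids),
        "category_ids": sorted(c["id"] for c in coco.get("categories", [])),
    }
```

For instance, `summarize_coco("/workspace/server/tlt-experiments/IRUV/raw-data/val/IRUV_val_v1.json")` should report as many entries under `with_segmentation` as under `annotations` if every object was labeled with a polygon or mask.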

mohsen.zardadi · April 15, 2021, 4:07am

Morganh:

[MaskRCNN] INFO : FastRCNN box loss: 0.0000e+00
DLL 2021-04-14 16:43:27.361792 - Iteration: 5 FastRCNN box loss : 0.0000e+00
[MaskRCNN] INFO : FastRCNN class loss: 0.0000e+00
DLL 2021-04-14 16:43:27.361972 - Iteration: 5 FastRCNN class loss : 0.0000e+00
[MaskRCNN] INFO : FastRCNN total loss: 0.0000e+00
DLL 2021-04-14 16:43:27.362140 - Iteration: 5 FastRCNN total loss : 0.0000e+00
[MaskRCNN] INFO : L2 loss: 2.2248
DLL 2021-04-14 16:43:27.362373 - Iteration: 5 L2 loss : 2.2248
[MaskRCNN] INFO : Learning rate: 0.00251
DLL 2021-04-14 16:43:27.362570 - Iteration: 5 Learning rate : 0.00251
[MaskRCNN] INFO : Mask loss: 0.0000e+00
DLL 2021-04-14 16:43:27.362734 - Iteration: 5 Mask loss : 0.0000e+00
[MaskRCNN] INFO : RPN box loss: 0.0000e+00
DLL 2021-04-14 16:43:27.362904 - Iteration: 5 RPN box loss : 0.0000e+00
[MaskRCNN] INFO : RPN score loss: 0.33336
DLL 2021-04-14 16:43:27.363092 - Iteration: 5 RPN score loss : 0.33336
[MaskRCNN] INFO : RPN total loss: 0.33336
DLL 2021-04-14 16:43:27.363275 - Iteration: 5 RPN total loss : 0.33336
[MaskRCNN] INFO : Total loss: 2.55815

I have downloaded them from CVAT after labeling all the images.
CVAT can export annotations in several formats, such as COCO JSON or TFRecord. I have used the COCO JSON format in other projects to train different models, and it has always worked fine.

To make sure I was doing it right, I also used this script to convert the COCO JSON annotations to TFRecords, but TLT training failed with this error:

[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors) from: /tmp/tmpdr7og2ok
[MaskRCNN] INFO : Pretrained weights loaded with success…

[MaskRCNN] INFO : Saving checkpoints for 0 into /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned_usingRecords/model.step-0.tlt.
2021-04-15 03:59:06.874454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-04-15 03:59:07.193094: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.203427: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.209160: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.209160: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.209318: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.209915: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.210473: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.210477: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.210815: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.211099: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.213120: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.213418: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.217125: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.217296: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.218006: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.218678: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.220682: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.230828: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
2021-04-15 03:59:07.233208: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at gather_nd_op.cc:47 : Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15570}} Requested more than 0 entries, but params is empty. Params shape: [0,1]
[[{{node parser/GatherNd}}]]
[[IteratorGetNext]]
[[RemoteCall]]
[[IteratorGetNext]]
[[IteratorGetNext/_3567]]
(1) Invalid argument: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_15570}} Requested more than 0 entries, but params is empty. Params shape: [0,1]
[[{{node parser/GatherNd}}]]
[[IteratorGetNext]]
[[RemoteCall]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 58, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 187, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 90, in run_executer
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/distributed_executer.py", line 393, in train_and_eval
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
[[{{node parser/GatherNd}}]]
[[IteratorGetNext]]
[[RemoteCall]]
[[IteratorGetNext]]
[[IteratorGetNext/_3567]]
(1) Invalid argument: Requested more than 0 entries, but params is empty. Params shape: [0,1]
[[{{node parser/GatherNd}}]]
[[IteratorGetNext]]
[[RemoteCall]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.

[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO : Training Performance Summary
[MaskRCNN] INFO : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-04-15 03:59:07.515959 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
DLL 2021-04-15 03:59:07.516130 - : Training Performance Summary
DLL 2021-04-15 03:59:07.516170 - : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

DLL 2021-04-15 03:59:07.516216 - Average_throughput : -1.0 samples/sec
DLL 2021-04-15 03:59:07.516254 - Total processed steps : 1
DLL 2021-04-15 03:59:07.516297 - Total_processing_time : 0h 00m 00s
[MaskRCNN] INFO : Average throughput: -1.0 samples/sec
[MaskRCNN] INFO : Total processed steps: 1
[MaskRCNN] INFO : Total processing time: 0h 00m 00s
DLL 2021-04-15 03:59:07.516508 - : ==================== Metrics ====================
[MaskRCNN] INFO : ==================== Metrics ====================

[MaskRCNN] ERROR : Job finished with an uncaught exception: FAILURE

Morganh · April 15, 2021, 9:27am

Can you shed more light on the dataset you were using?
In your ipynb file, you mention that

We will be using the COCO dataset for the tutorial. The following script will convert IRUV and SLAR COCO flavor data into TFRecords.
# tfrecord for training and validation have been downloaded from CVAT! (GitHub - opencv/cvat)
#!bash convert_coco-record.sh
#!bash convert_coco-tf-record.sh

Could you share the exact link to the dataset images and tell me where I can find convert_coco-record.sh and convert_coco-tf-record.sh?

mohsen.zardadi · April 15, 2021, 4:52pm

I have shared convert_coco-record.sh with you; it converts annotations from COCO JSON format to TFRecords.

Morganh · April 15, 2021, 4:57pm

Thanks. BTW, what is the raw dataset? Is it a public dataset or a private one?

mohsen.zardadi · April 15, 2021, 5:00pm

A private dataset.

Morganh · April 15, 2021, 5:03pm

Got it. Thanks for the info. I will dig into it further.

Morganh · April 17, 2021, 5:37pm

Your tfrecords file does not contain any 'image/object/mask'.

Also, your create_coco_record.py is different from the create_coco_tf_record.py inside the TLT 2.0 docker. Please refer to that file and change

flags.DEFINE_boolean('include_masks', False,

to

flags.DEFINE_boolean('include_masks', True,

Then generate the tfrecord files with the following command. Note that the command below is only a reference; it generates only the training tfrecord files.

PYTHONPATH="tf-models:tf-models/research" python create_coco_tf_record.py --train_image_dir=./your_images_folder --train_object_annotations_file=./IRUV_train_v1.json --output_dir=./result --train_caption_annotations_file=dummy_file.json
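
To confirm whether a generated file actually carries the 'image/object/mask' feature, a quick check can be sketched without TensorFlow. It relies on the TFRecord framing (an 8-byte little-endian length, a 4-byte CRC, the payload, then another 4-byte CRC) and on feature keys being stored as plain UTF-8 strings inside each serialized Example. The helper names are hypothetical, the CRCs are not verified, and the byte search is a heuristic rather than a real protobuf parse.

```python
import struct


def iter_tfrecord_payloads(path):
    """Yield the raw serialized payload of each record in a TFRecord file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte CRC of the length
            if len(header) < 12:
                return  # end of file
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            if len(payload) < length:
                return  # truncated file
            f.read(4)  # skip the payload CRC (not verified in this sketch)
            yield payload


def count_records_with_key(path, key=b"image/object/mask"):
    """Return (total_records, records_whose_bytes_contain_key)."""
    total = 0
    with_key = 0
    for payload in iter_tfrecord_payloads(path):
        total += 1
        if key in payload:
            with_key += 1
    return total, with_key
```

If `count_records_with_key("train.tfrecord")` returns a second number of 0, the file was most likely written without masks and needs to be regenerated.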

mohsen.zardadi · April 18, 2021, 5:34am

Thanks for your reply. I generated new tfrecords using create_coco_tf_record.py with flags.DEFINE_boolean('include_masks', True, but the validation results during training still don't look promising. I checked the new tfrecord files using

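import tensorflow as tf  # tf.python_io is a TF 1.x API
# Print every serialized tf.train.Example in the file in readable form.
for example in tf.python_io.tf_record_iterator("data/foobar.tfrecord"):
    print(tf.train.Example.FromString(example))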

to see if the tfrecords files contain image/object/mask and they look good.
I have shared with you the new tfrecords, create_coco_tf_record.py, and the Jupyter notebook with the running log.

DONE (t=0.00s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Running per image evaluation...
Evaluate annotation type *segm*
DONE (t=0.01s).
Accumulating evaluation results...
DONE (t=0.00s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000

[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #
[MaskRCNN] INFO    :          Evaluation Performance Summary          
[MaskRCNN] INFO    : # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ #

[MaskRCNN] INFO    : Average throughput: -1.0         samples/sec
[MaskRCNN] INFO    : Total processed steps:         12
[MaskRCNN] INFO    : Total processing time: 0.0h 03m 45s
[MaskRCNN] INFO    : ==================== Metrics ====================
[MaskRCNN] INFO    : AP: 0.000000000
[MaskRCNN] INFO    : AP50: 0.000000000
[MaskRCNN] INFO    : AP75: 0.000000000
[MaskRCNN] INFO    : APl: 0.000000000
[MaskRCNN] INFO    : APm: 0.000000000
[MaskRCNN] INFO    : APs: 0.000000000
[MaskRCNN] INFO    : ARl: 0.000000000
[MaskRCNN] INFO    : ARm: 0.000000000
[MaskRCNN] INFO    : ARmax1: 0.000000000
[MaskRCNN] INFO    : ARmax10: 0.000000000
[MaskRCNN] INFO    : ARmax100: 0.000000000
[MaskRCNN] INFO    : ARs: 0.000000000
[MaskRCNN] INFO    : mask_AP: 0.000000000
[MaskRCNN] INFO    : mask_AP50: 0.000000000
[MaskRCNN] INFO    : mask_AP75: 0.000000000
[MaskRCNN] INFO    : mask_APl: 0.000000000
[MaskRCNN] INFO    : mask_APm: 0.000000000
[MaskRCNN] INFO    : mask_APs: 0.000000000
[MaskRCNN] INFO    : mask_ARl: 0.000000000
[MaskRCNN] INFO    : mask_ARm: 0.000000000
[MaskRCNN] INFO    : mask_ARmax1: 0.000000000
[MaskRCNN] INFO    : mask_ARmax10: 0.000000000
[MaskRCNN] INFO    : mask_ARmax100: 0.000000000
[MaskRCNN] INFO    : mask_ARs: 0.000000000

DLL 2021-04-18 05:02:52.799819 - Iteration: 2000 Validation Iteration: 2000  AP : 0.0 
DLL 2021-04-18 05:02:52.799975 - Iteration: 2000 Validation Iteration: 2000  AP50 : 0.0 
DLL 2021-04-18 05:02:52.800025 - Iteration: 2000 Validation Iteration: 2000  AP75 : 0.0 
DLL 2021-04-18 05:02:52.800067 - Iteration: 2000 Validation Iteration: 2000  APs : 0.0 
DLL 2021-04-18 05:02:52.800108 - Iteration: 2000 Validation Iteration: 2000  APm : 0.0 
DLL 2021-04-18 05:02:52.800148 - Iteration: 2000 Validation Iteration: 2000  APl : 0.0 
DLL 2021-04-18 05:02:52.800187 - Iteration: 2000 Validation Iteration: 2000  ARmax1 : 0.0 
DLL 2021-04-18 05:02:52.800225 - Iteration: 2000 Validation Iteration: 2000  ARmax10 : 0.0 
DLL 2021-04-18 05:02:52.800326 - Iteration: 2000 Validation Iteration: 2000  ARmax100 : 0.0 
DLL 2021-04-18 05:02:52.800369 - Iteration: 2000 Validation Iteration: 2000  ARs : 0.0 
DLL 2021-04-18 05:02:52.800408 - Iteration: 2000 Validation Iteration: 2000  ARm : 0.0 
DLL 2021-04-18 05:02:52.800444 - Iteration: 2000 Validation Iteration: 2000  ARl : 0.0 
DLL 2021-04-18 05:02:52.800480 - Iteration: 2000 Validation Iteration: 2000  mask_AP : 0.0 
DLL 2021-04-18 05:02:52.800519 - Iteration: 2000 Validation Iteration: 2000  mask_AP50 : 0.0 
DLL 2021-04-18 05:02:52.800555 - Iteration: 2000 Validation Iteration: 2000  mask_AP75 : 0.0 
DLL 2021-04-18 05:02:52.800592 - Iteration: 2000 Validation Iteration: 2000  mask_APs : 0.0 
DLL 2021-04-18 05:02:52.800629 - Iteration: 2000 Validation Iteration: 2000  mask_APm : 0.0 
DLL 2021-04-18 05:02:52.800667 - Iteration: 2000 Validation Iteration: 2000  mask_APl : 0.0 
DLL 2021-04-18 05:02:52.800704 - Iteration: 2000 Validation Iteration: 2000  mask_ARmax1 : 0.0 
DLL 2021-04-18 05:02:52.800741 - Iteration: 2000 Validation Iteration: 2000  mask_ARmax10 : 0.0 
DLL 2021-04-18 05:02:52.800809 - Iteration: 2000 Validation Iteration: 2000  mask_ARmax100 : 0.0 
DLL 2021-04-18 05:02:52.800850 - Iteration: 2000 Validation Iteration: 2000  mask_ARs : 0.0 
DLL 2021-04-18 05:02:52.800889 - Iteration: 2000 Validation Iteration: 2000  mask_ARm : 0.0 
DLL 2021-04-18 05:02:52.800926 - Iteration: 2000 Validation Iteration: 2000  mask_ARl : 0.0