TLT Mask R-CNN training on the Mapillary Vistas dataset fails with CUDA_ERROR_OUT_OF_MEMORY: out of memory

Operating System: Ubuntu 16.04
TLT: nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3
GPUs: 3
GPU Spec: GeForce GTX 1080

I would like to train a Mask R-CNN model on the Mapillary Vistas dataset. I’ve already converted the dataset’s annotations to the COCO JSON format and converted the training and validation images to TFRecord format.

My maskrcnn_train_resnet50.txt spec file is as follows:
use_amp: False
warmup_steps: 1000
checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[10000, 15000, 20000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 25000
train_batch_size: 1
eval_batch_size: 1
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.01

data_config {
image_size: "(1080, 1920)"
augment_input_data: True
eval_samples: 2000 #500
training_file_pattern: "/workspace/tlt-experiments/mapillary/train*.tfrecord"
validation_file_pattern: "/workspace/tlt-experiments/mapillary/val*.tfrecord"
val_json_file: "/workspace/tlt-experiments/mapillary/annotations/instances_shape_validation2020_v1.2.json"

# dataset specific parameters
num_classes: 37 #91
skip_crowd_during_training: True

}

maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112

# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.

# Proposal layer.
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.

# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"

# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28

# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7

# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7

# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8

# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0

}

But I keep getting the following error during the tlt-train step:

2021-05-07 06:06:33.418863: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 4074 of 4096
2021-05-07 06:06:38.796773: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-07 06:06:38.870466: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -32 must be >= 0
2021-05-07 06:06:38.871678: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -32 must be >= 0
2021-05-07 06:06:38.872646: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -63 must be >= 0
2021-05-07 06:06:41.053554: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-07 06:06:41.643028: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -18 must be >= 0
2021-05-07 06:06:41.660554: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -6 must be >= 0
2021-05-07 06:06:48.810039: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:48.810683: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2021-05-07 06:06:48.810984: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 15461881856 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:48.811089: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 15461881856
2021-05-07 06:06:51.391088: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499366: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
2021-05-07 06:06:51.499605: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 15461881856 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499624: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 15461881856
2021-05-07 06:06:51.499682: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 13915693056 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499697: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 13915693056
2021-05-07 06:06:51.499738: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 12524123136 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499751: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 12524123136
2021-05-07 06:06:51.499791: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 11271710720 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499803: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 11271710720
2021-05-07 06:06:51.499843: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 10144539648 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499855: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 10144539648
2021-05-07 06:06:51.499895: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 9130085376 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499907: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 9130085376
2021-05-07 06:06:51.499948: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 8217076736 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.499961: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8217076736
2021-05-07 06:06:51.500000: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7395368960 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.500012: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7395368960
2021-05-07 06:06:51.500065: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 6655832064 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 06:06:51.500077: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 6655832064
/usr/local/bin/tlt-train: line 32: 11106 Killed tlt-train-g1 ${PYTHON_ARGS[*]}

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[38151,1],2]
Exit code: 137

I’ve tried setting TF_FORCE_GPU_ALLOW_GROWTH=true (via export) before training, but it didn’t help. Could someone please help me pinpoint the problem? Thanks a lot.
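
One possible reading of the log above, offered as a sketch rather than a confirmed diagnosis: the failed allocations are reported "on host" (pinned host memory), and exit code 137 conventionally means 128 + 9, i.e. the process was terminated by SIGKILL, which on Linux is often the kernel's out-of-memory killer. The short Python snippet below converts a few of the byte counts from the log to GiB and decodes the exit code. The helper names are illustrative and not part of TLT or TensorFlow.

```python
import signal

# Byte counts copied from the "failed to alloc ... on host" lines in the log above.
FAILED_HOST_ALLOCS = [17179869184, 15461881856, 13915693056, 6655832064]


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


def decode_exit_code(code):
    """Return the signal name if a shell-style exit code encodes a signal
    (128 + signal number), otherwise None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # Not a known signal number on this platform.
        return None


if __name__ == "__main__":
    for size in FAILED_HOST_ALLOCS:
        print(f"{size} bytes = {to_gib(size):.1f} GiB")
    print("exit code 137 ->", decode_exit_code(137))
```

If this reading is right, comparing these request sizes with the RAM actually available to the container (for example with free -g on the host) would show whether host memory, rather than GPU memory, is the limit.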

Please train with W, H which are multiples of 32.

https://docs.nvidia.com/metropolis/TLT/archive/tlt-20/tlt-user-guide/text/supported_model_architectures.html#instance-segmentation
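
A quick way to check this rule, as a minimal sketch: the spec above sets image_size to "(1080, 1920)", and 1080 is not divisible by 32 while 1920 is. The Python helpers below are hypothetical names for illustration, not part of TLT; they test a dimension and give the nearest multiples of 32 below and above it.

```python
def is_multiple(value, base=32):
    """True if value is an exact multiple of base."""
    return value % base == 0


def round_down_to_multiple(value, base=32):
    """Largest multiple of base that is <= value."""
    return (value // base) * base


def round_up_to_multiple(value, base=32):
    """Smallest multiple of base that is >= value."""
    return ((value + base - 1) // base) * base


if __name__ == "__main__":
    # Both values from image_size in the spec above.
    for dim in (1080, 1920):
        print(
            dim,
            "ok" if is_multiple(dim) else "not a multiple of 32",
            "nearest below:", round_down_to_multiple(dim),
            "nearest above:", round_up_to_multiple(dim),
        )
```

Any multiple of 32 satisfies the rule as stated; smaller values also reduce memory use per image.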

Thank you for your reply. I changed W and H by setting image_size to "(1024, 1920)", but I still get the same error. Is there anything else I can try?

Could you share the latest error log?

2021-05-07 08:46:02.542971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:02.542968: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:02.542979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[MaskRCNN] INFO    : Horovod successfully initialized ...
[MaskRCNN] INFO    : Loading pretrained model...
2021-05-07 08:46:20.875842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-05-07 08:46:20.879285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-05-07 08:46:20.897496: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-05-07 08:46:21.100075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2021-05-07 08:46:21.100175: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:21.103503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:81:00.0
2021-05-07 08:46:21.103548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:21.123577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2021-05-07 08:46:21.123624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:21.312454: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:21.312489: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:21.312500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:21.409826: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:21.409874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:21.409879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:21.468476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:21.468486: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:21.468472: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:21.658871: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:21.658853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:21.658906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:21.801361: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:21.801408: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:21.801413: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:22.228167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:22.228157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:22.228145: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:22.235411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:22.235499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:22.235519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:22.263862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:81:00.0
2021-05-07 08:46:22.263910: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:22.263971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:22.263974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2021-05-07 08:46:22.264036: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:22.264020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2021-05-07 08:46:22.264047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:22.264096: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:22.264087: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:22.264140: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:22.264136: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:22.264153: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:22.264194: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:22.264206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:22.264205: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:22.264243: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:22.264259: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:22.264271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:22.264308: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:22.264334: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:22.264362: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:22.264398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:22.264412: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:22.264463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:22.269651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:22.271389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:22.271437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:22.288579: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:22.288603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:22.288605: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:28.648914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:28.648988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:28.649000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:28.655268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:28.655309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:28.655322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:28.658929: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-05-07 08:46:28.658935: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-05-07 08:46:28.659006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6958 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
2021-05-07 08:46:28.659029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7132 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-05-07 08:46:28.731861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:28.731922: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:28.731936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:28.734879: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-05-07 08:46:28.734938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7132 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:81:00.0, compute capability: 6.1)
[MaskRCNN] INFO    : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 01
[MaskRCNN] INFO    : =================================

[MaskRCNN] INFO    : Using Dataset Sharding with Horovod
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: multilevel_propose_rois/level_6/
2021-05-07 08:46:46.656982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2021-05-07 08:46:46.657124: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:46.657291: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:46.657333: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:46.657368: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:46.657403: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:46.657438: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:46.657473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:46.658716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:46.658774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:46.658789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:46.658798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:46.660047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7132 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-05-07 08:46:46.785946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2021-05-07 08:46:46.786079: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:46.786233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:46.786277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:46.786313: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:46.786348: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:46.786386: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:46.786422: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:46.787621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:46.787701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:46.787715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:46.787725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:46.788988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6958 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
2021-05-07 08:46:47.606248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:81:00.0
2021-05-07 08:46:47.606387: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:47.606558: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:47.606601: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:47.606640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:47.606682: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:47.606718: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:47.606754: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:47.607977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:47.608087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:47.608103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:47.608113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:47.609427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7132 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:81:00.0, compute capability: 6.1)
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 854.4 GFLOPS/image
Using TensorFlow backend.
4 ops no flops stats due to incomplete shapes.
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/l5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d4/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d4/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d5/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [fpn/post_hoc_d5/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn_head/rpn/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn_head/rpn/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn_head/rpn-class/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn_head/rpn-class/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn_head/rpn-box/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [rpn_head/rpn-box/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/fc6/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/fc6/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/fc7/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/fc7/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/class-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/class-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/box-predict/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [box_head/box-predict/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l0/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l0/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l1/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l1/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l2/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l2/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l3/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask-conv-l3/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/conv5-mask/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/conv5-mask/bias]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask_fcn_logits/kernel]
[MaskRCNN] WARNING : Checkpoint is missing variable [mask_head/mask_fcn_logits/bias]
2021-05-07 08:46:56.893241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2021-05-07 08:46:56.893320: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:56.893393: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:56.893438: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:56.893477: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:56.893518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:56.893555: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:56.893596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:56.894946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:56.895000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:56.895017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:56.895029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:56.896454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7132 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2021-05-07 08:46:58.453355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:81:00.0
2021-05-07 08:46:58.453493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:46:58.453775: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:46:58.453819: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:46:58.453856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:46:58.453892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:46:58.453926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:46:58.453961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:46:58.455164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:46:58.455260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:46:58.455275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:46:58.455284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:46:58.456590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7132 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:81:00.0, compute capability: 6.1)
2021-05-07 08:47:01.851362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:02:00.0
2021-05-07 08:47:01.851484: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-07 08:47:01.851681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:47:01.851726: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-07 08:47:01.851760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-07 08:47:01.851811: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-07 08:47:01.851845: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-07 08:47:01.851883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-07 08:47:01.853008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-07 08:47:01.853064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 08:47:01.853075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2021-05-07 08:47:01.853083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2021-05-07 08:47:01.854326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6958 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
2021-05-07 08:47:13.132231: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2021-05-07 08:47:13.138571: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2021-05-07 08:47:14.538473: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
[MaskRCNN] INFO    : ============================ GIT REPOSITORY ============================
[MaskRCNN] INFO    : BRANCH NAME:
[MaskRCNN] INFO    : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[MaskRCNN] INFO    : ============================ MODEL STATISTICS ===========================
[MaskRCNN] INFO    : # Model Weights: 44,217,005
[MaskRCNN] INFO    : # Trainable Weights: 44,163,885
[MaskRCNN] INFO    : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[MaskRCNN] INFO    : ============================ TRAINABLE VARIABLES ========================
[MaskRCNN] INFO    : [#0001] conv1/kernel:0                                               => (7, 7, 3, 64)
[MaskRCNN] INFO    : [#0002] bn_conv1/gamma:0                                             => (64,)
[MaskRCNN] INFO    : [#0003] bn_conv1/beta:0                                              => (64,)
[MaskRCNN] INFO    : [#0004] block_1a_conv_1/kernel:0                                     => (1, 1, 64, 64)
[MaskRCNN] INFO    : [#0005] block_1a_bn_1/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0006] block_1a_bn_1/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0007] block_1a_conv_2/kernel:0                                     => (3, 3, 64, 64)
[MaskRCNN] INFO    : [#0008] block_1a_bn_2/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0009] block_1a_bn_2/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0010] block_1a_conv_3/kernel:0                                     => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0011] block_1a_bn_3/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0012] block_1a_bn_3/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0013] block_1a_conv_shortcut/kernel:0                              => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0014] block_1a_bn_shortcut/gamma:0                                 => (256,)
[MaskRCNN] INFO    : [#0015] block_1a_bn_shortcut/beta:0                                  => (256,)
[MaskRCNN] INFO    : [#0016] block_1b_conv_1/kernel:0                                     => (1, 1, 256, 64)
[MaskRCNN] INFO    : [#0017] block_1b_bn_1/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0018] block_1b_bn_1/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0019] block_1b_conv_2/kernel:0                                     => (3, 3, 64, 64)
[MaskRCNN] INFO    : [#0020] block_1b_bn_2/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0021] block_1b_bn_2/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0022] block_1b_conv_3/kernel:0                                     => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0023] block_1b_bn_3/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0024] block_1b_bn_3/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0025] block_1c_conv_1/kernel:0                                     => (1, 1, 256, 64)
[MaskRCNN] INFO    : [#0026] block_1c_bn_1/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0027] block_1c_bn_1/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0028] block_1c_conv_2/kernel:0                                     => (3, 3, 64, 64)
[MaskRCNN] INFO    : [#0029] block_1c_bn_2/gamma:0                                        => (64,)
[MaskRCNN] INFO    : [#0030] block_1c_bn_2/beta:0                                         => (64,)
[MaskRCNN] INFO    : [#0031] block_1c_conv_3/kernel:0                                     => (1, 1, 64, 256)
[MaskRCNN] INFO    : [#0032] block_1c_bn_3/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0033] block_1c_bn_3/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0034] block_2a_conv_1/kernel:0                                     => (1, 1, 256, 128)
[MaskRCNN] INFO    : [#0035] block_2a_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0036] block_2a_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0037] block_2a_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0038] block_2a_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0039] block_2a_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0040] block_2a_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0041] block_2a_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0042] block_2a_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0043] block_2a_conv_shortcut/kernel:0                              => (1, 1, 256, 512)
[MaskRCNN] INFO    : [#0044] block_2a_bn_shortcut/gamma:0                                 => (512,)
[MaskRCNN] INFO    : [#0045] block_2a_bn_shortcut/beta:0                                  => (512,)
[MaskRCNN] INFO    : [#0046] block_2b_conv_1/kernel:0                                     => (1, 1, 512, 128)
[MaskRCNN] INFO    : [#0047] block_2b_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0048] block_2b_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0049] block_2b_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0050] block_2b_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0051] block_2b_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0052] block_2b_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0053] block_2b_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0054] block_2b_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0055] block_2c_conv_1/kernel:0                                     => (1, 1, 512, 128)
[MaskRCNN] INFO    : [#0056] block_2c_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0057] block_2c_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0058] block_2c_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0059] block_2c_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0060] block_2c_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0061] block_2c_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0062] block_2c_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0063] block_2c_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0064] block_2d_conv_1/kernel:0                                     => (1, 1, 512, 128)
[MaskRCNN] INFO    : [#0065] block_2d_bn_1/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0066] block_2d_bn_1/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0067] block_2d_conv_2/kernel:0                                     => (3, 3, 128, 128)
[MaskRCNN] INFO    : [#0068] block_2d_bn_2/gamma:0                                        => (128,)
[MaskRCNN] INFO    : [#0069] block_2d_bn_2/beta:0                                         => (128,)
[MaskRCNN] INFO    : [#0070] block_2d_conv_3/kernel:0                                     => (1, 1, 128, 512)
[MaskRCNN] INFO    : [#0071] block_2d_bn_3/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0072] block_2d_bn_3/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0073] block_3a_conv_1/kernel:0                                     => (1, 1, 512, 256)
[MaskRCNN] INFO    : [#0074] block_3a_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0075] block_3a_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0076] block_3a_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0077] block_3a_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0078] block_3a_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0079] block_3a_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0080] block_3a_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0081] block_3a_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0082] block_3a_conv_shortcut/kernel:0                              => (1, 1, 512, 1024)
[MaskRCNN] INFO    : [#0083] block_3a_bn_shortcut/gamma:0                                 => (1024,)
[MaskRCNN] INFO    : [#0084] block_3a_bn_shortcut/beta:0                                  => (1024,)
[MaskRCNN] INFO    : [#0085] block_3b_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0086] block_3b_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0087] block_3b_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0088] block_3b_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0089] block_3b_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0090] block_3b_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0091] block_3b_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0092] block_3b_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0093] block_3b_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0094] block_3c_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0095] block_3c_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0096] block_3c_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0097] block_3c_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0098] block_3c_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0099] block_3c_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0100] block_3c_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0101] block_3c_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0102] block_3c_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0103] block_3d_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0104] block_3d_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0105] block_3d_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0106] block_3d_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0107] block_3d_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0108] block_3d_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0109] block_3d_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0110] block_3d_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0111] block_3d_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0112] block_3e_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0113] block_3e_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0114] block_3e_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0115] block_3e_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0116] block_3e_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0117] block_3e_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0118] block_3e_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0119] block_3e_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0120] block_3e_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0121] block_3f_conv_1/kernel:0                                     => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0122] block_3f_bn_1/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0123] block_3f_bn_1/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0124] block_3f_conv_2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0125] block_3f_bn_2/gamma:0                                        => (256,)
[MaskRCNN] INFO    : [#0126] block_3f_bn_2/beta:0                                         => (256,)
[MaskRCNN] INFO    : [#0127] block_3f_conv_3/kernel:0                                     => (1, 1, 256, 1024)
[MaskRCNN] INFO    : [#0128] block_3f_bn_3/gamma:0                                        => (1024,)
[MaskRCNN] INFO    : [#0129] block_3f_bn_3/beta:0                                         => (1024,)
[MaskRCNN] INFO    : [#0130] block_4a_conv_1/kernel:0                                     => (1, 1, 1024, 512)
[MaskRCNN] INFO    : [#0131] block_4a_bn_1/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0132] block_4a_bn_1/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0133] block_4a_conv_2/kernel:0                                     => (3, 3, 512, 512)
[MaskRCNN] INFO    : [#0134] block_4a_bn_2/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0135] block_4a_bn_2/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0136] block_4a_conv_3/kernel:0                                     => (1, 1, 512, 2048)
[MaskRCNN] INFO    : [#0137] block_4a_bn_3/gamma:0                                        => (2048,)
[MaskRCNN] INFO    : [#0138] block_4a_bn_3/beta:0                                         => (2048,)
[MaskRCNN] INFO    : [#0139] block_4a_conv_shortcut/kernel:0                              => (1, 1, 1024, 2048)
[MaskRCNN] INFO    : [#0140] block_4a_bn_shortcut/gamma:0                                 => (2048,)
[MaskRCNN] INFO    : [#0141] block_4a_bn_shortcut/beta:0                                  => (2048,)
[MaskRCNN] INFO    : [#0142] block_4b_conv_1/kernel:0                                     => (1, 1, 2048, 512)
[MaskRCNN] INFO    : [#0143] block_4b_bn_1/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0144] block_4b_bn_1/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0145] block_4b_conv_2/kernel:0                                     => (3, 3, 512, 512)
[MaskRCNN] INFO    : [#0146] block_4b_bn_2/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0147] block_4b_bn_2/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0148] block_4b_conv_3/kernel:0                                     => (1, 1, 512, 2048)
[MaskRCNN] INFO    : [#0149] block_4b_bn_3/gamma:0                                        => (2048,)
[MaskRCNN] INFO    : [#0150] block_4b_bn_3/beta:0                                         => (2048,)
[MaskRCNN] INFO    : [#0151] block_4c_conv_1/kernel:0                                     => (1, 1, 2048, 512)
[MaskRCNN] INFO    : [#0152] block_4c_bn_1/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0153] block_4c_bn_1/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0154] block_4c_conv_2/kernel:0                                     => (3, 3, 512, 512)
[MaskRCNN] INFO    : [#0155] block_4c_bn_2/gamma:0                                        => (512,)
[MaskRCNN] INFO    : [#0156] block_4c_bn_2/beta:0                                         => (512,)
[MaskRCNN] INFO    : [#0157] block_4c_conv_3/kernel:0                                     => (1, 1, 512, 2048)
[MaskRCNN] INFO    : [#0158] block_4c_bn_3/gamma:0                                        => (2048,)
[MaskRCNN] INFO    : [#0159] block_4c_bn_3/beta:0                                         => (2048,)
[MaskRCNN] INFO    : [#0160] fpn/l2/kernel:0                                              => (1, 1, 256, 256)
[MaskRCNN] INFO    : [#0161] fpn/l2/bias:0                                                => (256,)
[MaskRCNN] INFO    : [#0162] fpn/l3/kernel:0                                              => (1, 1, 512, 256)
[MaskRCNN] INFO    : [#0163] fpn/l3/bias:0                                                => (256,)
[MaskRCNN] INFO    : [#0164] fpn/l4/kernel:0                                              => (1, 1, 1024, 256)
[MaskRCNN] INFO    : [#0165] fpn/l4/bias:0                                                => (256,)
[MaskRCNN] INFO    : [#0166] fpn/l5/kernel:0                                              => (1, 1, 2048, 256)
[MaskRCNN] INFO    : [#0167] fpn/l5/bias:0                                                => (256,)
[MaskRCNN] INFO    : [#0168] fpn/post_hoc_d2/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0169] fpn/post_hoc_d2/bias:0                                       => (256,)
[MaskRCNN] INFO    : [#0170] fpn/post_hoc_d3/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0171] fpn/post_hoc_d3/bias:0                                       => (256,)
[MaskRCNN] INFO    : [#0172] fpn/post_hoc_d4/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0173] fpn/post_hoc_d4/bias:0                                       => (256,)
[MaskRCNN] INFO    : [#0174] fpn/post_hoc_d5/kernel:0                                     => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0175] fpn/post_hoc_d5/bias:0                                       => (256,)
[MaskRCNN] INFO    : [#0176] rpn_head/rpn/kernel:0                                        => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0177] rpn_head/rpn/bias:0                                          => (256,)
[MaskRCNN] INFO    : [#0178] rpn_head/rpn-class/kernel:0                                  => (1, 1, 256, 3)
[MaskRCNN] INFO    : [#0179] rpn_head/rpn-class/bias:0                                    => (3,)
[MaskRCNN] INFO    : [#0180] rpn_head/rpn-box/kernel:0                                    => (1, 1, 256, 12)
[MaskRCNN] INFO    : [#0181] rpn_head/rpn-box/bias:0                                      => (12,)
[MaskRCNN] INFO    : [#0182] box_head/fc6/kernel:0                                        => (12544, 1024)
[MaskRCNN] INFO    : [#0183] box_head/fc6/bias:0                                          => (1024,)
[MaskRCNN] INFO    : [#0184] box_head/fc7/kernel:0                                        => (1024, 1024)
[MaskRCNN] INFO    : [#0185] box_head/fc7/bias:0                                          => (1024,)
[MaskRCNN] INFO    : [#0186] box_head/class-predict/kernel:0                              => (1024, 37)
[MaskRCNN] INFO    : [#0187] box_head/class-predict/bias:0                                => (37,)
[MaskRCNN] INFO    : [#0188] box_head/box-predict/kernel:0                                => (1024, 148)
[MaskRCNN] INFO    : [#0189] box_head/box-predict/bias:0                                  => (148,)
[MaskRCNN] INFO    : [#0190] mask_head/mask-conv-l0/kernel:0                              => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0191] mask_head/mask-conv-l0/bias:0                                => (256,)
[MaskRCNN] INFO    : [#0192] mask_head/mask-conv-l1/kernel:0                              => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0193] mask_head/mask-conv-l1/bias:0                                => (256,)
[MaskRCNN] INFO    : [#0194] mask_head/mask-conv-l2/kernel:0                              => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0195] mask_head/mask-conv-l2/bias:0                                => (256,)
[MaskRCNN] INFO    : [#0196] mask_head/mask-conv-l3/kernel:0                              => (3, 3, 256, 256)
[MaskRCNN] INFO    : [#0197] mask_head/mask-conv-l3/bias:0                                => (256,)
[MaskRCNN] INFO    : [#0198] mask_head/conv5-mask/kernel:0                                => (2, 2, 256, 256)
[MaskRCNN] INFO    : [#0199] mask_head/conv5-mask/bias:0                                  => (256,)
[MaskRCNN] INFO    : [#0200] mask_head/mask_fcn_logits/kernel:0                           => (1, 1, 256, 37)
[MaskRCNN] INFO    : [#0201] mask_head/mask_fcn_logits/bias:0                             => (37,)
[MaskRCNN] INFO    : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


[MaskRCNN] INFO    : # ============================================= #
[MaskRCNN] INFO    :                  Start Training
[MaskRCNN] INFO    : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors) from: /tmp/tmpccybez5y
[MaskRCNN] INFO    : Pretrained weights loaded with success...

2021-05-07 08:47:24.473507: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:47:24.982595: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt.
2021-05-07 08:47:37.353308: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1 of 4096
2021-05-07 08:47:39.098429: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 4 of 4096
2021-05-07 08:47:47.309114: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3 of 4096
2021-05-07 08:47:49.406729: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-07 08:47:53.298856: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 8 of 4096
2021-05-07 08:47:57.225106: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 10 of 4096
2021-05-07 08:47:58.018365: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 69 of 4096
2021-05-07 08:48:03.007399: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1 of 4096
2021-05-07 08:48:13.569788: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 483 of 4096
2021-05-07 08:48:16.375460: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 484 of 4096
2021-05-07 08:48:16.669674: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3 of 4096
2021-05-07 08:48:20.057419: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 255 of 4096
2021-05-07 08:48:20.057519: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 256 of 4096
2021-05-07 08:48:22.040514: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 30 of 4096
2021-05-07 08:48:26.143315: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 489 of 4096
2021-05-07 08:48:29.763975: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 535 of 4096
2021-05-07 08:48:30.014142: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 35 of 4096
2021-05-07 08:48:39.176829: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 580 of 4096
2021-05-07 08:48:39.392500: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 733 of 4096
2021-05-07 08:48:42.866719: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 483 of 4096
2021-05-07 08:48:46.820752: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 675 of 4096
2021-05-07 08:48:48.584385: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 735 of 4096
2021-05-07 08:48:53.932793: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 485 of 4096
2021-05-07 08:48:56.303949: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 740 of 4096
2021-05-07 08:48:56.960826: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1015 of 4096
2021-05-07 08:49:03.202716: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 515 of 4096
2021-05-07 08:49:05.842169: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 970 of 4096
2021-05-07 08:49:06.729959: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1059 of 4096
2021-05-07 08:49:10.061027: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 830 of 4096
2021-05-07 08:49:24.359727: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1075 of 4096
2021-05-07 08:49:24.936116: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1224 of 4096
2021-05-07 08:49:26.080425: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 985 of 4096
2021-05-07 08:49:26.557329: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1225 of 4096
2021-05-07 08:49:28.262184: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1375 of 4096
2021-05-07 08:49:30.868334: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 999 of 4096
2021-05-07 08:49:36.821852: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1254 of 4096
2021-05-07 08:49:46.221370: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1525 of 4096
2021-05-07 08:49:47.126731: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1529 of 4096
2021-05-07 08:49:47.245917: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1530 of 4096
2021-05-07 08:49:50.399657: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1070 of 4096
2021-05-07 08:49:50.399730: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1071 of 4096
2021-05-07 08:49:57.347645: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1565 of 4096
2021-05-07 08:50:00.915932: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1469 of 4096
2021-05-07 08:50:05.554645: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1733 of 4096
2021-05-07 08:50:06.843654: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2024 of 4096
2021-05-07 08:50:10.253038: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1505 of 4096
2021-05-07 08:50:15.095021: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1734 of 4096
2021-05-07 08:50:19.324493: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2015 of 4096
2021-05-07 08:50:20.888882: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2065 of 4096
2021-05-07 08:50:23.222828: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1555 of 4096
2021-05-07 08:50:26.967044: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2075 of 4096
2021-05-07 08:50:27.762726: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2218 of 4096
2021-05-07 08:50:34.221730: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2003 of 4096
2021-05-07 08:50:40.393357: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2435 of 4096
2021-05-07 08:50:40.449466: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2220 of 4096
2021-05-07 08:50:47.482847: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2542 of 4096
2021-05-07 08:50:47.554430: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2258 of 4096
2021-05-07 08:50:47.661301: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2005 of 4096
2021-05-07 08:50:50.138642: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2040 of 4096
2021-05-07 08:50:56.591684: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2852 of 4096
2021-05-07 08:50:58.035594: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2666 of 4096
2021-05-07 08:51:01.098964: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2230 of 4096
2021-05-07 08:51:09.846221: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2961 of 4096
2021-05-07 08:51:10.807753: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2667 of 4096
2021-05-07 08:51:15.036738: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2502 of 4096
2021-05-07 08:51:17.051980: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2705 of 4096
2021-05-07 08:51:18.943390: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2962 of 4096
2021-05-07 08:51:20.178657: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2569 of 4096
2021-05-07 08:51:26.328317: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3269 of 4096
2021-05-07 08:51:31.537988: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2916 of 4096
2021-05-07 08:51:33.574865: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3096 of 4096
2021-05-07 08:51:38.082032: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3123 of 4096
2021-05-07 08:51:40.245056: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3271 of 4096
2021-05-07 08:51:40.370609: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2958 of 4096
2021-05-07 08:51:45.905999: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3186 of 4096
2021-05-07 08:51:47.763647: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3359 of 4096
2021-05-07 08:51:51.005368: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3053 of 4096
2021-05-07 08:51:56.178253: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3668 of 4096
2021-05-07 08:52:02.659789: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3339 of 4096
2021-05-07 08:52:07.187585: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3571 of 4096
2021-05-07 08:52:09.638476: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3717 of 4096
2021-05-07 08:52:10.763315: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3363 of 4096
2021-05-07 08:52:15.497225: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3572 of 4096
2021-05-07 08:52:15.839475: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3575 of 4096
2021-05-07 08:52:16.040486: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3765 of 4096
2021-05-07 08:52:22.537133: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3513 of 4096
2021-05-07 08:52:25.958448: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 4094 of 4096
2021-05-07 08:52:25.958889: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-07 08:52:27.550181: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3957 of 4096
2021-05-07 08:52:30.332923: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3739 of 4096
2021-05-07 08:52:34.178452: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -32 must be >= 0
2021-05-07 08:52:34.230711: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -63 must be >= 0
2021-05-07 08:52:34.231828: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -32 must be >= 0
2021-05-07 08:52:38.548901: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3958 of 4096
2021-05-07 08:52:41.928562: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3811 of 4096
2021-05-07 08:52:45.418771: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-07 08:52:47.049029: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -6 must be >= 0
2021-05-07 08:52:47.052097: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -18 must be >= 0
2021-05-07 08:52:47.060570: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.060793: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 34359738368
2021-05-07 08:52:47.061063: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 30923763712 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061091: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 30923763712
2021-05-07 08:52:47.061160: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 27831386112 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061182: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 27831386112
2021-05-07 08:52:47.061245: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 25048246272 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061266: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 25048246272
2021-05-07 08:52:47.061331: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 22543421440 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061351: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 22543421440
2021-05-07 08:52:47.061415: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 20289079296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061433: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 20289079296
2021-05-07 08:52:47.061495: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 18260170752 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061515: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 18260170752
2021-05-07 08:52:47.061579: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 16434153472 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061599: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 16434153472
2021-05-07 08:52:47.061663: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 14790737920 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061682: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 14790737920
2021-05-07 08:52:47.061747: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 13311664128 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061766: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 13311664128
2021-05-07 08:52:47.061830: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 11980496896 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061848: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 11980496896
2021-05-07 08:52:47.061911: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 10782446592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.061931: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 10782446592
2021-05-07 08:52:47.061993: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 9704201216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062012: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 9704201216
2021-05-07 08:52:47.062075: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 8733780992 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062094: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8733780992
2021-05-07 08:52:47.062159: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7860402688 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062178: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7860402688
2021-05-07 08:52:47.062240: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7074362368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062259: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7074362368
2021-05-07 08:52:47.062322: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 6366925824 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062342: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 6366925824
2021-05-07 08:52:47.062406: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 5730233344 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062425: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 5730233344
2021-05-07 08:52:47.062488: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 5157210112 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062507: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 5157210112
2021-05-07 08:52:47.062572: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 4641488896 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062591: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 4641488896
2021-05-07 08:52:47.062657: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 4177339904 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062676: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 4177339904
2021-05-07 08:52:47.062741: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 3759605760 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062761: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 3759605760
2021-05-07 08:52:47.062828: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 3383645184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062848: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 3383645184
2021-05-07 08:52:47.062914: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 3045280512 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.062933: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 3045280512
2021-05-07 08:52:47.062997: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 2740752384 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-05-07 08:52:47.063016: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 2740752384
2021-05-07 08:52:50.151184: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3956 of 4096
2021-05-07 08:52:51.880960: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-07 08:52:51.888347: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -31 must be >= 0
/usr/local/bin/tlt-train: line 32: 15458 Killed                  tlt-train-g1 ${PYTHON_ARGS[*]}
2021-05-07 08:53:16.984357: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -25 must be >= 0
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[33286,1],2]
  Exit code:    137
--------------------------------------------------------------------------

Here's the error log.

It is an OOM issue. Could you try a smaller image_size in the training spec and train again?
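
For reference when reading the log above: the failed allocations are for pinned host memory (system RAM), not GPU memory, and exit code 137 follows the shell convention of 128 plus a signal number. Below is a minimal Python sketch to decode both values. The helper names are only for illustration, and reading the SIGKILL as the Linux OOM killer is a likely interpretation, not something the log states explicitly.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Shell convention: a status above 128 means 'terminated by signal (code - 128)'."""
    if code > 128:
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"


def bytes_to_gib(num_bytes: int) -> float:
    """Convert a byte count from the log into GiB."""
    return num_bytes / 2**30


# Values copied from the log above.
print(describe_exit_code(137))        # killed by SIGKILL
print(bytes_to_gib(34359738368))      # 32.0 (first pinned host allocation attempt)
# Each retry in the log asks for roughly 0.9x the previous size.
print(round(30923763712 / 34359738368, 3))
```

So the first request is for 32 GiB of host memory, and the allocator keeps retrying with smaller sizes before the process is killed.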

I’ve tried image_size “(768, 1024)” and “(512, 800)”, but it still runs out of memory. However, I ran the notebook example before and trained a maskrcnn model successfully with the original image_size of “(832, 1344)”, which is larger than my current setting. Are you sure it’s an image size problem?

To narrow it down, please double-check the following.

  1. Does your training meet the requirements below?
  • Input size: C * W * H (where C = 3, W >= 128, H >= 128, and W, H are multiples of 32)
  • Image format: JPG
  • Label format: COCO detection
  2. Can you try training with the public dataset mentioned in the Jupyter notebook again?
  3. Try rebooting.
  4. Try training with a smaller network.
  5. Try training with a smaller image_size.
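
The input size requirement in the list above can be checked with a few lines of Python. This is only a sketch: `check_image_size` is a hypothetical helper, not part of TLT, and it assumes image_size is written as (height, width). It checks the stated rule only; it does not show that image_size is what causes the OOM.

```python
def check_image_size(height: int, width: int, multiple: int = 32, minimum: int = 128) -> list:
    """Return a list of rule violations for one image_size setting (empty list = OK)."""
    problems = []
    for name, value in (("H", height), ("W", width)):
        if value < minimum:
            problems.append(f"{name}={value} is smaller than {minimum}")
        if value % multiple != 0:
            problems.append(f"{name}={value} is not a multiple of {multiple}")
    return problems


# image_size values mentioned in this thread, assumed to be (height, width).
for size in [(1080, 1920), (768, 1024), (512, 800), (832, 1344)]:
    issues = check_image_size(*size)
    print(size, "OK" if not issues else issues)
```

Of the sizes mentioned in this thread, only (1080, 1920) fails the check, because 1080 is not a multiple of 32.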

More references on OOM issues:
Maskrcnn:

Other networks

OK, I’ll try.
Just to clarify, does the input size here mean the size of the input training images or the image_size setting in the spec file?

  • Input size: C * W * H (where C = 3, W >= 128, H >= 128, and W, H are multiples of 32)

The width and height of my input JPEG files are larger than the image_size setting in the spec. Should I resize them in advance?

  1. It means the image_size setting in the spec file.
  2. You do not need to resize them.

BTW, for TLT 2.0, I suggest setting W and H to multiples of 64 because of a known issue during export.
TLT 3.0 does not have this issue.
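
One way to pick a size that follows the multiples-of-64 suggestion is to round each side down to the nearest multiple of 64. This is a small sketch, assuming image_size is (height, width); `snap_down` and `snap_image_size` are made-up helper names, not TLT functions.

```python
def snap_down(value: int, multiple: int = 64) -> int:
    """Round value down to the nearest multiple of `multiple`."""
    return (value // multiple) * multiple


def snap_image_size(size: tuple, multiple: int = 64) -> tuple:
    """Apply snap_down to both sides of an image_size tuple."""
    height, width = size
    return (snap_down(height, multiple), snap_down(width, multiple))


# Sizes mentioned in this thread.
for size in [(1080, 1920), (768, 1024), (512, 800), (832, 1344)]:
    print(size, "->", snap_image_size(size))
```

For example, (1080, 1920) becomes (1024, 1920) and (512, 800) becomes (512, 768), while (768, 1024) and (832, 1344) are already multiples of 64.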

  1. Does your training meet the requirements below? Yes, I think so. Does the amount of data affect the OOM issue?
  • Input size: C * W * H (where C = 3, W >= 128, H >= 128, and W, H are multiples of 32)
  • Image format: JPG
  • Label format: COCO detection
  2. Can you try training with the public dataset mentioned in the Jupyter notebook again? Tried it; there seems to be no OOM issue.
  3. Try rebooting. Tried.
  4. Try training with a smaller network. Which parameters should I set to get a smaller network?
  5. Try training with a smaller image_size. Tried.

For example, use a resnet18 backbone instead of resnet50.

Also, can you share the files below with me? I want to double-check them.

training_file_pattern: “/workspace/tlt-experiments/mapillary/train*.tfrecord”
validation_file_pattern: “/workspace/tlt-experiments/mapillary/val*.tfrecord”
val_json_file: “/workspace/tlt-experiments/mapillary/annotations/instances_shape_validation2020_v1.2.json”

The train and val tfrecords are quite large, so I picked one of each to share:
https://drive.google.com/file/d/1an9zk83PC9ZaiG3LgMjMBfUllcPe-VSx/view?usp=sharing
https://drive.google.com/file/d/1e8ABhZFA5cY_bbfWc274E-WYFXv8AkW-/view?usp=sharing

Hi. I tried reducing my training and validation data to 1000 images for train and 500 for val. There’s no OOM error message now, but there are still some Dimension errors. What is this error related to?

BTW, I also tried the resnet18 backbone and still got the OOM and Dimension errors…
I’ve shared my latest data below. Would you please help check it? Thanks.

[MaskRCNN] INFO    : # ============================================= #
[MaskRCNN] INFO    :                  Start Training
[MaskRCNN] INFO    : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #

[GPU 00] Restoring pretrained weights (265 Tensors) from: /tmp/tmpudvf6gk7
[MaskRCNN] INFO    : Pretrained weights loaded with success...

2021-05-12 02:56:07.606306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 02:56:07.647420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt.
2021-05-12 02:56:19.211570: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2 of 4096
2021-05-12 02:56:21.431148: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3 of 4096
2021-05-12 02:56:23.471203: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-12 02:56:23.530412: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2021-05-12 02:56:23.609191: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -60 must be >= 0
2021-05-12 02:56:23.609187: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -10 must be >= 0
2021-05-12 02:56:23.609187: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -12 must be >= 0
2021-05-12 02:56:23.609229: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -4 must be >= 0
2021-05-12 02:56:23.609187: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -3 must be >= 0
2021-05-12 02:56:27.936694: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at constant_op.cc:170 : Invalid argument: Dimension -33 must be >= 0
2021-05-12 02:56:32.006527: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
/usr/local/bin/tlt-train: line 32:  3833 Killed                  tlt-train-g1 ${PYTHON_ARGS[*]}
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[45233,1],2]
  Exit code:    137
--------------------------------------------------------------------------

https://drive.google.com/file/d/1hkkb2pqtC1XxyR8eMmf2SAOj40PXzqXn/view?usp=sharing
https://drive.google.com/file/d/1bLwYALypKT7sA9t_BSXq5gcjEqiyNPV-/view?usp=sharing

Thanks for the info. It seems there is still an OOM issue, based on the message above.
I will take a look further.
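
In the meantime, one thing that may be worth checking for the "Dimension -N must be >= 0" warnings is how many instances each image has. This is only a hypothesis, not a confirmed cause: if the data loader pads ground truth to a fixed number of instances, an image with more instances than that limit would produce a negative padding length. The sketch below counts annotations per image in a COCO-style dictionary; the `cap` value is a placeholder assumption, not a documented TLT setting.

```python
from collections import Counter


def instances_per_image(coco: dict) -> Counter:
    """Count annotations per image_id in a COCO-style dict (as loaded with json.load)."""
    return Counter(ann["image_id"] for ann in coco.get("annotations", []))


def images_over_cap(coco: dict, cap: int) -> dict:
    """Return {image_id: count} for images with more than `cap` annotations."""
    return {img: n for img, n in instances_per_image(coco).items() if n > cap}


# Tiny in-memory example; for a real file, load it first with json.load(open(path)).
sample = {
    "annotations": [
        {"image_id": 1}, {"image_id": 1}, {"image_id": 1},
        {"image_id": 2},
    ]
}
print(images_over_cap(sample, cap=2))
```

If many Mapillary images exceed whatever limit the loader uses, that would be consistent with the warnings, but it would still need to be confirmed against the TLT data loader.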

Is it possible for you to try the TLT 3.0-dp docker?

BTW, with the 2.0_py3 docker, did you ever train on the Mapillary Vistas dataset with the TLT faster_rcnn network?

Also, with the 2.0_py3 docker, did you run into any errors when training a Maskrcnn model with the COCO dataset mentioned in the Jupyter notebook?

I believe the TLT 3.0-dp docker requires CUDA >= 11.1, but my CUDA is 10.1. Are you sure this kind of OOM issue doesn’t occur in the TLT 3.0-dp docker?

No, I didn’t try the faster_rcnn network. I just need to train an instance segmentation model on street-level data. I think maskrcnn is the best choice?

The maskrcnn training with the COCO dataset in the Jupyter notebook works well for me. I trained a model, exported it, transferred it, and ran it in DeepStream successfully.

Thanks for the info. I will try to reproduce your issue.

Hi,
I can reproduce the OOM error when training with one GeForce GTX 1080 Ti. I saw you mention that you are running with 3 GPUs (GeForce GTX 1080). Can you confirm? Did you ever run with 3 GPUs?