Working with the modified paramters for 2 GPUS and the increase to 720,000 which brings another problem into focus. Getting an error message after X number of iterations it exits out. In this example it was approximatly 12,000 iterations with train_batch_size: 1. If I try and run with train_batch_size 2 get immediate out of memory error.
Tail the log file in the unpruned directory nothing unusual.
Any guidance would be appreciated as it will take lots of restarts to get 720,000 iterations where train_batch_size = 1
Tue Dec 22 14:08:14 2020
±----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------±---------------------±---------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 28% 35C P8 6W / 180W | 570MiB / 8116MiB | 0% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+
| 1 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 27% 35C P8 5W / 180W | 7MiB / 8119MiB | 0% Default |
| | | N/A |
±------------------------------±---------------------±---------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 879 G /usr/lib/xorg/Xorg 344MiB |
| 0 N/A N/A 1078 G /usr/bin/gnome-shell 222MiB |
| 1 N/A N/A 879 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
To resume training from a checkpoint, simply run the same training script. It will pick up from where it’s left.
2020-12-22 16:02:18.347718: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:18.347706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[MaskRCNN] INFO : Loading weights from /workspace/tlt-experiments/maskrcnn/exp1/experiment_dir_unpruned/model.step-60000.tlt
[MaskRCNN] INFO : Loading weights from /workspace/tlt-experiments/maskrcnn/exp1/experiment_dir_unpruned/model.step-60000.tlt
[MaskRCNN] INFO : Horovod successfully initialized …
[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 02
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Using Dataset Sharding with Horovod
2020-12-22 16:02:41.815815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-12-22 16:02:41.815822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-12-22 16:02:42.260555: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:42.260560: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:42.262235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2020-12-22 16:02:42.262280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:42.262341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
2020-12-22 16:02:42.262384: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:43.219502: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:02:43.219500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:02:43.244121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-22 16:02:43.244114: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-22 16:02:43.251738: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-22 16:02:43.251766: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-22 16:02:43.314964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-22 16:02:43.314953: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-22 16:02:43.351668: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-22 16:02:43.351722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-22 16:02:43.460949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:02:43.460917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:02:43.461392: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:43.461392: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:43.465569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:43.465691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:43.469044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-12-22 16:02:43.469088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_6/
2020-12-22 16:02:47.015848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:47.016443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
2020-12-22 16:02:47.016473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:47.016519: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:02:47.016538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-22 16:02:47.016554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-22 16:02:47.016571: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-22 16:02:47.016587: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-22 16:02:47.016604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:02:47.016663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:47.017231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:47.017684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-12-22 16:02:47.045055: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:47.045655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2020-12-22 16:02:47.045698: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:47.045741: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:02:47.045760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-22 16:02:47.045776: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-22 16:02:47.045792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-22 16:02:47.045809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-22 16:02:47.045826: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:02:47.045885: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:47.046482: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:47.046938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-12-22 16:02:47.047234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:47.047216: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:49.255618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-22 16:02:49.255657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-12-22 16:02:49.255664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-12-22 16:02:49.255853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:49.256314: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:49.256675: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:49.256918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-22 16:02:49.256936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-12-22 16:02:49.256946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-12-22 16:02:49.257069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6561 MB memory) → physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-12-22 16:02:49.257107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:49.257497: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:49.257848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:49.258217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7127 MB memory) → physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 547.2 GFLOPS/image
2020-12-22 16:02:56.309606: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:56.309878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
2020-12-22 16:02:56.309910: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:56.309976: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:02:56.310002: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-22 16:02:56.310023: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-22 16:02:56.310043: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-22 16:02:56.310064: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-22 16:02:56.310084: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:02:56.310149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:56.310401: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:56.310610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-12-22 16:02:56.310631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-22 16:02:56.310638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-12-22 16:02:56.310643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-12-22 16:02:56.310708: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:56.310957: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:56.311173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7127 MB memory) → physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-12-22 16:02:59.242501: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:59.242819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
2020-12-22 16:02:59.242859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-22 16:02:59.242906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:02:59.242926: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-22 16:02:59.242949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-22 16:02:59.242966: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-22 16:02:59.242983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-22 16:02:59.243000: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:02:59.243059: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:59.243344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:59.243563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-12-22 16:02:59.243587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-22 16:02:59.243594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-12-22 16:02:59.243602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-12-22 16:02:59.243683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:59.243953: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-22 16:02:59.244180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6561 MB memory) → physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-12-22 16:03:01.747991: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
2020-12-22 16:03:04.204327: W tensorflow/core/framework/dataset.cc:382] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
[MaskRCNN] INFO : ============================ GIT REPOSITORY ============================
[MaskRCNN] INFO : BRANCH NAME:
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[MaskRCNN] INFO : ============================ MODEL STATISTICS ===========================
[MaskRCNN] INFO : # Model Weights: 44,507,633
[MaskRCNN] INFO : # Trainable Weights: 44,454,513
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[MaskRCNN] INFO : ============================ TRAINABLE VARIABLES ========================
[MaskRCNN] INFO : [#0001] conv1/kernel:0 => (7, 7, 3, 64)
[MaskRCNN] INFO : [#0002] bn_conv1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0003] bn_conv1/beta:0 => (64,)
[MaskRCNN] INFO : [#0004] block_1a_conv_1/kernel:0 => (1, 1, 64, 64)
[MaskRCNN] INFO : [#0005] block_1a_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0006] block_1a_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0007] block_1a_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0008] block_1a_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0009] block_1a_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0010] block_1a_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0011] block_1a_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0012] block_1a_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0013] block_1a_conv_shortcut/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0014] block_1a_bn_shortcut/gamma:0 => (256,)
[MaskRCNN] INFO : [#0015] block_1a_bn_shortcut/beta:0 => (256,)
[MaskRCNN] INFO : [#0016] block_1b_conv_1/kernel:0 => (1, 1, 256, 64)
[MaskRCNN] INFO : [#0017] block_1b_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0018] block_1b_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0019] block_1b_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0020] block_1b_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0021] block_1b_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0022] block_1b_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0023] block_1b_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0024] block_1b_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0025] block_1c_conv_1/kernel:0 => (1, 1, 256, 64)
[MaskRCNN] INFO : [#0026] block_1c_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0027] block_1c_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0028] block_1c_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0029] block_1c_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0030] block_1c_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0031] block_1c_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0032] block_1c_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0033] block_1c_bn_3/beta:0 => (256,)
[MaskRCNN] INFO : [#0034] block_2a_conv_1/kernel:0 => (1, 1, 256, 128)
[MaskRCNN] INFO : [#0035] block_2a_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0036] block_2a_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0037] block_2a_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0038] block_2a_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0039] block_2a_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0040] block_2a_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0041] block_2a_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0042] block_2a_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0043] block_2a_conv_shortcut/kernel:0 => (1, 1, 256, 512)
[MaskRCNN] INFO : [#0044] block_2a_bn_shortcut/gamma:0 => (512,)
[MaskRCNN] INFO : [#0045] block_2a_bn_shortcut/beta:0 => (512,)
[MaskRCNN] INFO : [#0046] block_2b_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0047] block_2b_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0048] block_2b_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0049] block_2b_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0050] block_2b_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0051] block_2b_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0052] block_2b_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0053] block_2b_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0054] block_2b_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0055] block_2c_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0056] block_2c_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0057] block_2c_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0058] block_2c_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0059] block_2c_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0060] block_2c_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0061] block_2c_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0062] block_2c_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0063] block_2c_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0064] block_2d_conv_1/kernel:0 => (1, 1, 512, 128)
[MaskRCNN] INFO : [#0065] block_2d_bn_1/gamma:0 => (128,)
[MaskRCNN] INFO : [#0066] block_2d_bn_1/beta:0 => (128,)
[MaskRCNN] INFO : [#0067] block_2d_conv_2/kernel:0 => (3, 3, 128, 128)
[MaskRCNN] INFO : [#0068] block_2d_bn_2/gamma:0 => (128,)
[MaskRCNN] INFO : [#0069] block_2d_bn_2/beta:0 => (128,)
[MaskRCNN] INFO : [#0070] block_2d_conv_3/kernel:0 => (1, 1, 128, 512)
[MaskRCNN] INFO : [#0071] block_2d_bn_3/gamma:0 => (512,)
[MaskRCNN] INFO : [#0072] block_2d_bn_3/beta:0 => (512,)
[MaskRCNN] INFO : [#0073] block_3a_conv_1/kernel:0 => (1, 1, 512, 256)
[MaskRCNN] INFO : [#0074] block_3a_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0075] block_3a_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0076] block_3a_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0077] block_3a_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0078] block_3a_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0079] block_3a_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0080] block_3a_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0081] block_3a_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0082] block_3a_conv_shortcut/kernel:0 => (1, 1, 512, 1024)
[MaskRCNN] INFO : [#0083] block_3a_bn_shortcut/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0084] block_3a_bn_shortcut/beta:0 => (1024,)
[MaskRCNN] INFO : [#0085] block_3b_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0086] block_3b_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0087] block_3b_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0088] block_3b_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0089] block_3b_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0090] block_3b_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0091] block_3b_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0092] block_3b_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0093] block_3b_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0094] block_3c_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0095] block_3c_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0096] block_3c_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0097] block_3c_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0098] block_3c_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0099] block_3c_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0100] block_3c_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0101] block_3c_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0102] block_3c_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0103] block_3d_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0104] block_3d_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0105] block_3d_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0106] block_3d_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0107] block_3d_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0108] block_3d_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0109] block_3d_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0110] block_3d_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0111] block_3d_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0112] block_3e_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0113] block_3e_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0114] block_3e_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0115] block_3e_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0116] block_3e_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0117] block_3e_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0118] block_3e_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0119] block_3e_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0120] block_3e_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0121] block_3f_conv_1/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0122] block_3f_bn_1/gamma:0 => (256,)
[MaskRCNN] INFO : [#0123] block_3f_bn_1/beta:0 => (256,)
[MaskRCNN] INFO : [#0124] block_3f_conv_2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0125] block_3f_bn_2/gamma:0 => (256,)
[MaskRCNN] INFO : [#0126] block_3f_bn_2/beta:0 => (256,)
[MaskRCNN] INFO : [#0127] block_3f_conv_3/kernel:0 => (1, 1, 256, 1024)
[MaskRCNN] INFO : [#0128] block_3f_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0129] block_3f_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0130] block_4a_conv_1/kernel:0 => (1, 1, 1024, 512)
[MaskRCNN] INFO : [#0131] block_4a_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0132] block_4a_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0133] block_4a_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0134] block_4a_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0135] block_4a_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0136] block_4a_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0137] block_4a_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0138] block_4a_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0139] block_4a_conv_shortcut/kernel:0 => (1, 1, 1024, 2048)
[MaskRCNN] INFO : [#0140] block_4a_bn_shortcut/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0141] block_4a_bn_shortcut/beta:0 => (2048,)
[MaskRCNN] INFO : [#0142] block_4b_conv_1/kernel:0 => (1, 1, 2048, 512)
[MaskRCNN] INFO : [#0143] block_4b_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0144] block_4b_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0145] block_4b_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0146] block_4b_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0147] block_4b_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0148] block_4b_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0149] block_4b_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0150] block_4b_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0151] block_4c_conv_1/kernel:0 => (1, 1, 2048, 512)
[MaskRCNN] INFO : [#0152] block_4c_bn_1/gamma:0 => (512,)
[MaskRCNN] INFO : [#0153] block_4c_bn_1/beta:0 => (512,)
[MaskRCNN] INFO : [#0154] block_4c_conv_2/kernel:0 => (3, 3, 512, 512)
[MaskRCNN] INFO : [#0155] block_4c_bn_2/gamma:0 => (512,)
[MaskRCNN] INFO : [#0156] block_4c_bn_2/beta:0 => (512,)
[MaskRCNN] INFO : [#0157] block_4c_conv_3/kernel:0 => (1, 1, 512, 2048)
[MaskRCNN] INFO : [#0158] block_4c_bn_3/gamma:0 => (2048,)
[MaskRCNN] INFO : [#0159] block_4c_bn_3/beta:0 => (2048,)
[MaskRCNN] INFO : [#0160] fpn/l2/kernel:0 => (1, 1, 256, 256)
[MaskRCNN] INFO : [#0161] fpn/l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0162] fpn/l3/kernel:0 => (1, 1, 512, 256)
[MaskRCNN] INFO : [#0163] fpn/l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0164] fpn/l4/kernel:0 => (1, 1, 1024, 256)
[MaskRCNN] INFO : [#0165] fpn/l4/bias:0 => (256,)
[MaskRCNN] INFO : [#0166] fpn/l5/kernel:0 => (1, 1, 2048, 256)
[MaskRCNN] INFO : [#0167] fpn/l5/bias:0 => (256,)
[MaskRCNN] INFO : [#0168] fpn/post_hoc_d2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0169] fpn/post_hoc_d2/bias:0 => (256,)
[MaskRCNN] INFO : [#0170] fpn/post_hoc_d3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0171] fpn/post_hoc_d3/bias:0 => (256,)
[MaskRCNN] INFO : [#0172] fpn/post_hoc_d4/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0173] fpn/post_hoc_d4/bias:0 => (256,)
[MaskRCNN] INFO : [#0174] fpn/post_hoc_d5/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0175] fpn/post_hoc_d5/bias:0 => (256,)
[MaskRCNN] INFO : [#0176] rpn_head/rpn/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0177] rpn_head/rpn/bias:0 => (256,)
[MaskRCNN] INFO : [#0178] rpn_head/rpn-class/kernel:0 => (1, 1, 256, 3)
[MaskRCNN] INFO : [#0179] rpn_head/rpn-class/bias:0 => (3,)
[MaskRCNN] INFO : [#0180] rpn_head/rpn-box/kernel:0 => (1, 1, 256, 12)
[MaskRCNN] INFO : [#0181] rpn_head/rpn-box/bias:0 => (12,)
[MaskRCNN] INFO : [#0182] box_head/fc6/kernel:0 => (12544, 1024)
[MaskRCNN] INFO : [#0183] box_head/fc6/bias:0 => (1024,)
[MaskRCNN] INFO : [#0184] box_head/fc7/kernel:0 => (1024, 1024)
[MaskRCNN] INFO : [#0185] box_head/fc7/bias:0 => (1024,)
[MaskRCNN] INFO : [#0186] box_head/class-predict/kernel:0 => (1024, 91)
[MaskRCNN] INFO : [#0187] box_head/class-predict/bias:0 => (91,)
[MaskRCNN] INFO : [#0188] box_head/box-predict/kernel:0 => (1024, 364)
[MaskRCNN] INFO : [#0189] box_head/box-predict/bias:0 => (364,)
[MaskRCNN] INFO : [#0190] mask_head/mask-conv-l0/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0191] mask_head/mask-conv-l0/bias:0 => (256,)
[MaskRCNN] INFO : [#0192] mask_head/mask-conv-l1/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0193] mask_head/mask-conv-l1/bias:0 => (256,)
[MaskRCNN] INFO : [#0194] mask_head/mask-conv-l2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0195] mask_head/mask-conv-l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0196] mask_head/mask-conv-l3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0197] mask_head/mask-conv-l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0198] mask_head/conv5-mask/kernel:0 => (2, 2, 256, 256)
[MaskRCNN] INFO : [#0199] mask_head/conv5-mask/bias:0 => (256,)
[MaskRCNN] INFO : [#0200] mask_head/mask_fcn_logits/kernel:0 => (1, 1, 256, 91)
[MaskRCNN] INFO : [#0201] mask_head/mask_fcn_logits/bias:0 => (91,)
[MaskRCNN] INFO : %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[MaskRCNN] INFO : # ============================================= #
[MaskRCNN] INFO : Start Training
[MaskRCNN] INFO : # %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #
[GPU 00] Restoring pretrained weights (307 Tensors) from: /tmp/tmpwf8z9gf4/model.ckpt-60000
[MaskRCNN] INFO : Pretrained weights loaded with success…
2020-12-22 16:03:09.270489: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:03:20.011847: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 117 of 4096
[MaskRCNN] INFO : Saving checkpoints for 60000 into /workspace/tlt-experiments/maskrcnn/exp1/experiment_dir_unpruned/model.step-60000.tlt.
2020-12-22 16:03:30.164571: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 3248 of 4096
2020-12-22 16:03:30.809275: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2020-12-22 16:03:32.451011: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-22 16:03:38.965598: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-22 16:03:49.359992: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 38 of 4096
2020-12-22 16:03:59.424414: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 2812 of 4096
2020-12-22 16:04:01.651668: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:195] Shuffle buffer filled.
2020-12-22 16:04:02.709640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[MaskRCNN] INFO : timestamp: 1608653057.8927548
[MaskRCNN] INFO : iteration: 60005
DLL 2020-12-22 16:04:17.915451 - iteration : 60005
[MaskRCNN] INFO : throughput: 1.0 samples/sec
DLL 2020-12-22 16:04:17.915634 - Iteration: 60005 throughput : 0.9736721380764456
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.24076
DLL 2020-12-22 16:04:17.916156 - Iteration: 60005 FastRCNN box loss : 0.24076
[MaskRCNN] INFO : FastRCNN class loss: 0.07063
DLL 2020-12-22 16:04:17.916258 - Iteration: 60005 FastRCNN class loss : 0.07063
[MaskRCNN] INFO : FastRCNN total loss: 0.31138
DLL 2020-12-22 16:04:17.916354 - Iteration: 60005 FastRCNN total loss : 0.31138
[MaskRCNN] INFO : L2 loss: 0.36905
DLL 2020-12-22 16:04:17.916449 - Iteration: 60005 L2 loss : 0.36905
[MaskRCNN] INFO : Learning rate: 0.005
DLL 2020-12-22 16:04:17.916545 - Iteration: 60005 Learning rate : 0.005
[MaskRCNN] INFO : Mask loss: 0.37652
DLL 2020-12-22 16:04:17.916638 - Iteration: 60005 Mask loss : 0.37652
[MaskRCNN] INFO : RPN box loss: 0.0137
DLL 2020-12-22 16:04:17.916729 - Iteration: 60005 RPN box loss : 0.0137
[MaskRCNN] INFO : RPN score loss: 0.01402
DLL 2020-12-22 16:04:17.916819 - Iteration: 60005 RPN score loss : 0.01402
[MaskRCNN] INFO : RPN total loss: 0.02772
DLL 2020-12-22 16:04:17.916908 - Iteration: 60005 RPN total loss : 0.02772
[MaskRCNN] INFO : Total loss: 1.08468
DLL 2020-12-22 16:04:17.916998 - Iteration: 60005 Total loss : 1.08468
[MaskRCNN] INFO : timestamp: 1608653060.737262
[MaskRCNN] INFO : iteration: 60010
DLL 2020-12-22 16:04:20.737456 - iteration : 60010
[MaskRCNN] INFO : throughput: 2.3 samples/sec
DLL 2020-12-22 16:04:20.737564 - Iteration: 60010 throughput : 2.3231455686435747
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.51139
DLL 2020-12-22 16:04:20.737988 - Iteration: 60010 FastRCNN box loss : 0.51139
[MaskRCNN] INFO : FastRCNN class loss: 0.34193
DLL 2020-12-22 16:04:20.738092 - Iteration: 60010 FastRCNN class loss : 0.34193
[MaskRCNN] INFO : FastRCNN total loss: 0.85332
DLL 2020-12-22 16:04:20.738192 - Iteration: 60010 FastRCNN total loss : 0.85332
[MaskRCNN] INFO : L2 loss: 0.36905
DLL 2020-12-22 16:04:20.738287 - Iteration: 60010 L2 loss : 0.36905
[MaskRCNN] INFO : Learning rate: 0.005
DLL 2020-12-22 16:04:20.738436 - Iteration: 60010 Learning rate : 0.005
[MaskRCNN] INFO : Mask loss: 0.52758
DLL 2020-12-22 16:04:20.738553 - Iteration: 60010 Mask loss : 0.52758
[MaskRCNN] INFO : RPN box loss: 0.05652
DLL 2020-12-22 16:04:20.738649 - Iteration: 60010 RPN box loss : 0.05652
[MaskRCNN] INFO : RPN score loss: 0.0378
DLL 2020-12-22 16:04:20.738742 - Iteration: 60010 RPN score loss : 0.0378
[MaskRCNN] INFO : RPN total loss: 0.09432
DLL 2020-12-22 16:04:20.738835 - Iteration: 60010 RPN total loss : 0.09432
[MaskRCNN] INFO : Total loss: 1.84427
DLL 2020-12-22 16:04:20.738927 - Iteration: 60010 Total loss : 1.84427
A COUPLE ITERATIONS AT THE END BEFORE THE ERROR
[MaskRCNN] INFO : timestamp: 1608659858.2925525
[MaskRCNN] INFO : iteration: 72940
DLL 2020-12-22 17:57:38.405440 - iteration : 72940
[MaskRCNN] INFO : throughput: 3.9 samples/sec
DLL 2020-12-22 17:57:38.532309 - Iteration: 72940 throughput : 3.872481777341103
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.34174
DLL 2020-12-22 17:57:38.578164 - Iteration: 72940 FastRCNN box loss : 0.34174
[MaskRCNN] INFO : FastRCNN class loss: 0.7384
DLL 2020-12-22 17:57:38.579095 - Iteration: 72940 FastRCNN class loss : 0.7384
[MaskRCNN] INFO : FastRCNN total loss: 1.08015
DLL 2020-12-22 17:57:38.579924 - Iteration: 72940 FastRCNN total loss : 1.08015
[MaskRCNN] INFO : L2 loss: 0.36534
DLL 2020-12-22 17:57:38.581276 - Iteration: 72940 L2 loss : 0.36534
[MaskRCNN] INFO : Learning rate: 0.005
DLL 2020-12-22 17:57:38.582406 - Iteration: 72940 Learning rate : 0.005
[MaskRCNN] INFO : Mask loss: 0.48719
DLL 2020-12-22 17:57:38.583263 - Iteration: 72940 Mask loss : 0.48719
[MaskRCNN] INFO : RPN box loss: 0.03276
DLL 2020-12-22 17:57:38.583972 - Iteration: 72940 RPN box loss : 0.03276
[MaskRCNN] INFO : RPN score loss: 0.04517
DLL 2020-12-22 17:57:38.584498 - Iteration: 72940 RPN score loss : 0.04517
[MaskRCNN] INFO : RPN total loss: 0.07792
DLL 2020-12-22 17:57:38.585049 - Iteration: 72940 RPN total loss : 0.07792
[MaskRCNN] INFO : Total loss: 2.0106
DLL 2020-12-22 17:57:38.585582 - Iteration: 72940 Total loss : 2.0106
[MaskRCNN] INFO : timestamp: 1608660015.4845078
[MaskRCNN] INFO : iteration: 72945
DLL 2020-12-22 18:00:15.533471 - iteration : 72945
[MaskRCNN] INFO : throughput: 3.9 samples/sec
DLL 2020-12-22 18:00:15.540926 - Iteration: 72945 throughput : 3.8563356106862883
[MaskRCNN] INFO : ==================== Metrics =====================
[MaskRCNN] INFO : FastRCNN box loss: 0.53189
DLL 2020-12-22 18:00:15.791674 - Iteration: 72945 FastRCNN box loss : 0.53189
[MaskRCNN] INFO : FastRCNN class loss: 0.24809
DLL 2020-12-22 18:00:15.941710 - Iteration: 72945 FastRCNN class loss : 0.24809
[MaskRCNN] INFO : FastRCNN total loss: 0.77998
DLL 2020-12-22 18:00:15.945204 - Iteration: 72945 FastRCNN total loss : 0.77998
[MaskRCNN] INFO : L2 loss: 0.36534
DLL 2020-12-22 18:00:15.945805 - Iteration: 72945 L2 loss : 0.36534
[MaskRCNN] INFO : Learning rate: 0.005
DLL 2020-12-22 18:00:15.946381 - Iteration: 72945 Learning rate : 0.005
[MaskRCNN] INFO : Mask loss: 0.45454
DLL 2020-12-22 18:00:15.946936 - Iteration: 72945 Mask loss : 0.45454
[MaskRCNN] INFO : RPN box loss: 0.02068
DLL 2020-12-22 18:00:15.947430 - Iteration: 72945 RPN box loss : 0.02068
[MaskRCNN] INFO : RPN score loss: 0.04317
DLL 2020-12-22 18:00:15.947910 - Iteration: 72945 RPN score loss : 0.04317
[MaskRCNN] INFO : RPN total loss: 0.06384
DLL 2020-12-22 18:00:15.948354 - Iteration: 72945 RPN total loss : 0.06384
[MaskRCNN] INFO : Total loss: 1.66371
DLL 2020-12-22 18:00:15.948847 - Iteration: 72945 Total loss : 1.66371
/usr/local/bin/tlt-train: line 32: 13439 Killed tlt-train-g1 ${PYTHON_ARGS[*]}
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[13466,1],0]
Exit code: 137