@Morganh I followed your instructions. I assume this is because of a checkpoint I forgot to delete, but have a look:
root@eda82919eac9:/data# tlt-train detectnet_v2 -e ./train.txt -r ./trained -k KEY
Using TensorFlow backend.
2021-01-16 15:16:05.507174: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
--------------------------------------------------------------------------
[[15188,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: eda82919eac9
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2021-01-16 15:16:08.114791: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-16 15:16:08.138756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.139642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
2021-01-16 15:16:08.139681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:16:08.139761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-16 15:16:08.140974: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-16 15:16:08.141370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-16 15:16:08.143093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-16 15:16:08.144344: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-16 15:16:08.144428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-16 15:16:08.144561: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.145483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.146308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-01-16 15:16:08.146355: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:16:08.901755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-16 15:16:08.901806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-01-16 15:16:08.901822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-01-16 15:16:08.902105: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.903116: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.904017: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.904840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13906 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2021-01-16 15:16:08,905 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at ./train.txt.
2021-01-16 15:16:08,907 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from ./train.txt
2021-01-16 15:16:09,585 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 7073 samples with a batch size of 16; each epoch will therefore take one extra step.
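(Side note on the message above: as far as I can tell, the "one extra step" is just the remainder of 7073 divided by 16. Here is a quick sketch of that arithmetic. The variable names are mine, and I am assuming the extra step covers the leftover samples; the log does not say whether the trainer pads, wraps around, or runs a smaller final batch.)

```python
import math

# Values copied from the log line above.
num_samples = 7073
batch_size = 16

# Number of full batches and the samples left over after them.
full_batches, leftover = divmod(num_samples, batch_size)

# Assumption: one additional step is taken to cover the leftover samples.
steps_per_epoch = math.ceil(num_samples / batch_size)

print(f"full batches: {full_batches}, leftover samples: {leftover}")
print(f"steps per epoch: {steps_per_epoch}")
```

If that reading is right, a dataset size divisible by the batch size would make the message go away, but it looks informational rather than an error.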
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 3, 512, 512) 0
__________________________________________________________________________________________________
conv1 (Conv2D) (None, 64, 256, 256) 9472 input_1[0][0]
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization) (None, 64, 256, 256) 256 conv1[0][0]
__________________________________________________________________________________________________
activation_1 (Activation) (None, 64, 256, 256) 0 bn_conv1[0][0]
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D) (None, 64, 128, 128) 4160 activation_1[0][0]
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 128, 128) 256 block_1a_conv_1[0][0]
__________________________________________________________________________________________________
block_1a_relu_1 (Activation) (None, 64, 128, 128) 0 block_1a_bn_1[0][0]
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D) (None, 64, 128, 128) 36928 block_1a_relu_1[0][0]
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 128, 128) 256 block_1a_conv_2[0][0]
__________________________________________________________________________________________________
block_1a_relu_2 (Activation) (None, 64, 128, 128) 0 block_1a_bn_2[0][0]
__________________________________________________________________________________________________
block_1a_conv_3 (Conv2D) (None, 256, 128, 128 16640 block_1a_relu_2[0][0]
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 256, 128, 128 16640 activation_1[0][0]
__________________________________________________________________________________________________
block_1a_bn_3 (BatchNormalizati (None, 256, 128, 128 1024 block_1a_conv_3[0][0]
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 256, 128, 128 1024 block_1a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_1 (Add) (None, 256, 128, 128 0 block_1a_bn_3[0][0]
block_1a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_1a_relu (Activation) (None, 256, 128, 128 0 add_1[0][0]
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D) (None, 64, 128, 128) 16448 block_1a_relu[0][0]
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 128, 128) 256 block_1b_conv_1[0][0]
__________________________________________________________________________________________________
block_1b_relu_1 (Activation) (None, 64, 128, 128) 0 block_1b_bn_1[0][0]
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D) (None, 64, 128, 128) 36928 block_1b_relu_1[0][0]
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 128, 128) 256 block_1b_conv_2[0][0]
__________________________________________________________________________________________________
block_1b_relu_2 (Activation) (None, 64, 128, 128) 0 block_1b_bn_2[0][0]
__________________________________________________________________________________________________
block_1b_conv_3 (Conv2D) (None, 256, 128, 128 16640 block_1b_relu_2[0][0]
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (None, 256, 128, 128 65792 block_1a_relu[0][0]
__________________________________________________________________________________________________
block_1b_bn_3 (BatchNormalizati (None, 256, 128, 128 1024 block_1b_conv_3[0][0]
__________________________________________________________________________________________________
block_1b_bn_shortcut (BatchNorm (None, 256, 128, 128 1024 block_1b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_2 (Add) (None, 256, 128, 128 0 block_1b_bn_3[0][0]
block_1b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_1b_relu (Activation) (None, 256, 128, 128 0 add_2[0][0]
__________________________________________________________________________________________________
block_1c_conv_1 (Conv2D) (None, 64, 128, 128) 16448 block_1b_relu[0][0]
__________________________________________________________________________________________________
block_1c_bn_1 (BatchNormalizati (None, 64, 128, 128) 256 block_1c_conv_1[0][0]
__________________________________________________________________________________________________
block_1c_relu_1 (Activation) (None, 64, 128, 128) 0 block_1c_bn_1[0][0]
__________________________________________________________________________________________________
block_1c_conv_2 (Conv2D) (None, 64, 128, 128) 36928 block_1c_relu_1[0][0]
__________________________________________________________________________________________________
block_1c_bn_2 (BatchNormalizati (None, 64, 128, 128) 256 block_1c_conv_2[0][0]
__________________________________________________________________________________________________
block_1c_relu_2 (Activation) (None, 64, 128, 128) 0 block_1c_bn_2[0][0]
__________________________________________________________________________________________________
block_1c_conv_3 (Conv2D) (None, 256, 128, 128 16640 block_1c_relu_2[0][0]
__________________________________________________________________________________________________
block_1c_conv_shortcut (Conv2D) (None, 256, 128, 128 65792 block_1b_relu[0][0]
__________________________________________________________________________________________________
block_1c_bn_3 (BatchNormalizati (None, 256, 128, 128 1024 block_1c_conv_3[0][0]
__________________________________________________________________________________________________
block_1c_bn_shortcut (BatchNorm (None, 256, 128, 128 1024 block_1c_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_3 (Add) (None, 256, 128, 128 0 block_1c_bn_3[0][0]
block_1c_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_1c_relu (Activation) (None, 256, 128, 128 0 add_3[0][0]
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D) (None, 128, 64, 64) 32896 block_1c_relu[0][0]
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 64, 64) 512 block_2a_conv_1[0][0]
__________________________________________________________________________________________________
block_2a_relu_1 (Activation) (None, 128, 64, 64) 0 block_2a_bn_1[0][0]
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D) (None, 128, 64, 64) 147584 block_2a_relu_1[0][0]
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 64, 64) 512 block_2a_conv_2[0][0]
__________________________________________________________________________________________________
block_2a_relu_2 (Activation) (None, 128, 64, 64) 0 block_2a_bn_2[0][0]
__________________________________________________________________________________________________
block_2a_conv_3 (Conv2D) (None, 512, 64, 64) 66048 block_2a_relu_2[0][0]
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 512, 64, 64) 131584 block_1c_relu[0][0]
__________________________________________________________________________________________________
block_2a_bn_3 (BatchNormalizati (None, 512, 64, 64) 2048 block_2a_conv_3[0][0]
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 512, 64, 64) 2048 block_2a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_4 (Add) (None, 512, 64, 64) 0 block_2a_bn_3[0][0]
block_2a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_2a_relu (Activation) (None, 512, 64, 64) 0 add_4[0][0]
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D) (None, 128, 64, 64) 65664 block_2a_relu[0][0]
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 64, 64) 512 block_2b_conv_1[0][0]
__________________________________________________________________________________________________
block_2b_relu_1 (Activation) (None, 128, 64, 64) 0 block_2b_bn_1[0][0]
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D) (None, 128, 64, 64) 147584 block_2b_relu_1[0][0]
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 64, 64) 512 block_2b_conv_2[0][0]
__________________________________________________________________________________________________
block_2b_relu_2 (Activation) (None, 128, 64, 64) 0 block_2b_bn_2[0][0]
__________________________________________________________________________________________________
block_2b_conv_3 (Conv2D) (None, 512, 64, 64) 66048 block_2b_relu_2[0][0]
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (None, 512, 64, 64) 262656 block_2a_relu[0][0]
__________________________________________________________________________________________________
block_2b_bn_3 (BatchNormalizati (None, 512, 64, 64) 2048 block_2b_conv_3[0][0]
__________________________________________________________________________________________________
block_2b_bn_shortcut (BatchNorm (None, 512, 64, 64) 2048 block_2b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_5 (Add) (None, 512, 64, 64) 0 block_2b_bn_3[0][0]
block_2b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_2b_relu (Activation) (None, 512, 64, 64) 0 add_5[0][0]
__________________________________________________________________________________________________
block_2c_conv_1 (Conv2D) (None, 128, 64, 64) 65664 block_2b_relu[0][0]
__________________________________________________________________________________________________
block_2c_bn_1 (BatchNormalizati (None, 128, 64, 64) 512 block_2c_conv_1[0][0]
__________________________________________________________________________________________________
block_2c_relu_1 (Activation) (None, 128, 64, 64) 0 block_2c_bn_1[0][0]
__________________________________________________________________________________________________
block_2c_conv_2 (Conv2D) (None, 128, 64, 64) 147584 block_2c_relu_1[0][0]
__________________________________________________________________________________________________
block_2c_bn_2 (BatchNormalizati (None, 128, 64, 64) 512 block_2c_conv_2[0][0]
__________________________________________________________________________________________________
block_2c_relu_2 (Activation) (None, 128, 64, 64) 0 block_2c_bn_2[0][0]
__________________________________________________________________________________________________
block_2c_conv_3 (Conv2D) (None, 512, 64, 64) 66048 block_2c_relu_2[0][0]
__________________________________________________________________________________________________
block_2c_conv_shortcut (Conv2D) (None, 512, 64, 64) 262656 block_2b_relu[0][0]
__________________________________________________________________________________________________
block_2c_bn_3 (BatchNormalizati (None, 512, 64, 64) 2048 block_2c_conv_3[0][0]
__________________________________________________________________________________________________
block_2c_bn_shortcut (BatchNorm (None, 512, 64, 64) 2048 block_2c_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_6 (Add) (None, 512, 64, 64) 0 block_2c_bn_3[0][0]
block_2c_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_2c_relu (Activation) (None, 512, 64, 64) 0 add_6[0][0]
__________________________________________________________________________________________________
block_2d_conv_1 (Conv2D) (None, 128, 64, 64) 65664 block_2c_relu[0][0]
__________________________________________________________________________________________________
block_2d_bn_1 (BatchNormalizati (None, 128, 64, 64) 512 block_2d_conv_1[0][0]
__________________________________________________________________________________________________
block_2d_relu_1 (Activation) (None, 128, 64, 64) 0 block_2d_bn_1[0][0]
__________________________________________________________________________________________________
block_2d_conv_2 (Conv2D) (None, 128, 64, 64) 147584 block_2d_relu_1[0][0]
__________________________________________________________________________________________________
block_2d_bn_2 (BatchNormalizati (None, 128, 64, 64) 512 block_2d_conv_2[0][0]
__________________________________________________________________________________________________
block_2d_relu_2 (Activation) (None, 128, 64, 64) 0 block_2d_bn_2[0][0]
__________________________________________________________________________________________________
block_2d_conv_3 (Conv2D) (None, 512, 64, 64) 66048 block_2d_relu_2[0][0]
__________________________________________________________________________________________________
block_2d_conv_shortcut (Conv2D) (None, 512, 64, 64) 262656 block_2c_relu[0][0]
__________________________________________________________________________________________________
block_2d_bn_3 (BatchNormalizati (None, 512, 64, 64) 2048 block_2d_conv_3[0][0]
__________________________________________________________________________________________________
block_2d_bn_shortcut (BatchNorm (None, 512, 64, 64) 2048 block_2d_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_7 (Add) (None, 512, 64, 64) 0 block_2d_bn_3[0][0]
block_2d_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_2d_relu (Activation) (None, 512, 64, 64) 0 add_7[0][0]
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D) (None, 256, 32, 32) 131328 block_2d_relu[0][0]
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 32, 32) 1024 block_3a_conv_1[0][0]
__________________________________________________________________________________________________
block_3a_relu_1 (Activation) (None, 256, 32, 32) 0 block_3a_bn_1[0][0]
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D) (None, 256, 32, 32) 590080 block_3a_relu_1[0][0]
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 32, 32) 1024 block_3a_conv_2[0][0]
__________________________________________________________________________________________________
block_3a_relu_2 (Activation) (None, 256, 32, 32) 0 block_3a_bn_2[0][0]
__________________________________________________________________________________________________
block_3a_conv_3 (Conv2D) (None, 1024, 32, 32) 263168 block_3a_relu_2[0][0]
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 1024, 32, 32) 525312 block_2d_relu[0][0]
__________________________________________________________________________________________________
block_3a_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096 block_3a_conv_3[0][0]
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096 block_3a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_8 (Add) (None, 1024, 32, 32) 0 block_3a_bn_3[0][0]
block_3a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3a_relu (Activation) (None, 1024, 32, 32) 0 add_8[0][0]
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D) (None, 256, 32, 32) 262400 block_3a_relu[0][0]
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 32, 32) 1024 block_3b_conv_1[0][0]
__________________________________________________________________________________________________
block_3b_relu_1 (Activation) (None, 256, 32, 32) 0 block_3b_bn_1[0][0]
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D) (None, 256, 32, 32) 590080 block_3b_relu_1[0][0]
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 32, 32) 1024 block_3b_conv_2[0][0]
__________________________________________________________________________________________________
block_3b_relu_2 (Activation) (None, 256, 32, 32) 0 block_3b_bn_2[0][0]
__________________________________________________________________________________________________
block_3b_conv_3 (Conv2D) (None, 1024, 32, 32) 263168 block_3b_relu_2[0][0]
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600 block_3a_relu[0][0]
__________________________________________________________________________________________________
block_3b_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096 block_3b_conv_3[0][0]
__________________________________________________________________________________________________
block_3b_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096 block_3b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_9 (Add) (None, 1024, 32, 32) 0 block_3b_bn_3[0][0]
block_3b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3b_relu (Activation) (None, 1024, 32, 32) 0 add_9[0][0]
__________________________________________________________________________________________________
block_3c_conv_1 (Conv2D) (None, 256, 32, 32) 262400 block_3b_relu[0][0]
__________________________________________________________________________________________________
block_3c_bn_1 (BatchNormalizati (None, 256, 32, 32) 1024 block_3c_conv_1[0][0]
__________________________________________________________________________________________________
block_3c_relu_1 (Activation) (None, 256, 32, 32) 0 block_3c_bn_1[0][0]
__________________________________________________________________________________________________
block_3c_conv_2 (Conv2D) (None, 256, 32, 32) 590080 block_3c_relu_1[0][0]
__________________________________________________________________________________________________
block_3c_bn_2 (BatchNormalizati (None, 256, 32, 32) 1024 block_3c_conv_2[0][0]
__________________________________________________________________________________________________
block_3c_relu_2 (Activation) (None, 256, 32, 32) 0 block_3c_bn_2[0][0]
__________________________________________________________________________________________________
block_3c_conv_3 (Conv2D) (None, 1024, 32, 32) 263168 block_3c_relu_2[0][0]
__________________________________________________________________________________________________
block_3c_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600 block_3b_relu[0][0]
__________________________________________________________________________________________________
block_3c_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096 block_3c_conv_3[0][0]
__________________________________________________________________________________________________
block_3c_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096 block_3c_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_10 (Add) (None, 1024, 32, 32) 0 block_3c_bn_3[0][0]
block_3c_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3c_relu (Activation) (None, 1024, 32, 32) 0 add_10[0][0]
__________________________________________________________________________________________________
block_3d_conv_1 (Conv2D) (None, 256, 32, 32) 262400 block_3c_relu[0][0]
__________________________________________________________________________________________________
block_3d_bn_1 (BatchNormalizati (None, 256, 32, 32) 1024 block_3d_conv_1[0][0]
__________________________________________________________________________________________________
block_3d_relu_1 (Activation) (None, 256, 32, 32) 0 block_3d_bn_1[0][0]
__________________________________________________________________________________________________
block_3d_conv_2 (Conv2D) (None, 256, 32, 32) 590080 block_3d_relu_1[0][0]
__________________________________________________________________________________________________
block_3d_bn_2 (BatchNormalizati (None, 256, 32, 32) 1024 block_3d_conv_2[0][0]
__________________________________________________________________________________________________
block_3d_relu_2 (Activation) (None, 256, 32, 32) 0 block_3d_bn_2[0][0]
__________________________________________________________________________________________________
block_3d_conv_3 (Conv2D) (None, 1024, 32, 32) 263168 block_3d_relu_2[0][0]
__________________________________________________________________________________________________
block_3d_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600 block_3c_relu[0][0]
__________________________________________________________________________________________________
block_3d_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096 block_3d_conv_3[0][0]
__________________________________________________________________________________________________
block_3d_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096 block_3d_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_11 (Add) (None, 1024, 32, 32) 0 block_3d_bn_3[0][0]
block_3d_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3d_relu (Activation) (None, 1024, 32, 32) 0 add_11[0][0]
__________________________________________________________________________________________________
block_3e_conv_1 (Conv2D) (None, 256, 32, 32) 262400 block_3d_relu[0][0]
__________________________________________________________________________________________________
block_3e_bn_1 (BatchNormalizati (None, 256, 32, 32) 1024 block_3e_conv_1[0][0]
__________________________________________________________________________________________________
block_3e_relu_1 (Activation) (None, 256, 32, 32) 0 block_3e_bn_1[0][0]
__________________________________________________________________________________________________
block_3e_conv_2 (Conv2D) (None, 256, 32, 32) 590080 block_3e_relu_1[0][0]
__________________________________________________________________________________________________
block_3e_bn_2 (BatchNormalizati (None, 256, 32, 32) 1024 block_3e_conv_2[0][0]
__________________________________________________________________________________________________
block_3e_relu_2 (Activation) (None, 256, 32, 32) 0 block_3e_bn_2[0][0]
__________________________________________________________________________________________________
block_3e_conv_3 (Conv2D) (None, 1024, 32, 32) 263168 block_3e_relu_2[0][0]
__________________________________________________________________________________________________
block_3e_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600 block_3d_relu[0][0]
__________________________________________________________________________________________________
block_3e_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096 block_3e_conv_3[0][0]
__________________________________________________________________________________________________
block_3e_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096 block_3e_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_12 (Add) (None, 1024, 32, 32) 0 block_3e_bn_3[0][0]
block_3e_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3e_relu (Activation) (None, 1024, 32, 32) 0 add_12[0][0]
__________________________________________________________________________________________________
block_3f_conv_1 (Conv2D) (None, 256, 32, 32) 262400 block_3e_relu[0][0]
__________________________________________________________________________________________________
block_3f_bn_1 (BatchNormalizati (None, 256, 32, 32) 1024 block_3f_conv_1[0][0]
__________________________________________________________________________________________________
block_3f_relu_1 (Activation) (None, 256, 32, 32) 0 block_3f_bn_1[0][0]
__________________________________________________________________________________________________
block_3f_conv_2 (Conv2D) (None, 256, 32, 32) 590080 block_3f_relu_1[0][0]
__________________________________________________________________________________________________
block_3f_bn_2 (BatchNormalizati (None, 256, 32, 32) 1024 block_3f_conv_2[0][0]
__________________________________________________________________________________________________
block_3f_relu_2 (Activation) (None, 256, 32, 32) 0 block_3f_bn_2[0][0]
__________________________________________________________________________________________________
block_3f_conv_3 (Conv2D) (None, 1024, 32, 32) 263168 block_3f_relu_2[0][0]
__________________________________________________________________________________________________
block_3f_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600 block_3e_relu[0][0]
__________________________________________________________________________________________________
block_3f_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096 block_3f_conv_3[0][0]
__________________________________________________________________________________________________
block_3f_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096 block_3f_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_13 (Add) (None, 1024, 32, 32) 0 block_3f_bn_3[0][0]
block_3f_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3f_relu (Activation) (None, 1024, 32, 32) 0 add_13[0][0]
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D) (None, 512, 32, 32) 524800 block_3f_relu[0][0]
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 32, 32) 2048 block_4a_conv_1[0][0]
__________________________________________________________________________________________________
block_4a_relu_1 (Activation) (None, 512, 32, 32) 0 block_4a_bn_1[0][0]
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D) (None, 512, 32, 32) 2359808 block_4a_relu_1[0][0]
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 32, 32) 2048 block_4a_conv_2[0][0]
__________________________________________________________________________________________________
block_4a_relu_2 (Activation) (None, 512, 32, 32) 0 block_4a_bn_2[0][0]
__________________________________________________________________________________________________
block_4a_conv_3 (Conv2D) (None, 2048, 32, 32) 1050624 block_4a_relu_2[0][0]
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 2048, 32, 32) 2099200 block_3f_relu[0][0]
__________________________________________________________________________________________________
block_4a_bn_3 (BatchNormalizati (None, 2048, 32, 32) 8192 block_4a_conv_3[0][0]
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 2048, 32, 32) 8192 block_4a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_14 (Add) (None, 2048, 32, 32) 0 block_4a_bn_3[0][0]
block_4a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_4a_relu (Activation) (None, 2048, 32, 32) 0 add_14[0][0]
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D) (None, 512, 32, 32) 1049088 block_4a_relu[0][0]
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 32, 32) 2048 block_4b_conv_1[0][0]
__________________________________________________________________________________________________
block_4b_relu_1 (Activation) (None, 512, 32, 32) 0 block_4b_bn_1[0][0]
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D) (None, 512, 32, 32) 2359808 block_4b_relu_1[0][0]
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 32, 32) 2048 block_4b_conv_2[0][0]
__________________________________________________________________________________________________
block_4b_relu_2 (Activation) (None, 512, 32, 32) 0 block_4b_bn_2[0][0]
__________________________________________________________________________________________________
block_4b_conv_3 (Conv2D) (None, 2048, 32, 32) 1050624 block_4b_relu_2[0][0]
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (None, 2048, 32, 32) 4196352 block_4a_relu[0][0]
__________________________________________________________________________________________________
block_4b_bn_3 (BatchNormalizati (None, 2048, 32, 32) 8192 block_4b_conv_3[0][0]
__________________________________________________________________________________________________
block_4b_bn_shortcut (BatchNorm (None, 2048, 32, 32) 8192 block_4b_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_15 (Add) (None, 2048, 32, 32) 0 block_4b_bn_3[0][0]
block_4b_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_4b_relu (Activation) (None, 2048, 32, 32) 0 add_15[0][0]
__________________________________________________________________________________________________
block_4c_conv_1 (Conv2D) (None, 512, 32, 32) 1049088 block_4b_relu[0][0]
__________________________________________________________________________________________________
block_4c_bn_1 (BatchNormalizati (None, 512, 32, 32) 2048 block_4c_conv_1[0][0]
__________________________________________________________________________________________________
block_4c_relu_1 (Activation) (None, 512, 32, 32) 0 block_4c_bn_1[0][0]
__________________________________________________________________________________________________
block_4c_conv_2 (Conv2D) (None, 512, 32, 32) 2359808 block_4c_relu_1[0][0]
__________________________________________________________________________________________________
block_4c_bn_2 (BatchNormalizati (None, 512, 32, 32) 2048 block_4c_conv_2[0][0]
__________________________________________________________________________________________________
block_4c_relu_2 (Activation) (None, 512, 32, 32) 0 block_4c_bn_2[0][0]
__________________________________________________________________________________________________
block_4c_conv_3 (Conv2D) (None, 2048, 32, 32) 1050624 block_4c_relu_2[0][0]
__________________________________________________________________________________________________
block_4c_conv_shortcut (Conv2D) (None, 2048, 32, 32) 4196352 block_4b_relu[0][0]
__________________________________________________________________________________________________
block_4c_bn_3 (BatchNormalizati (None, 2048, 32, 32) 8192 block_4c_conv_3[0][0]
__________________________________________________________________________________________________
block_4c_bn_shortcut (BatchNorm (None, 2048, 32, 32) 8192 block_4c_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_16 (Add) (None, 2048, 32, 32) 0 block_4c_bn_3[0][0]
block_4c_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_4c_relu (Activation) (None, 2048, 32, 32) 0 add_16[0][0]
__________________________________________________________________________________________________
output_bbox (Conv2D) (None, 4, 32, 32) 8196 block_4c_relu[0][0]
__________________________________________________________________________________________________
output_cov (Conv2D) (None, 1, 32, 32) 2049 block_4c_relu[0][0]
==================================================================================================
Total params: 38,203,269
Trainable params: 37,772,165
Non-trainable params: 431,104
__________________________________________________________________________________________________
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 7073, number of sources: 1, batch size per gpu: 16, steps: 443
2021-01-16 15:17:11,811 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-01-16 15:17:11.842567: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:11.843447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
2021-01-16 15:17:11.843493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:17:11.843554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-16 15:17:11.843600: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-16 15:17:11.843634: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-16 15:17:11.843660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-16 15:17:11.843699: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-16 15:17:11.843729: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-16 15:17:11.843857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:11.844749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:11.845525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-01-16 15:17:12,067 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2021-01-16 15:17:12,073 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-01-16 15:17:12,073 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-01-16 15:17:12,573 [INFO] iva.detectnet_v2.scripts.train: Found 7073 samples in training set
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 4715, number of sources: 1, batch size per gpu: 16, steps: 295
2021-01-16 15:17:17,728 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-01-16 15:17:17,971 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2021-01-16 15:17:17,976 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-01-16 15:17:17,976 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-01-16 15:17:18,316 [INFO] iva.detectnet_v2.scripts.train: Found 4715 samples in validation set
2021-01-16 15:17:28.242674: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.243559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
2021-01-16 15:17:28.243622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:17:28.243721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-16 15:17:28.243780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-16 15:17:28.243809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-16 15:17:28.243834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-16 15:17:28.243865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-16 15:17:28.243889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-16 15:17:28.244021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.244898: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.245681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-01-16 15:17:28.716838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-16 15:17:28.716878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-01-16 15:17:28.716896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-01-16 15:17:28.717189: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.718163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.718966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13906 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2021-01-16 15:17:30.424008: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
[[{{node save/RestoreV2}}]]
(1) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_945]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1290, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_945]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'save/RestoreV2':
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
File "<decorator-gen-2>", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 147, in run_training_loop
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
self._sess = self._coordinated_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 638, in create_session
self._scaffold.finalize()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
self.build()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 878, in _build
build_restore=build_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
name="restore_shard"))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1300, in restore
names_to_keys = object_graph_key_mapping(save_path)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1618, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
File "<decorator-gen-2>", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 147, in run_training_loop
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
self._sess = self._coordinated_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 204, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1306, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_945]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'save/RestoreV2':
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
File "<decorator-gen-2>", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 147, in run_training_loop
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
self._sess = self._coordinated_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 638, in create_session
self._scaffold.finalize()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 229, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 599, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
self.build()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 840, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 878, in _build
build_restore=build_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
name="restore_shard"))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
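
The final error says the restore fails on `block_1a_bn_3/beta/Adam`, which is missing from the checkpoint being loaded. That fits the assumption that an older checkpoint is still sitting in the `-r` directory. A minimal sketch for checking this, which lists files in the results directory that look like checkpoint files. The file-name patterns are an assumption based on the usual TensorFlow 1.x `Saver` layout (`checkpoint`, `*.index`, `*.meta`, `*.data-*`), and `find_checkpoint_files` is a hypothetical helper, not part of TLT. The files `tlt-train` actually writes may be named differently.

```python
from pathlib import Path

# Assumed file-name patterns, following the usual TF1 Saver layout.
# The files tlt-train actually writes may differ.
CHECKPOINT_PATTERNS = ("checkpoint", "*.index", "*.meta", "*.data-*")


def find_checkpoint_files(results_dir):
    """Return the sorted names of files in results_dir that match the patterns."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    found = set()
    for pattern in CHECKPOINT_PATTERNS:
        for path in root.glob(pattern):
            if path.is_file():
                found.add(path.name)
    return sorted(found)


if __name__ == "__main__":
    # "./trained" is the -r directory from the tlt-train command above.
    for name in find_checkpoint_files("./trained"):
        print(name)
```

If this prints anything before a fresh run, the session is probably trying to resume from those files rather than starting from the pretrained model.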