DetectNet_V2 - ValueError: Cannot find a min overlap threshold for label

Hi, I’m trying to train DetectNet_V2 and I’m getting the following error.

Using TensorFlow backend.
2020-08-14 06:05:43.906278: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[[32830,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: dce1b73ec71f

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

2020-08-14 06:05:46.003764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-14 06:05:46.026135: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:05:46.026608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:09:00.0
2020-08-14 06:05:46.026629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 06:05:46.026670: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-14 06:05:46.027783: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-14 06:05:46.028100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-14 06:05:46.029427: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-14 06:05:46.030660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-14 06:05:46.030808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-14 06:05:46.031087: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:05:46.031718: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:05:46.032190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-14 06:05:46.032225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 06:05:46.647721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-14 06:05:46.647769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-08-14 06:05:46.647777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-08-14 06:05:46.648049: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:05:46.648525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:05:46.648976: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:05:46.649377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9831 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2020-08-14 06:05:46,650 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at detect_net_config.txt.
2020-08-14 06:05:46,652 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from detect_net_config.txt
2020-08-14 06:05:46,813 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 603 samples with a batch size of 24; each epoch will therefore take one extra step.


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) (None, 3, 608, 608) 0


conv1 (Conv2D) (None, 64, 304, 304) 9472 input_1[0][0]


bn_conv1 (BatchNormalization) (None, 64, 304, 304) 256 conv1[0][0]


activation_1 (Activation) (None, 64, 304, 304) 0 bn_conv1[0][0]


block_1a_conv_1 (Conv2D) (None, 64, 152, 152) 36928 activation_1[0][0]


block_1a_bn_1 (BatchNormalizati (None, 64, 152, 152) 256 block_1a_conv_1[0][0]


block_1a_relu_1 (Activation) (None, 64, 152, 152) 0 block_1a_bn_1[0][0]


block_1a_conv_2 (Conv2D) (None, 64, 152, 152) 36928 block_1a_relu_1[0][0]


block_1a_conv_shortcut (Conv2D) (None, 64, 152, 152) 4160 activation_1[0][0]


block_1a_bn_2 (BatchNormalizati (None, 64, 152, 152) 256 block_1a_conv_2[0][0]


block_1a_bn_shortcut (BatchNorm (None, 64, 152, 152) 256 block_1a_conv_shortcut[0][0]


add_1 (Add) (None, 64, 152, 152) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]


block_1a_relu (Activation) (None, 64, 152, 152) 0 add_1[0][0]


block_1b_conv_1 (Conv2D) (None, 64, 152, 152) 36928 block_1a_relu[0][0]


block_1b_bn_1 (BatchNormalizati (None, 64, 152, 152) 256 block_1b_conv_1[0][0]


block_1b_relu_1 (Activation) (None, 64, 152, 152) 0 block_1b_bn_1[0][0]


block_1b_conv_2 (Conv2D) (None, 64, 152, 152) 36928 block_1b_relu_1[0][0]


block_1b_bn_2 (BatchNormalizati (None, 64, 152, 152) 256 block_1b_conv_2[0][0]


add_2 (Add) (None, 64, 152, 152) 0 block_1b_bn_2[0][0]
block_1a_relu[0][0]


block_1b_relu (Activation) (None, 64, 152, 152) 0 add_2[0][0]


block_1c_conv_1 (Conv2D) (None, 64, 152, 152) 36928 block_1b_relu[0][0]


block_1c_bn_1 (BatchNormalizati (None, 64, 152, 152) 256 block_1c_conv_1[0][0]


block_1c_relu_1 (Activation) (None, 64, 152, 152) 0 block_1c_bn_1[0][0]


block_1c_conv_2 (Conv2D) (None, 64, 152, 152) 36928 block_1c_relu_1[0][0]


block_1c_bn_2 (BatchNormalizati (None, 64, 152, 152) 256 block_1c_conv_2[0][0]


add_3 (Add) (None, 64, 152, 152) 0 block_1c_bn_2[0][0]
block_1b_relu[0][0]


block_1c_relu (Activation) (None, 64, 152, 152) 0 add_3[0][0]


block_2a_conv_1 (Conv2D) (None, 128, 76, 76) 73856 block_1c_relu[0][0]


block_2a_bn_1 (BatchNormalizati (None, 128, 76, 76) 512 block_2a_conv_1[0][0]


block_2a_relu_1 (Activation) (None, 128, 76, 76) 0 block_2a_bn_1[0][0]


block_2a_conv_2 (Conv2D) (None, 128, 76, 76) 147584 block_2a_relu_1[0][0]


block_2a_conv_shortcut (Conv2D) (None, 128, 76, 76) 8320 block_1c_relu[0][0]


block_2a_bn_2 (BatchNormalizati (None, 128, 76, 76) 512 block_2a_conv_2[0][0]


block_2a_bn_shortcut (BatchNorm (None, 128, 76, 76) 512 block_2a_conv_shortcut[0][0]


add_4 (Add) (None, 128, 76, 76) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]


block_2a_relu (Activation) (None, 128, 76, 76) 0 add_4[0][0]


block_2b_conv_1 (Conv2D) (None, 128, 76, 76) 147584 block_2a_relu[0][0]


block_2b_bn_1 (BatchNormalizati (None, 128, 76, 76) 512 block_2b_conv_1[0][0]


block_2b_relu_1 (Activation) (None, 128, 76, 76) 0 block_2b_bn_1[0][0]


block_2b_conv_2 (Conv2D) (None, 128, 76, 76) 147584 block_2b_relu_1[0][0]


block_2b_bn_2 (BatchNormalizati (None, 128, 76, 76) 512 block_2b_conv_2[0][0]


add_5 (Add) (None, 128, 76, 76) 0 block_2b_bn_2[0][0]
block_2a_relu[0][0]


block_2b_relu (Activation) (None, 128, 76, 76) 0 add_5[0][0]


block_2c_conv_1 (Conv2D) (None, 128, 76, 76) 147584 block_2b_relu[0][0]


block_2c_bn_1 (BatchNormalizati (None, 128, 76, 76) 512 block_2c_conv_1[0][0]


block_2c_relu_1 (Activation) (None, 128, 76, 76) 0 block_2c_bn_1[0][0]


block_2c_conv_2 (Conv2D) (None, 128, 76, 76) 147584 block_2c_relu_1[0][0]


block_2c_bn_2 (BatchNormalizati (None, 128, 76, 76) 512 block_2c_conv_2[0][0]


add_6 (Add) (None, 128, 76, 76) 0 block_2c_bn_2[0][0]
block_2b_relu[0][0]


block_2c_relu (Activation) (None, 128, 76, 76) 0 add_6[0][0]


block_2d_conv_1 (Conv2D) (None, 128, 76, 76) 147584 block_2c_relu[0][0]


block_2d_bn_1 (BatchNormalizati (None, 128, 76, 76) 512 block_2d_conv_1[0][0]


block_2d_relu_1 (Activation) (None, 128, 76, 76) 0 block_2d_bn_1[0][0]


block_2d_conv_2 (Conv2D) (None, 128, 76, 76) 147584 block_2d_relu_1[0][0]


block_2d_bn_2 (BatchNormalizati (None, 128, 76, 76) 512 block_2d_conv_2[0][0]


add_7 (Add) (None, 128, 76, 76) 0 block_2d_bn_2[0][0]
block_2c_relu[0][0]


block_2d_relu (Activation) (None, 128, 76, 76) 0 add_7[0][0]


block_3a_conv_1 (Conv2D) (None, 256, 38, 38) 295168 block_2d_relu[0][0]


block_3a_bn_1 (BatchNormalizati (None, 256, 38, 38) 1024 block_3a_conv_1[0][0]


block_3a_relu_1 (Activation) (None, 256, 38, 38) 0 block_3a_bn_1[0][0]


block_3a_conv_2 (Conv2D) (None, 256, 38, 38) 590080 block_3a_relu_1[0][0]


block_3a_conv_shortcut (Conv2D) (None, 256, 38, 38) 33024 block_2d_relu[0][0]


block_3a_bn_2 (BatchNormalizati (None, 256, 38, 38) 1024 block_3a_conv_2[0][0]


block_3a_bn_shortcut (BatchNorm (None, 256, 38, 38) 1024 block_3a_conv_shortcut[0][0]


add_8 (Add) (None, 256, 38, 38) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]


block_3a_relu (Activation) (None, 256, 38, 38) 0 add_8[0][0]


block_3b_conv_1 (Conv2D) (None, 256, 38, 38) 590080 block_3a_relu[0][0]


block_3b_bn_1 (BatchNormalizati (None, 256, 38, 38) 1024 block_3b_conv_1[0][0]


block_3b_relu_1 (Activation) (None, 256, 38, 38) 0 block_3b_bn_1[0][0]


block_3b_conv_2 (Conv2D) (None, 256, 38, 38) 590080 block_3b_relu_1[0][0]


block_3b_bn_2 (BatchNormalizati (None, 256, 38, 38) 1024 block_3b_conv_2[0][0]


add_9 (Add) (None, 256, 38, 38) 0 block_3b_bn_2[0][0]
block_3a_relu[0][0]


block_3b_relu (Activation) (None, 256, 38, 38) 0 add_9[0][0]


block_3c_conv_1 (Conv2D) (None, 256, 38, 38) 590080 block_3b_relu[0][0]


block_3c_bn_1 (BatchNormalizati (None, 256, 38, 38) 1024 block_3c_conv_1[0][0]


block_3c_relu_1 (Activation) (None, 256, 38, 38) 0 block_3c_bn_1[0][0]


block_3c_conv_2 (Conv2D) (None, 256, 38, 38) 590080 block_3c_relu_1[0][0]


block_3c_bn_2 (BatchNormalizati (None, 256, 38, 38) 1024 block_3c_conv_2[0][0]


add_10 (Add) (None, 256, 38, 38) 0 block_3c_bn_2[0][0]
block_3b_relu[0][0]


block_3c_relu (Activation) (None, 256, 38, 38) 0 add_10[0][0]


block_3d_conv_1 (Conv2D) (None, 256, 38, 38) 590080 block_3c_relu[0][0]


block_3d_bn_1 (BatchNormalizati (None, 256, 38, 38) 1024 block_3d_conv_1[0][0]


block_3d_relu_1 (Activation) (None, 256, 38, 38) 0 block_3d_bn_1[0][0]


block_3d_conv_2 (Conv2D) (None, 256, 38, 38) 590080 block_3d_relu_1[0][0]


block_3d_bn_2 (BatchNormalizati (None, 256, 38, 38) 1024 block_3d_conv_2[0][0]


add_11 (Add) (None, 256, 38, 38) 0 block_3d_bn_2[0][0]
block_3c_relu[0][0]


block_3d_relu (Activation) (None, 256, 38, 38) 0 add_11[0][0]


block_3e_conv_1 (Conv2D) (None, 256, 38, 38) 590080 block_3d_relu[0][0]


block_3e_bn_1 (BatchNormalizati (None, 256, 38, 38) 1024 block_3e_conv_1[0][0]


block_3e_relu_1 (Activation) (None, 256, 38, 38) 0 block_3e_bn_1[0][0]


block_3e_conv_2 (Conv2D) (None, 256, 38, 38) 590080 block_3e_relu_1[0][0]


block_3e_bn_2 (BatchNormalizati (None, 256, 38, 38) 1024 block_3e_conv_2[0][0]


add_12 (Add) (None, 256, 38, 38) 0 block_3e_bn_2[0][0]
block_3d_relu[0][0]


block_3e_relu (Activation) (None, 256, 38, 38) 0 add_12[0][0]


block_3f_conv_1 (Conv2D) (None, 256, 38, 38) 590080 block_3e_relu[0][0]


block_3f_bn_1 (BatchNormalizati (None, 256, 38, 38) 1024 block_3f_conv_1[0][0]


block_3f_relu_1 (Activation) (None, 256, 38, 38) 0 block_3f_bn_1[0][0]


block_3f_conv_2 (Conv2D) (None, 256, 38, 38) 590080 block_3f_relu_1[0][0]


block_3f_bn_2 (BatchNormalizati (None, 256, 38, 38) 1024 block_3f_conv_2[0][0]


add_13 (Add) (None, 256, 38, 38) 0 block_3f_bn_2[0][0]
block_3e_relu[0][0]


block_3f_relu (Activation) (None, 256, 38, 38) 0 add_13[0][0]


block_4a_conv_1 (Conv2D) (None, 512, 38, 38) 1180160 block_3f_relu[0][0]


block_4a_bn_1 (BatchNormalizati (None, 512, 38, 38) 2048 block_4a_conv_1[0][0]


block_4a_relu_1 (Activation) (None, 512, 38, 38) 0 block_4a_bn_1[0][0]


block_4a_conv_2 (Conv2D) (None, 512, 38, 38) 2359808 block_4a_relu_1[0][0]


block_4a_conv_shortcut (Conv2D) (None, 512, 38, 38) 131584 block_3f_relu[0][0]


block_4a_bn_2 (BatchNormalizati (None, 512, 38, 38) 2048 block_4a_conv_2[0][0]


block_4a_bn_shortcut (BatchNorm (None, 512, 38, 38) 2048 block_4a_conv_shortcut[0][0]


add_14 (Add) (None, 512, 38, 38) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]


block_4a_relu (Activation) (None, 512, 38, 38) 0 add_14[0][0]


block_4b_conv_1 (Conv2D) (None, 512, 38, 38) 2359808 block_4a_relu[0][0]


block_4b_bn_1 (BatchNormalizati (None, 512, 38, 38) 2048 block_4b_conv_1[0][0]


block_4b_relu_1 (Activation) (None, 512, 38, 38) 0 block_4b_bn_1[0][0]


block_4b_conv_2 (Conv2D) (None, 512, 38, 38) 2359808 block_4b_relu_1[0][0]


block_4b_bn_2 (BatchNormalizati (None, 512, 38, 38) 2048 block_4b_conv_2[0][0]


add_15 (Add) (None, 512, 38, 38) 0 block_4b_bn_2[0][0]
block_4a_relu[0][0]


block_4b_relu (Activation) (None, 512, 38, 38) 0 add_15[0][0]


block_4c_conv_1 (Conv2D) (None, 512, 38, 38) 2359808 block_4b_relu[0][0]


block_4c_bn_1 (BatchNormalizati (None, 512, 38, 38) 2048 block_4c_conv_1[0][0]


block_4c_relu_1 (Activation) (None, 512, 38, 38) 0 block_4c_bn_1[0][0]


block_4c_conv_2 (Conv2D) (None, 512, 38, 38) 2359808 block_4c_relu_1[0][0]


block_4c_bn_2 (BatchNormalizati (None, 512, 38, 38) 2048 block_4c_conv_2[0][0]


add_16 (Add) (None, 512, 38, 38) 0 block_4c_bn_2[0][0]
block_4b_relu[0][0]


block_4c_relu (Activation) (None, 512, 38, 38) 0 add_16[0][0]


output_bbox (Conv2D) (None, 12, 38, 38) 6156 block_4c_relu[0][0]


output_cov (Conv2D) (None, 3, 38, 38) 1539 block_4c_relu[0][0]

Total params: 21,322,319
Trainable params: 21,295,695
Non-trainable params: 26,624


2020-08-14 06:06:08,625 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-08-14 06:06:08,625 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-08-14 06:06:08,625 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-08-14 06:06:08,625 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2020-08-14 06:06:08,625 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 603, number of sources: 1, batch size per gpu: 24, steps: 26
2020-08-14 06:06:08,706 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-08-14 06:06:08.729691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:06:08.730180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:09:00.0
2020-08-14 06:06:08.730200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 06:06:08.730227: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-14 06:06:08.730245: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-14 06:06:08.730260: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-14 06:06:08.730273: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-14 06:06:08.730287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-14 06:06:08.730299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-14 06:06:08.730367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:06:08.730798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:06:08.731168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-14 06:06:08,893 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-08-14 06:06:08,897 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-08-14 06:06:08,897 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-08-14 06:06:09,256 [INFO] iva.detectnet_v2.scripts.train: Found 603 samples in training set
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
File "", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 603, in train_gridbox
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/evaluation/evaluation_config.py", line 49, in build_evaluation_config
ValueError: Cannot find a min overlap threshold for with_mask
root@dce1b73ec71f:/workspace/tlt-experiments# tlt-train detectnet_v2 -e detect_net_config.txt -r output_dir/ -k tlt_encode --gpus 1 > trainout.txt
Using TensorFlow backend.
2020-08-14 06:07:18.374098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[[32898,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: dce1b73ec71f

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

2020-08-14 06:07:20.432041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-14 06:07:20.455369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:20.456084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:09:00.0
2020-08-14 06:07:20.456113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 06:07:20.456164: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-14 06:07:20.457476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-14 06:07:20.457724: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-14 06:07:20.459111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-14 06:07:20.459944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-14 06:07:20.459988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-14 06:07:20.460095: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:20.460581: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:20.460994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-14 06:07:20.461020: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 06:07:21.073918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-14 06:07:21.073966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-08-14 06:07:21.073974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-08-14 06:07:21.074230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:21.074695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:21.075136: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:21.075533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9831 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2020-08-14 06:07:21,076 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at detect_net_config.txt.
2020-08-14 06:07:21,077 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from detect_net_config.txt
2020-08-14 06:07:21,240 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 603 samples with a batch size of 24; each epoch will therefore take one extra step.
2020-08-14 06:07:43,190 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-08-14 06:07:43,190 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-08-14 06:07:43,190 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-08-14 06:07:43,190 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2020-08-14 06:07:43,190 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 603, number of sources: 1, batch size per gpu: 24, steps: 26
2020-08-14 06:07:43,267 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-08-14 06:07:43.290287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:43.290827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:09:00.0
2020-08-14 06:07:43.290848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 06:07:43.290872: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-14 06:07:43.290889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-14 06:07:43.290903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-14 06:07:43.290916: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-14 06:07:43.290928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-14 06:07:43.290940: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-14 06:07:43.291006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:43.291437: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 06:07:43.291819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-14 06:07:43,446 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-08-14 06:07:43,450 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-08-14 06:07:43,451 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-08-14 06:07:43,805 [INFO] iva.detectnet_v2.scripts.train: Found 603 samples in training set
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
File "", line 2, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 603, in train_gridbox
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/evaluation/evaluation_config.py", line 49, in build_evaluation_config
ValueError: Cannot find a min overlap threshold for with_mask

I am attaching the config file below.

detect_net_config.txt (4.3 KB)

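Your spec is missing the evaluation_config section, which is where the minimum overlap threshold for each class (here with_mask) is defined.
Please refer to the sample specs inside the docker.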
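
For illustration, here is a rough sketch of what such a section might look like, plus a small check that lists classes without a threshold. The field names follow the DetectNet_v2 sample specs as I remember them, so please verify them against the specs in the docker. The numeric values are placeholders, and the helper classes_missing_overlap is hypothetical (it is not part of TLT). Only with_mask is taken from the error above; add one entry per class in your own spec.

```python
import re

# Assumed fragment of a DetectNet_v2 spec. The 0.5 threshold, the epoch
# settings and the box-size limits are placeholder values, not recommendations.
SPEC_FRAGMENT = '''
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "with_mask"
    value: 0.5
  }
  evaluation_box_config {
    key: "with_mask"
    value {
      minimum_height: 4
      minimum_width: 4
      maximum_height: 9999
      maximum_width: 9999
    }
  }
}
'''

# Matches one minimum_detection_ground_truth_overlap { key: "..." value: ... } entry.
OVERLAP_ENTRY = re.compile(
    r'minimum_detection_ground_truth_overlap\s*\{\s*'
    r'key:\s*"([^"]+)"\s*value:\s*([0-9.]+)\s*\}'
)


def classes_missing_overlap(spec_text, class_names):
    """Hypothetical helper: return class names with no min-overlap entry."""
    found = {key for key, _ in OVERLAP_ENTRY.findall(spec_text)}
    return [name for name in class_names if name not in found]


if __name__ == "__main__":
    # Classes still lacking a threshold in the fragment above.
    print(classes_missing_overlap(SPEC_FRAGMENT, ["with_mask"]))
    # A spec with no evaluation_config at all, as in the failing run.
    print(classes_missing_overlap("", ["with_mask"]))
```

Running the same check over your own detect_net_config.txt with your full list of target classes should show which ones still need an entry before you launch tlt-train again.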