Errors in Training, 0 or NaN mAP, Low Loss, Tutorial Config

Sneaky_Turtle · January 16, 2021, 3:08pm

I am working on an Amazon AWS EC2 instance. My dataset is Caltech-Birds-201, padded to 512x512. My loss values are extremely small (on the order of 0.00001), my mAP starts as NaN and eventually becomes 0, and I get output like this:

Validation cost: -0.000010
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
bird                                   0

Median Inference Time: 0.014849
2021-01-16 13:01:12,535 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 71/80: loss: 0.00001 Time taken: 0:05:19.982588 ETA: 0:47:59.843290
2021-01-16 13:01:24,731 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 4.704

79/80th Epoch:

2021-01-16 13:44:47,171 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 79/80: loss: 0.00001 Time taken: 0:04:11.581900 ETA: 0:04:11.581900

KITTI Config:

kitti_config {
root_directory_path: "/data"
image_dir_name: "images"
label_dir_name: "labels"
image_extension: ".jpeg"
partition_mode: "random"
num_partitions: 2
val_split: 20
num_shards: 10
}
image_directory_path: "/data/images"

Example KITTI label:

bird 0.0 0 0.0 60.0 27.0 385.0 331.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
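
For reference, a KITTI label line has 15 whitespace-separated fields, and the 2D box (left, top, right, bottom, in pixels) sits in fields 5 to 8. The following is a minimal sketch, not part of TLT, for sanity-checking that a label like the one above fits inside the padded 512x512 image. The names KittiBox, parse_kitti_line and fits_in_image are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple


class KittiBox(NamedTuple):
    """Class name plus the 2D bounding box of one KITTI label line."""
    cls: str
    left: float
    top: float
    right: float
    bottom: float


def parse_kitti_line(line: str) -> KittiBox:
    """Parse one KITTI label line (hypothetical helper, not a TLT API)."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    # Fields 5-8 (indices 4..7) hold left, top, right, bottom in pixels.
    left, top, right, bottom = (float(v) for v in fields[4:8])
    return KittiBox(fields[0], left, top, right, bottom)


def fits_in_image(box: KittiBox, width: int, height: int) -> bool:
    """True if the box is non-empty and lies fully inside the image."""
    return (0 <= box.left < box.right <= width
            and 0 <= box.top < box.bottom <= height)


box = parse_kitti_line(
    "bird 0.0 0 0.0 60.0 27.0 385.0 331.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
)
print(box, fits_in_image(box, 512, 512))
```

Running a check like this over every label file is one way to catch boxes that were not shifted or scaled along with the images.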

Spec File:

model_config {
arch: "resnet"
pretrained_model_file: "/data/tlt_resnet50_detectnetv2_v1/resnet50.hdf5"
freeze_blocks: 0
freeze_blocks: 1
all_projections: True
num_layers: 18
use_pooling: False
use_batch_norm: True
dropout_rate: 0.0
training_precision: {
backend_floatx: FLOAT32
}
objective_set: {
cov {}
bbox {
scale: 35.0
offset: 0.5
}
}
}

bbox_rasterizer_config {
target_class_config {
key: "bird"
value: {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.67
}

postprocessing_config {
target_class_config {
key: "bird"
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 20
}
}
}
}

cost_function_config {
target_classes {
name: "bird"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: True
max_objective_weight: 0.9999
min_objective_weight: 0.0001
}

training_config {
batch_size_per_gpu: 16
num_epochs: 80
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-6
max_learning_rate: 5e-4
soft_start: 0.1
annealing: 0.7
}
}
regularizer {
type: L1
weight: 3e-9
}
optimizer {
adam {
epsilon: 1e-08
beta1: 0.9
beta2: 0.999
}
}
cost_scaling {
enabled: False
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
}

augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
color_shift_stddev: 0.0
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.1
contrast_center: 0.5
}
}

evaluation_config {
average_precision_mode: INTEGRATE
validation_period_during_training: 10
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: "bird"
value: 0.7
}

evaluation_box_config {
key: "bird"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
}

dataset_config {
data_sources: {
tfrecords_path: "/data/tfrecords/*"
image_directory_path: "/data/images/"
}
image_extension: "jpg"
target_class_mapping {
key: "bird"
value: "bird"
}
validation_fold: 0
}

I am thinking it has to do with my specfile. I am rather new to using the TLT and trying to learn using my own single class dataset. Can someone have a look and advise me as to the best course of action to solve this?

Morganh · January 16, 2021, 3:12pm

Please change the preprocessing output size in your augmentation_config to match your 512x512 images:
output_image_width: 512
output_image_height: 512

Reference: https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/text/supported_model_architectures.html#detectnet-v2

The tlt-train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly.
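
As a rough illustration of "scaled accordingly", here is a minimal sketch that rescales the bounding box in a KITTI label line when an image is resized offline. It is not a TLT tool. The helper names scale_bbox and rescale_kitti_line, the 1024x1024 source size, and the sample label are all assumptions made up for the example, and the image itself would still need to be resized separately with an image library.

```python
def scale_bbox(bbox, src_size, dst_size):
    """Scale (left, top, right, bottom) from a src (w, h) image to dst (w, h)."""
    left, top, right, bottom = bbox
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    return (left * sx, top * sy, right * sx, bottom * sy)


def rescale_kitti_line(line, src_size, dst_size):
    """Return the KITTI label line with its 2D box (fields 5-8) rescaled."""
    fields = line.split()
    bbox = tuple(float(v) for v in fields[4:8])
    fields[4:8] = [f"{v:.2f}" for v in scale_bbox(bbox, src_size, dst_size)]
    return " ".join(fields)


# Hypothetical label from a 1024x1024 image, resized down to 512x512.
original = "bird 0.0 0 0.0 120.0 54.0 770.0 662.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
print(rescale_kitti_line(original, (1024, 1024), (512, 512)))
```

If the images were padded rather than resized, the boxes need to be offset by the padding added on the left and top instead of multiplied by a scale factor.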

Sneaky_Turtle · January 16, 2021, 3:19pm

@Morganh I followed your instructions. I assume this is because of a checkpoint I forgot to delete, but have a look:

root@eda82919eac9:/data# tlt-train detectnet_v2 -e ./train.txt -r ./trained -k KEY
Using TensorFlow backend.
2021-01-16 15:16:05.507174: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
--------------------------------------------------------------------------
[[15188,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: eda82919eac9

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2021-01-16 15:16:08.114791: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-01-16 15:16:08.138756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.139642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
2021-01-16 15:16:08.139681: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:16:08.139761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-16 15:16:08.140974: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-16 15:16:08.141370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-16 15:16:08.143093: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-16 15:16:08.144344: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-16 15:16:08.144428: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-16 15:16:08.144561: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.145483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.146308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-01-16 15:16:08.146355: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:16:08.901755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-16 15:16:08.901806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-01-16 15:16:08.901822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-01-16 15:16:08.902105: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.903116: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.904017: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:16:08.904840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13906 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2021-01-16 15:16:08,905 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at ./train.txt.
2021-01-16 15:16:08,907 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from ./train.txt
2021-01-16 15:16:09,585 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 7073 samples with a batch size of 16; each epoch will therefore take one extra step.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 512, 512)  0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 256, 256) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 256, 256) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 256, 256) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 128, 128) 4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 128, 128) 256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (None, 64, 128, 128) 0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 128, 128) 36928       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 128, 128) 256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_relu_2 (Activation)    (None, 64, 128, 128) 0           block_1a_bn_2[0][0]              
__________________________________________________________________________________________________
block_1a_conv_3 (Conv2D)        (None, 256, 128, 128 16640       block_1a_relu_2[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 256, 128, 128 16640       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_3 (BatchNormalizati (None, 256, 128, 128 1024        block_1a_conv_3[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 256, 128, 128 1024        block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 256, 128, 128 0           block_1a_bn_3[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1a_relu (Activation)      (None, 256, 128, 128 0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 128, 128) 16448       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 128, 128) 256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
block_1b_relu_1 (Activation)    (None, 64, 128, 128) 0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 128, 128) 36928       block_1b_relu_1[0][0]            
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 128, 128) 256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
block_1b_relu_2 (Activation)    (None, 64, 128, 128) 0           block_1b_bn_2[0][0]              
__________________________________________________________________________________________________
block_1b_conv_3 (Conv2D)        (None, 256, 128, 128 16640       block_1b_relu_2[0][0]            
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (None, 256, 128, 128 65792       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_3 (BatchNormalizati (None, 256, 128, 128 1024        block_1b_conv_3[0][0]            
__________________________________________________________________________________________________
block_1b_bn_shortcut (BatchNorm (None, 256, 128, 128 1024        block_1b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (None, 256, 128, 128 0           block_1b_bn_3[0][0]              
                                                                 block_1b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1b_relu (Activation)      (None, 256, 128, 128 0           add_2[0][0]                      
__________________________________________________________________________________________________
block_1c_conv_1 (Conv2D)        (None, 64, 128, 128) 16448       block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_1c_bn_1 (BatchNormalizati (None, 64, 128, 128) 256         block_1c_conv_1[0][0]            
__________________________________________________________________________________________________
block_1c_relu_1 (Activation)    (None, 64, 128, 128) 0           block_1c_bn_1[0][0]              
__________________________________________________________________________________________________
block_1c_conv_2 (Conv2D)        (None, 64, 128, 128) 36928       block_1c_relu_1[0][0]            
__________________________________________________________________________________________________
block_1c_bn_2 (BatchNormalizati (None, 64, 128, 128) 256         block_1c_conv_2[0][0]            
__________________________________________________________________________________________________
block_1c_relu_2 (Activation)    (None, 64, 128, 128) 0           block_1c_bn_2[0][0]              
__________________________________________________________________________________________________
block_1c_conv_3 (Conv2D)        (None, 256, 128, 128 16640       block_1c_relu_2[0][0]            
__________________________________________________________________________________________________
block_1c_conv_shortcut (Conv2D) (None, 256, 128, 128 65792       block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_1c_bn_3 (BatchNormalizati (None, 256, 128, 128 1024        block_1c_conv_3[0][0]            
__________________________________________________________________________________________________
block_1c_bn_shortcut (BatchNorm (None, 256, 128, 128 1024        block_1c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 256, 128, 128 0           block_1c_bn_3[0][0]              
                                                                 block_1c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1c_relu (Activation)      (None, 256, 128, 128 0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 64, 64)  32896       block_1c_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 64, 64)  512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (None, 128, 64, 64)  0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 64, 64)  147584      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 64, 64)  512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_relu_2 (Activation)    (None, 128, 64, 64)  0           block_2a_bn_2[0][0]              
__________________________________________________________________________________________________
block_2a_conv_3 (Conv2D)        (None, 512, 64, 64)  66048       block_2a_relu_2[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 512, 64, 64)  131584      block_1c_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_3 (BatchNormalizati (None, 512, 64, 64)  2048        block_2a_conv_3[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 512, 64, 64)  2048        block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (None, 512, 64, 64)  0           block_2a_bn_3[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2a_relu (Activation)      (None, 512, 64, 64)  0           add_4[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 64, 64)  65664       block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 64, 64)  512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
block_2b_relu_1 (Activation)    (None, 128, 64, 64)  0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 64, 64)  147584      block_2b_relu_1[0][0]            
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 64, 64)  512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
block_2b_relu_2 (Activation)    (None, 128, 64, 64)  0           block_2b_bn_2[0][0]              
__________________________________________________________________________________________________
block_2b_conv_3 (Conv2D)        (None, 512, 64, 64)  66048       block_2b_relu_2[0][0]            
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (None, 512, 64, 64)  262656      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_3 (BatchNormalizati (None, 512, 64, 64)  2048        block_2b_conv_3[0][0]            
__________________________________________________________________________________________________
block_2b_bn_shortcut (BatchNorm (None, 512, 64, 64)  2048        block_2b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 512, 64, 64)  0           block_2b_bn_3[0][0]              
                                                                 block_2b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2b_relu (Activation)      (None, 512, 64, 64)  0           add_5[0][0]                      
__________________________________________________________________________________________________
block_2c_conv_1 (Conv2D)        (None, 128, 64, 64)  65664       block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_2c_bn_1 (BatchNormalizati (None, 128, 64, 64)  512         block_2c_conv_1[0][0]            
__________________________________________________________________________________________________
block_2c_relu_1 (Activation)    (None, 128, 64, 64)  0           block_2c_bn_1[0][0]              
__________________________________________________________________________________________________
block_2c_conv_2 (Conv2D)        (None, 128, 64, 64)  147584      block_2c_relu_1[0][0]            
__________________________________________________________________________________________________
block_2c_bn_2 (BatchNormalizati (None, 128, 64, 64)  512         block_2c_conv_2[0][0]            
__________________________________________________________________________________________________
block_2c_relu_2 (Activation)    (None, 128, 64, 64)  0           block_2c_bn_2[0][0]              
__________________________________________________________________________________________________
block_2c_conv_3 (Conv2D)        (None, 512, 64, 64)  66048       block_2c_relu_2[0][0]            
__________________________________________________________________________________________________
block_2c_conv_shortcut (Conv2D) (None, 512, 64, 64)  262656      block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_2c_bn_3 (BatchNormalizati (None, 512, 64, 64)  2048        block_2c_conv_3[0][0]            
__________________________________________________________________________________________________
block_2c_bn_shortcut (BatchNorm (None, 512, 64, 64)  2048        block_2c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_6 (Add)                     (None, 512, 64, 64)  0           block_2c_bn_3[0][0]              
                                                                 block_2c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2c_relu (Activation)      (None, 512, 64, 64)  0           add_6[0][0]                      
__________________________________________________________________________________________________
block_2d_conv_1 (Conv2D)        (None, 128, 64, 64)  65664       block_2c_relu[0][0]              
__________________________________________________________________________________________________
block_2d_bn_1 (BatchNormalizati (None, 128, 64, 64)  512         block_2d_conv_1[0][0]            
__________________________________________________________________________________________________
block_2d_relu_1 (Activation)    (None, 128, 64, 64)  0           block_2d_bn_1[0][0]              
__________________________________________________________________________________________________
block_2d_conv_2 (Conv2D)        (None, 128, 64, 64)  147584      block_2d_relu_1[0][0]            
__________________________________________________________________________________________________
block_2d_bn_2 (BatchNormalizati (None, 128, 64, 64)  512         block_2d_conv_2[0][0]            
__________________________________________________________________________________________________
block_2d_relu_2 (Activation)    (None, 128, 64, 64)  0           block_2d_bn_2[0][0]              
__________________________________________________________________________________________________
block_2d_conv_3 (Conv2D)        (None, 512, 64, 64)  66048       block_2d_relu_2[0][0]            
__________________________________________________________________________________________________
block_2d_conv_shortcut (Conv2D) (None, 512, 64, 64)  262656      block_2c_relu[0][0]              
__________________________________________________________________________________________________
block_2d_bn_3 (BatchNormalizati (None, 512, 64, 64)  2048        block_2d_conv_3[0][0]            
__________________________________________________________________________________________________
block_2d_bn_shortcut (BatchNorm (None, 512, 64, 64)  2048        block_2d_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 64, 64)  0           block_2d_bn_3[0][0]              
                                                                 block_2d_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2d_relu (Activation)      (None, 512, 64, 64)  0           add_7[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 32, 32)  131328      block_2d_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 32, 32)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (None, 256, 32, 32)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 32, 32)  590080      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 32, 32)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_relu_2 (Activation)    (None, 256, 32, 32)  0           block_3a_bn_2[0][0]              
__________________________________________________________________________________________________
block_3a_conv_3 (Conv2D)        (None, 1024, 32, 32) 263168      block_3a_relu_2[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 1024, 32, 32) 525312      block_2d_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096        block_3a_conv_3[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_8 (Add)                     (None, 1024, 32, 32) 0           block_3a_bn_3[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3a_relu (Activation)      (None, 1024, 32, 32) 0           add_8[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 32, 32)  262400      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 32, 32)  1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (None, 256, 32, 32)  0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 32, 32)  590080      block_3b_relu_1[0][0]            
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 32, 32)  1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
block_3b_relu_2 (Activation)    (None, 256, 32, 32)  0           block_3b_bn_2[0][0]              
__________________________________________________________________________________________________
block_3b_conv_3 (Conv2D)        (None, 1024, 32, 32) 263168      block_3b_relu_2[0][0]            
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600     block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096        block_3b_conv_3[0][0]            
__________________________________________________________________________________________________
block_3b_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096        block_3b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_9 (Add)                     (None, 1024, 32, 32) 0           block_3b_bn_3[0][0]              
                                                                 block_3b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3b_relu (Activation)      (None, 1024, 32, 32) 0           add_9[0][0]                      
__________________________________________________________________________________________________
block_3c_conv_1 (Conv2D)        (None, 256, 32, 32)  262400      block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_3c_bn_1 (BatchNormalizati (None, 256, 32, 32)  1024        block_3c_conv_1[0][0]            
__________________________________________________________________________________________________
block_3c_relu_1 (Activation)    (None, 256, 32, 32)  0           block_3c_bn_1[0][0]              
__________________________________________________________________________________________________
block_3c_conv_2 (Conv2D)        (None, 256, 32, 32)  590080      block_3c_relu_1[0][0]            
__________________________________________________________________________________________________
block_3c_bn_2 (BatchNormalizati (None, 256, 32, 32)  1024        block_3c_conv_2[0][0]            
__________________________________________________________________________________________________
block_3c_relu_2 (Activation)    (None, 256, 32, 32)  0           block_3c_bn_2[0][0]              
__________________________________________________________________________________________________
block_3c_conv_3 (Conv2D)        (None, 1024, 32, 32) 263168      block_3c_relu_2[0][0]            
__________________________________________________________________________________________________
block_3c_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600     block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_3c_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096        block_3c_conv_3[0][0]            
__________________________________________________________________________________________________
block_3c_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096        block_3c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_10 (Add)                    (None, 1024, 32, 32) 0           block_3c_bn_3[0][0]              
                                                                 block_3c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3c_relu (Activation)      (None, 1024, 32, 32) 0           add_10[0][0]                     
__________________________________________________________________________________________________
block_3d_conv_1 (Conv2D)        (None, 256, 32, 32)  262400      block_3c_relu[0][0]              
__________________________________________________________________________________________________
block_3d_bn_1 (BatchNormalizati (None, 256, 32, 32)  1024        block_3d_conv_1[0][0]            
__________________________________________________________________________________________________
block_3d_relu_1 (Activation)    (None, 256, 32, 32)  0           block_3d_bn_1[0][0]              
__________________________________________________________________________________________________
block_3d_conv_2 (Conv2D)        (None, 256, 32, 32)  590080      block_3d_relu_1[0][0]            
__________________________________________________________________________________________________
block_3d_bn_2 (BatchNormalizati (None, 256, 32, 32)  1024        block_3d_conv_2[0][0]            
__________________________________________________________________________________________________
block_3d_relu_2 (Activation)    (None, 256, 32, 32)  0           block_3d_bn_2[0][0]              
__________________________________________________________________________________________________
block_3d_conv_3 (Conv2D)        (None, 1024, 32, 32) 263168      block_3d_relu_2[0][0]            
__________________________________________________________________________________________________
block_3d_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600     block_3c_relu[0][0]              
__________________________________________________________________________________________________
block_3d_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096        block_3d_conv_3[0][0]            
__________________________________________________________________________________________________
block_3d_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096        block_3d_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_11 (Add)                    (None, 1024, 32, 32) 0           block_3d_bn_3[0][0]              
                                                                 block_3d_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3d_relu (Activation)      (None, 1024, 32, 32) 0           add_11[0][0]                     
__________________________________________________________________________________________________
block_3e_conv_1 (Conv2D)        (None, 256, 32, 32)  262400      block_3d_relu[0][0]              
__________________________________________________________________________________________________
block_3e_bn_1 (BatchNormalizati (None, 256, 32, 32)  1024        block_3e_conv_1[0][0]            
__________________________________________________________________________________________________
block_3e_relu_1 (Activation)    (None, 256, 32, 32)  0           block_3e_bn_1[0][0]              
__________________________________________________________________________________________________
block_3e_conv_2 (Conv2D)        (None, 256, 32, 32)  590080      block_3e_relu_1[0][0]            
__________________________________________________________________________________________________
block_3e_bn_2 (BatchNormalizati (None, 256, 32, 32)  1024        block_3e_conv_2[0][0]            
__________________________________________________________________________________________________
block_3e_relu_2 (Activation)    (None, 256, 32, 32)  0           block_3e_bn_2[0][0]              
__________________________________________________________________________________________________
block_3e_conv_3 (Conv2D)        (None, 1024, 32, 32) 263168      block_3e_relu_2[0][0]            
__________________________________________________________________________________________________
block_3e_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600     block_3d_relu[0][0]              
__________________________________________________________________________________________________
block_3e_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096        block_3e_conv_3[0][0]            
__________________________________________________________________________________________________
block_3e_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096        block_3e_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_12 (Add)                    (None, 1024, 32, 32) 0           block_3e_bn_3[0][0]              
                                                                 block_3e_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3e_relu (Activation)      (None, 1024, 32, 32) 0           add_12[0][0]                     
__________________________________________________________________________________________________
block_3f_conv_1 (Conv2D)        (None, 256, 32, 32)  262400      block_3e_relu[0][0]              
__________________________________________________________________________________________________
block_3f_bn_1 (BatchNormalizati (None, 256, 32, 32)  1024        block_3f_conv_1[0][0]            
__________________________________________________________________________________________________
block_3f_relu_1 (Activation)    (None, 256, 32, 32)  0           block_3f_bn_1[0][0]              
__________________________________________________________________________________________________
block_3f_conv_2 (Conv2D)        (None, 256, 32, 32)  590080      block_3f_relu_1[0][0]            
__________________________________________________________________________________________________
block_3f_bn_2 (BatchNormalizati (None, 256, 32, 32)  1024        block_3f_conv_2[0][0]            
__________________________________________________________________________________________________
block_3f_relu_2 (Activation)    (None, 256, 32, 32)  0           block_3f_bn_2[0][0]              
__________________________________________________________________________________________________
block_3f_conv_3 (Conv2D)        (None, 1024, 32, 32) 263168      block_3f_relu_2[0][0]            
__________________________________________________________________________________________________
block_3f_conv_shortcut (Conv2D) (None, 1024, 32, 32) 1049600     block_3e_relu[0][0]              
__________________________________________________________________________________________________
block_3f_bn_3 (BatchNormalizati (None, 1024, 32, 32) 4096        block_3f_conv_3[0][0]            
__________________________________________________________________________________________________
block_3f_bn_shortcut (BatchNorm (None, 1024, 32, 32) 4096        block_3f_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_13 (Add)                    (None, 1024, 32, 32) 0           block_3f_bn_3[0][0]              
                                                                 block_3f_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3f_relu (Activation)      (None, 1024, 32, 32) 0           add_13[0][0]                     
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 32, 32)  524800      block_3f_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 32, 32)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (None, 512, 32, 32)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 32, 32)  2359808     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 32, 32)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_relu_2 (Activation)    (None, 512, 32, 32)  0           block_4a_bn_2[0][0]              
__________________________________________________________________________________________________
block_4a_conv_3 (Conv2D)        (None, 2048, 32, 32) 1050624     block_4a_relu_2[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 2048, 32, 32) 2099200     block_3f_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_3 (BatchNormalizati (None, 2048, 32, 32) 8192        block_4a_conv_3[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 2048, 32, 32) 8192        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_14 (Add)                    (None, 2048, 32, 32) 0           block_4a_bn_3[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4a_relu (Activation)      (None, 2048, 32, 32) 0           add_14[0][0]                     
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 32, 32)  1049088     block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 32, 32)  2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (None, 512, 32, 32)  0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 32, 32)  2359808     block_4b_relu_1[0][0]            
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 32, 32)  2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
block_4b_relu_2 (Activation)    (None, 512, 32, 32)  0           block_4b_bn_2[0][0]              
__________________________________________________________________________________________________
block_4b_conv_3 (Conv2D)        (None, 2048, 32, 32) 1050624     block_4b_relu_2[0][0]            
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (None, 2048, 32, 32) 4196352     block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_3 (BatchNormalizati (None, 2048, 32, 32) 8192        block_4b_conv_3[0][0]            
__________________________________________________________________________________________________
block_4b_bn_shortcut (BatchNorm (None, 2048, 32, 32) 8192        block_4b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_15 (Add)                    (None, 2048, 32, 32) 0           block_4b_bn_3[0][0]              
                                                                 block_4b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4b_relu (Activation)      (None, 2048, 32, 32) 0           add_15[0][0]                     
__________________________________________________________________________________________________
block_4c_conv_1 (Conv2D)        (None, 512, 32, 32)  1049088     block_4b_relu[0][0]              
__________________________________________________________________________________________________
block_4c_bn_1 (BatchNormalizati (None, 512, 32, 32)  2048        block_4c_conv_1[0][0]            
__________________________________________________________________________________________________
block_4c_relu_1 (Activation)    (None, 512, 32, 32)  0           block_4c_bn_1[0][0]              
__________________________________________________________________________________________________
block_4c_conv_2 (Conv2D)        (None, 512, 32, 32)  2359808     block_4c_relu_1[0][0]            
__________________________________________________________________________________________________
block_4c_bn_2 (BatchNormalizati (None, 512, 32, 32)  2048        block_4c_conv_2[0][0]            
__________________________________________________________________________________________________
block_4c_relu_2 (Activation)    (None, 512, 32, 32)  0           block_4c_bn_2[0][0]              
__________________________________________________________________________________________________
block_4c_conv_3 (Conv2D)        (None, 2048, 32, 32) 1050624     block_4c_relu_2[0][0]            
__________________________________________________________________________________________________
block_4c_conv_shortcut (Conv2D) (None, 2048, 32, 32) 4196352     block_4b_relu[0][0]              
__________________________________________________________________________________________________
block_4c_bn_3 (BatchNormalizati (None, 2048, 32, 32) 8192        block_4c_conv_3[0][0]            
__________________________________________________________________________________________________
block_4c_bn_shortcut (BatchNorm (None, 2048, 32, 32) 8192        block_4c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_16 (Add)                    (None, 2048, 32, 32) 0           block_4c_bn_3[0][0]              
                                                                 block_4c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4c_relu (Activation)      (None, 2048, 32, 32) 0           add_16[0][0]                     
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 4, 32, 32)    8196        block_4c_relu[0][0]              
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 1, 32, 32)    2049        block_4c_relu[0][0]              
==================================================================================================
Total params: 38,203,269
Trainable params: 37,772,165
Non-trainable params: 431,104
__________________________________________________________________________________________________
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2021-01-16 15:17:11,713 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 7073, number of sources: 1, batch size per gpu: 16, steps: 443
2021-01-16 15:17:11,811 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-01-16 15:17:11.842567: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:11.843447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
2021-01-16 15:17:11.843493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:17:11.843554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-16 15:17:11.843600: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-16 15:17:11.843634: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-16 15:17:11.843660: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-16 15:17:11.843699: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-16 15:17:11.843729: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-16 15:17:11.843857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:11.844749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:11.845525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-01-16 15:17:12,067 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2021-01-16 15:17:12,073 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-01-16 15:17:12,073 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-01-16 15:17:12,573 [INFO] iva.detectnet_v2.scripts.train: Found 7073 samples in training set
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2021-01-16 15:17:17,696 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 4715, number of sources: 1, batch size per gpu: 16, steps: 295
2021-01-16 15:17:17,728 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-01-16 15:17:17,971 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2021-01-16 15:17:17,976 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-01-16 15:17:17,976 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-01-16 15:17:18,316 [INFO] iva.detectnet_v2.scripts.train: Found 4715 samples in validation set
2021-01-16 15:17:28.242674: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.243559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
2021-01-16 15:17:28.243622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-16 15:17:28.243721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-16 15:17:28.243780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-16 15:17:28.243809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-16 15:17:28.243834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-16 15:17:28.243865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-16 15:17:28.243889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-16 15:17:28.244021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.244898: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.245681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-01-16 15:17:28.716838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-16 15:17:28.716878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-01-16 15:17:28.716896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-01-16 15:17:28.717189: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.718163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-16 15:17:28.718966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13906 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2021-01-16 15:17:30.424008: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
	 [[{{node save/RestoreV2}}]]
  (1) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
	 [[{{node save/RestoreV2}}]]
	 [[save/RestoreV2/_945]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1290, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[save/RestoreV2/_945]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
  File "<decorator-gen-2>", line 2, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 147, in run_training_loop
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 638, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 229, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 599, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1300, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1618, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
  File "<decorator-gen-2>", line 2, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 147, in run_training_loop
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 204, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1306, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Not found: Key block_1a_bn_3/beta/Adam not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[save/RestoreV2/_945]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 55, in main
  File "<decorator-gen-2>", line 2, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 773, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 691, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 624, in train_gridbox
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 147, in run_training_loop
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py", line 143, in get_singular_monitored_session
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1104, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 727, in __init__
    self._sess = self._coordinated_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 638, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 229, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 599, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Morganh · January 16, 2021, 3:23pm

For your latest error, please remove the previous results folder, or specify a new results folder name in the command line.
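
The restore failure above ("Key block_1a_bn_3/beta/Adam not found in checkpoint") is consistent with the session picking up an older checkpoint from the results folder and trying to restore it into a graph that no longer matches. A minimal sketch of choosing an unused results folder before launching training is shown here; the helper `fresh_results_dir` and the `_runN` suffix are hypothetical illustrations, not part of TLT.

```python
import os
import tempfile


def fresh_results_dir(base):
    """Return `base` if it is absent or empty, else the first unused `base_runN`.

    Hypothetical helper: the `_runN` naming is an illustration only.
    """
    if not os.path.isdir(base) or not os.listdir(base):
        return base
    n = 1
    while os.path.exists(f"{base}_run{n}"):
        n += 1
    return f"{base}_run{n}"


# Self-contained demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    results = os.path.join(tmp, "results")
    os.makedirs(results)
    # Simulate a stale checkpoint file left over from an earlier run.
    open(os.path.join(results, "model.ckpt-0.index"), "w").close()
    chosen = fresh_results_dir(results)
    print(os.path.basename(chosen))  # results_run1
```

The returned path would then be passed as the results folder on the training command line.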

Sneaky_Turtle · January 16, 2021, 3:26pm

@Morganh

/task_progress_monitor_hook.pyc: Epoch 0/80: loss: 0.06440 Time taken: 0:00:00 ETA: 0:00:00
2021-01-16 15:23:29,731 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.173
2021-01-16 15:24:11,104 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 7.271
2021-01-16 15:24:49,883 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 10.315
2021-01-16 15:25:24,724 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 11.481
2021-01-16 15:26:00,042 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 11.326

Last error was due to the checkpoint.

Now the loss is still kind of low.

Morganh · January 16, 2021, 3:29pm

The loss is expected to keep decreasing as training proceeds. The evaluation result is reported every 10 epochs, so you can wait for that.

Sneaky_Turtle · January 16, 2021, 3:33pm

@Morganh

New Error:

Stats:
Limit: 120324096
InUse: 110412800
MaxInUse: 110412800
NumAllocs: 225
MaxAllocSize: 18560768

2021-01-16 15:32:12.257267: W tensorflow/core/common_runtime/bfc_allocator.cc:424] xxxxxxxxxx********************
2021-01-16 15:32:12.257298: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at assign_op.h:117 : Resource exhausted: OOM when allocating tensor with shape[256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2021-01-16 15:32:12.258278: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 9.45M (9911296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-01-16 15:32:12.259194: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 9.45M (9911296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Does batch size need to be reduced?

Morganh · January 16, 2021, 3:35pm

Yes, you can try a lower batch size in case of OOM.

Sneaky_Turtle · January 16, 2021, 3:36pm

Reduced batch size to 1:

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

Morganh · January 16, 2021, 3:43pm

For the AWS case, see: TLT Detectnet TrafficCamNet training not working - #9 by Morganh

Sneaky_Turtle · January 16, 2021, 3:52pm

@Morganh

I think the issue was that I had exited out of training with ctrl+z and the memory was still full.
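
Ctrl+Z suspends a foreground process rather than terminating it, so a suspended training job can keep its GPU memory allocated. A rough sketch for spotting such leftovers follows. It assumes `nvidia-smi` is on the PATH and relies on its `--query-compute-apps` CSV output; the function names are hypothetical, and inside a container the process list may be incomplete.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows (MiB, no header, no units)."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip blank or unparsable rows (e.g. "[N/A]")
        rows.append((int(parts[0]), int(parts[1])))
    return rows


def gpu_processes():
    """Ask nvidia-smi which processes currently hold GPU memory."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)


# Parsing demo on made-up sample text (not real output from this thread).
sample = "4121, 10843\n5310, 211\n"
print(parse_compute_apps(sample))  # [(4121, 10843), (5310, 211)]
```

Calling `gpu_processes()` on the training machine would list (PID, MiB) pairs; a PID belonging to a suspended job can then be terminated before retrying.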

Training has now started after I exited Docker and restarted.

Loss looks good for the first epoch. I found a spot in the spec where I hadn’t changed “car” to “bird”.

/task_progress_monitor_hook.pyc: Epoch 0/80: loss: 9.40944 Time taken: 0:00:00 ETA: 0:00:00
2021-01-16 15:46:43,815 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.190

@Morganh

Thanks for responding and helping. 10/10 on the support from you.

Sneaky_Turtle · January 16, 2021, 6:03pm

Precision is at 0%.

Validation cost: 0.002756
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
bird                                   0

Morganh · January 17, 2021, 2:40am

Can you attach your latest training spec?
I see that you were training a resnet18 network, but using a resnet50 pretrained model.

Please attach your full training log as well.
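
The mismatch described above (`num_layers: 18` together with a `resnet50.hdf5` pretrained file) can be caught before training starts. A small sketch, assuming the pretrained file name embeds the depth as `resnet<N>` as it does in this thread; `check_backbone_depth` is a hypothetical helper, not a TLT function.

```python
import posixpath
import re


def check_backbone_depth(spec_text):
    """Return (num_layers, depth_in_filename, consistent) for a spec string."""
    layers_m = re.search(r"num_layers:\s*(\d+)", spec_text)
    file_m = re.search(r"pretrained_model_file:\s*(\S+)", spec_text)
    num_layers = int(layers_m.group(1)) if layers_m else None
    file_depth = None
    if file_m:
        # Look only at the file name, not at the directories above it.
        depth_m = re.search(r"resnet(\d+)", posixpath.basename(file_m.group(1)))
        file_depth = int(depth_m.group(1)) if depth_m else None
    consistent = num_layers is not None and num_layers == file_depth
    return num_layers, file_depth, consistent


# Values taken from the first spec posted in this thread.
first_spec = (
    'pretrained_model_file: "/data/tlt_resnet50_detectnetv2_v1/resnet50.hdf5"\n'
    "num_layers: 18\n"
)
print(check_backbone_depth(first_spec))  # (18, 50, False)
```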

Sneaky_Turtle · January 17, 2021, 2:43am

model_config {
arch: "resnet"
pretrained_model_file: "/data/tlt_resnet50_detectnetv2_v1/resnet50.hdf5"
freeze_blocks: 0
freeze_blocks: 1
all_projections: True
num_layers: 50
use_pooling: False
use_batch_norm: True
dropout_rate: 0.0
training_precision: {
backend_floatx: FLOAT32
}
objective_set: {
cov {}
bbox {
scale: 35.0
offset: 0.5
}
}
}

bbox_rasterizer_config {
target_class_config {
key: "bird"
value: {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.67
}

postprocessing_config {
target_class_config {
key: "bird"
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
}

cost_function_config {
target_classes {
name: "bird"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: True
max_objective_weight: 0.9999
min_objective_weight: 0.0001
}

training_config {
batch_size_per_gpu: 16
num_epochs: 80
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-6
max_learning_rate: 5e-4
soft_start: 0.1
annealing: 0.7
}
}
regularizer {
type: L1
weight: 3e-9
}
optimizer {
adam {
epsilon: 1e-08
beta1: 0.9
beta2: 0.999
}
}
cost_scaling {
enabled: False
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
}

augmentation_config {
preprocessing {
output_image_width: 512
output_image_height: 512
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
color_shift_stddev: 0.0
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.1
contrast_center: 0.5
}
}

evaluation_config {
average_precision_mode: INTEGRATE
validation_period_during_training: 10
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: "bird"
value: 0.7
}

evaluation_box_config {
key: "bird"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
}

dataset_config {
data_sources: {
tfrecords_path: "/data/tfrecords/*"
image_directory_path: "/data/images/"
}
image_extension: "jpg"
target_class_mapping {
key: "bird"
value: "bird""
}
validation_fold: 0
}

Morganh · January 17, 2021, 2:45am

Please attach your full training log as well.
Attach it as a file.

Sneaky_Turtle · January 17, 2021, 2:50am

Train Log
Spec File
example KITTI Label
example Image

trainlog1.16.21 (111.2 KB) train.txt (2.8 KB) 0.txt (64 Bytes)

Morganh · January 17, 2021, 2:53am

target_class_mapping {
key: "bird"
value: "bird""
}

Please modify

value: "bird""

to

value: "bird"
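
A stray quote like this is easy to miss by eye. A minimal sketch of a line-by-line check for an odd number of double quotes is shown here; `find_unbalanced_quotes` is a hypothetical helper, and it treats the curly quotes produced by forum rendering as ordinary quotes.

```python
def find_unbalanced_quotes(spec_text):
    """Return (line_number, stripped_line) for lines with an odd quote count."""
    bad = []
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        # Normalize curly quotes so pasted forum text is handled too.
        normalized = line.replace("\u201c", '"').replace("\u201d", '"')
        if normalized.count('"') % 2 == 1:
            bad.append((lineno, line.strip()))
    return bad


snippet = 'target_class_mapping {\n  key: "bird"\n  value: "bird""\n}\n'
print(find_unbalanced_quotes(snippet))  # [(3, 'value: "bird""')]
```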

Sneaky_Turtle · January 17, 2021, 3:00am

The extra " on the target class mapping was a copy-paste error in my post; it is not in the spec file I am actually using.

Correct spec being used:
train-spec.txt (2.8 KB)

Morganh · January 17, 2021, 3:13am

Have you resized all your images to 512x512, and also modified their corresponding labels?

Sneaky_Turtle · January 17, 2021, 3:21am

@Morganh Yes.

Script to resize by padding (all original images were smaller than 512x512):

Source text: appended_record01.txt (1.3 MB)

import os
from PIL import Image, ImageOps

# fp, root_dir, img_write_dir, label_write_dir and int_counter are set up earlier in the script (not shown).
# Pad each image to 512x512 by adding a border on the right and bottom only.
for line in fp:
    filename = str(int_counter) + ".jpeg"
    line_array = line.split(" ")
    img = Image.open(os.path.join(root_dir, line_array[2], line_array[3]))
    int_x_dims = int(line_array[1])
    int_y_dims = int(line_array[0])
    x_pad_val_added = 512 - int_x_dims
    y_pad_val_added = 512 - int_y_dims
    # border = (left, top, right, bottom)
    bimg = ImageOps.expand(img, border=(0, 0, x_pad_val_added, y_pad_val_added))
    bimg.save(os.path.join(img_write_dir, filename))
    int_counter += 1
    print("Image padded successfully!")

'''
# Write annotations for padded images - check if float or not on dims
for line in fp:
    filename = str(int_counter) + ".txt"
    line_array = line.split(" ")
    # KITTI fields: class, truncation, occlusion, alpha, xmin, ymin, xmax, ymax, then the 3D fields left at zero
    line_to_write_array = ["bird", 0.00, 0, 0.00, line_array[5], line_array[6], float(line_array[5]) + float(line_array[7]), float(line_array[6]) + float(line_array[8]), 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]
    label_line = " ".join(str(x) for x in line_to_write_array)
    print(label_line)
    # Use a context manager so each label file is flushed and closed.
    with open(os.path.join(label_write_dir, filename), "w") as file_to_write:
        file_to_write.write(label_line)
    int_counter += 1
    print("Image label generated successfully!")
'''

Example Output:

0.txt (64 Bytes)
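
To double-check that generated labels fit the padded 512x512 frames, each KITTI line can be validated. A sketch under the assumption that the fifth to eighth fields hold xmin, ymin, xmax, ymax, as in the example label posted at the start of the thread; `kitti_box_ok` is a hypothetical helper.

```python
def kitti_box_ok(label_line, width=512, height=512):
    """Check that a KITTI label's 2D box is non-empty and inside the image."""
    fields = label_line.split()
    if len(fields) < 8:
        return False
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    return 0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height


# Example label from the first post in this thread.
example = "bird 0.0 0 0.0 60.0 27.0 385.0 331.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
print(kitti_box_ok(example))  # True
```

Running this over every file in the label folder would flag boxes that fall outside the frame or have swapped corners.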
