Unable to train SSD-Resnet-34

Hi,

I am trying to train an SSD model with a ResNet-34 backbone.

Here is the training spec file:


random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  loss_loc_weight: 0.8
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 34
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.15
      annealing: 0.5
    }
  }
  regularizer {
    type: L1
    weight: 3e-06
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 32
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 1280
    output_image_height: 720
    output_image_channel: 3
    crop_right: 1280
    crop_bottom: 720
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/nitin/tlt-workspace/face_person_ssd/tf_records/*"
    image_directory_path: "/nitin/tlt-workspace/face_person_ssd/dataset"
  }
  image_extension: "jpg"
  target_class_mapping {
      key: "face"
      value: "face"
  }
  target_class_mapping {
      key: "person"
      value: "person"
  }
  validation_fold: 0
}
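
For reference, here is a rough sketch of the anchor box sizes that the `scales` and `aspect_ratios_global` values above would imply. It assumes the common Keras SSD convention (box sizes relative to the shorter image side, plus one extra square box per feature map at the geometric mean of neighbouring scales when `two_boxes_for_ar1` is true). The function names are my own, and the toolkit's internal anchor generation may differ in details.

```python
import math

# Values copied from the ssd_config and augmentation_config in the spec above.
SCALES = [0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]
ASPECT_RATIOS = [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0]
IMAGE_WIDTH, IMAGE_HEIGHT = 1280, 720
TWO_BOXES_FOR_AR1 = True


def anchor_sizes(scale, next_scale, aspect_ratios, two_boxes_for_ar1, base_size):
    """Return (width, height) in pixels for each anchor shape of one feature map.

    Assumption: box sizes are relative to base_size, the shorter image side.
    """
    boxes = []
    for ar in aspect_ratios:
        boxes.append((scale * base_size * math.sqrt(ar),
                      scale * base_size / math.sqrt(ar)))
        if ar == 1.0 and two_boxes_for_ar1:
            # Extra square box whose scale is the geometric mean of this
            # feature map's scale and the next one.
            extra = math.sqrt(scale * next_scale) * base_size
            boxes.append((extra, extra))
    return boxes


def all_anchor_sizes():
    """Anchor shapes for every feature map implied by SCALES."""
    base_size = min(IMAGE_WIDTH, IMAGE_HEIGHT)
    # len(SCALES) - 1 feature maps: each one uses its own scale plus the next.
    return [
        anchor_sizes(SCALES[i], SCALES[i + 1], ASPECT_RATIOS,
                     TWO_BOXES_FOR_AR1, base_size)
        for i in range(len(SCALES) - 1)
    ]


if __name__ == "__main__":
    for level, boxes in enumerate(all_anchor_sizes()):
        shapes = ", ".join("%.0fx%.0f" % (w, h) for w, h in boxes)
        print("feature map %d: %s" % (level, shapes))
```

Under these assumptions the seven scales describe six feature maps with six anchor shapes per location, which is a quick way to check whether the smallest anchors are a sensible match for small faces at 1280x720.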

Pretrained weights used:

nvidia/tlt_pretrained_object_detection:resnet34

Error:

Using TensorFlow backend.
2020-05-21 08:44:34,158 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-05-21 08:45:18,799 [INFO] iva.ssd.scripts.train: Loading pretrained weights. This may take a while...
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Input (InputLayer)              (32, 3, 720, 1280)   0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (32, 64, 360, 640)   9408        Input[0][0]                      
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (32, 64, 360, 640)   256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_2 (Activation)       (32, 64, 360, 640)   0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (32, 64, 180, 320)   36864       activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (32, 64, 180, 320)   256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (32, 64, 180, 320)   0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (32, 64, 180, 320)   36864       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (32, 64, 180, 320)   4096        activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (32, 64, 180, 320)   256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (32, 64, 180, 320)   256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_17 (Add)                    (32, 64, 180, 320)   0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1a_relu (Activation)      (32, 64, 180, 320)   0           add_17[0][0]                     
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (32, 64, 180, 320)   36864       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (32, 64, 180, 320)   256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
block_1b_relu_1 (Activation)    (32, 64, 180, 320)   0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (32, 64, 180, 320)   36864       block_1b_relu_1[0][0]            
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (32, 64, 180, 320)   4096        block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (32, 64, 180, 320)   256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
block_1b_bn_shortcut (BatchNorm (32, 64, 180, 320)   256         block_1b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_18 (Add)                    (32, 64, 180, 320)   0           block_1b_bn_2[0][0]              
                                                                 block_1b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1b_relu (Activation)      (32, 64, 180, 320)   0           add_18[0][0]                     
__________________________________________________________________________________________________
block_1c_conv_1 (Conv2D)        (32, 64, 180, 320)   36864       block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_1c_bn_1 (BatchNormalizati (32, 64, 180, 320)   256         block_1c_conv_1[0][0]            
__________________________________________________________________________________________________
block_1c_relu_1 (Activation)    (32, 64, 180, 320)   0           block_1c_bn_1[0][0]              
__________________________________________________________________________________________________
block_1c_conv_2 (Conv2D)        (32, 64, 180, 320)   36864       block_1c_relu_1[0][0]            
__________________________________________________________________________________________________
block_1c_conv_shortcut (Conv2D) (32, 64, 180, 320)   4096        block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_1c_bn_2 (BatchNormalizati (32, 64, 180, 320)   256         block_1c_conv_2[0][0]            
__________________________________________________________________________________________________
block_1c_bn_shortcut (BatchNorm (32, 64, 180, 320)   256         block_1c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_19 (Add)                    (32, 64, 180, 320)   0           block_1c_bn_2[0][0]              
                                                                 block_1c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1c_relu (Activation)      (32, 64, 180, 320)   0           add_19[0][0]                     
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (32, 128, 90, 160)   73728       block_1c_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (32, 128, 90, 160)   512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (32, 128, 90, 160)   0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (32, 128, 90, 160)   147456      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (32, 128, 90, 160)   8192        block_1c_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (32, 128, 90, 160)   512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (32, 128, 90, 160)   512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_20 (Add)                    (32, 128, 90, 160)   0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2a_relu (Activation)      (32, 128, 90, 160)   0           add_20[0][0]                     
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (32, 128, 90, 160)   147456      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (32, 128, 90, 160)   512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
block_2b_relu_1 (Activation)    (32, 128, 90, 160)   0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (32, 128, 90, 160)   147456      block_2b_relu_1[0][0]            
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (32, 128, 90, 160)   16384       block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (32, 128, 90, 160)   512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
block_2b_bn_shortcut (BatchNorm (32, 128, 90, 160)   512         block_2b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_21 (Add)                    (32, 128, 90, 160)   0           block_2b_bn_2[0][0]              
                                                                 block_2b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2b_relu (Activation)      (32, 128, 90, 160)   0           add_21[0][0]                     
__________________________________________________________________________________________________
block_2c_conv_1 (Conv2D)        (32, 128, 90, 160)   147456      block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_2c_bn_1 (BatchNormalizati (32, 128, 90, 160)   512         block_2c_conv_1[0][0]            
__________________________________________________________________________________________________
block_2c_relu_1 (Activation)    (32, 128, 90, 160)   0           block_2c_bn_1[0][0]              
__________________________________________________________________________________________________
block_2c_conv_2 (Conv2D)        (32, 128, 90, 160)   147456      block_2c_relu_1[0][0]            
__________________________________________________________________________________________________
block_2c_conv_shortcut (Conv2D) (32, 128, 90, 160)   16384       block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_2c_bn_2 (BatchNormalizati (32, 128, 90, 160)   512         block_2c_conv_2[0][0]            
__________________________________________________________________________________________________
block_2c_bn_shortcut (BatchNorm (32, 128, 90, 160)   512         block_2c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_22 (Add)                    (32, 128, 90, 160)   0           block_2c_bn_2[0][0]              
                                                                 block_2c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2c_relu (Activation)      (32, 128, 90, 160)   0           add_22[0][0]                     
__________________________________________________________________________________________________
block_2d_conv_1 (Conv2D)        (32, 128, 90, 160)   147456      block_2c_relu[0][0]              
__________________________________________________________________________________________________
block_2d_bn_1 (BatchNormalizati (32, 128, 90, 160)   512         block_2d_conv_1[0][0]            
__________________________________________________________________________________________________
block_2d_relu_1 (Activation)    (32, 128, 90, 160)   0           block_2d_bn_1[0][0]              
__________________________________________________________________________________________________
block_2d_conv_2 (Conv2D)        (32, 128, 90, 160)   147456      block_2d_relu_1[0][0]            
__________________________________________________________________________________________________
block_2d_conv_shortcut (Conv2D) (32, 128, 90, 160)   16384       block_2c_relu[0][0]              
__________________________________________________________________________________________________
block_2d_bn_2 (BatchNormalizati (32, 128, 90, 160)   512         block_2d_conv_2[0][0]            
__________________________________________________________________________________________________
block_2d_bn_shortcut (BatchNorm (32, 128, 90, 160)   512         block_2d_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_23 (Add)                    (32, 128, 90, 160)   0           block_2d_bn_2[0][0]              
                                                                 block_2d_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2d_relu (Activation)      (32, 128, 90, 160)   0           add_23[0][0]                     
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (32, 256, 45, 80)    294912      block_2d_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (32, 256, 45, 80)    1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (32, 256, 45, 80)    0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (32, 256, 45, 80)    589824      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (32, 256, 45, 80)    32768       block_2d_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (32, 256, 45, 80)    1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (32, 256, 45, 80)    1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_24 (Add)                    (32, 256, 45, 80)    0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3a_relu (Activation)      (32, 256, 45, 80)    0           add_24[0][0]                     
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (32, 256, 45, 80)    589824      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (32, 256, 45, 80)    1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (32, 256, 45, 80)    0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (32, 256, 45, 80)    589824      block_3b_relu_1[0][0]            
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (32, 256, 45, 80)    65536       block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (32, 256, 45, 80)    1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
block_3b_bn_shortcut (BatchNorm (32, 256, 45, 80)    1024        block_3b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_25 (Add)                    (32, 256, 45, 80)    0           block_3b_bn_2[0][0]              
                                                                 block_3b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3b_relu (Activation)      (32, 256, 45, 80)    0           add_25[0][0]                     
__________________________________________________________________________________________________
block_3c_conv_1 (Conv2D)        (32, 256, 45, 80)    589824      block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_3c_bn_1 (BatchNormalizati (32, 256, 45, 80)    1024        block_3c_conv_1[0][0]            
__________________________________________________________________________________________________
block_3c_relu_1 (Activation)    (32, 256, 45, 80)    0           block_3c_bn_1[0][0]              
__________________________________________________________________________________________________
block_3c_conv_2 (Conv2D)        (32, 256, 45, 80)    589824      block_3c_relu_1[0][0]            
__________________________________________________________________________________________________
block_3c_conv_shortcut (Conv2D) (32, 256, 45, 80)    65536       block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_3c_bn_2 (BatchNormalizati (32, 256, 45, 80)    1024        block_3c_conv_2[0][0]            
__________________________________________________________________________________________________
block_3c_bn_shortcut (BatchNorm (32, 256, 45, 80)    1024        block_3c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_26 (Add)                    (32, 256, 45, 80)    0           block_3c_bn_2[0][0]              
                                                                 block_3c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3c_relu (Activation)      (32, 256, 45, 80)    0           add_26[0][0]                     
__________________________________________________________________________________________________
block_3d_conv_1 (Conv2D)        (32, 256, 45, 80)    589824      block_3c_relu[0][0]              
__________________________________________________________________________________________________
block_3d_bn_1 (BatchNormalizati (32, 256, 45, 80)    1024        block_3d_conv_1[0][0]            
__________________________________________________________________________________________________
block_3d_relu_1 (Activation)    (32, 256, 45, 80)    0           block_3d_bn_1[0][0]              
__________________________________________________________________________________________________
block_3d_conv_2 (Conv2D)        (32, 256, 45, 80)    589824      block_3d_relu_1[0][0]            
__________________________________________________________________________________________________
block_3d_conv_shortcut (Conv2D) (32, 256, 45, 80)    65536       block_3c_relu[0][0]              
__________________________________________________________________________________________________
block_3d_bn_2 (BatchNormalizati (32, 256, 45, 80)    1024        block_3d_conv_2[0][0]            
__________________________________________________________________________________________________
block_3d_bn_shortcut (BatchNorm (32, 256, 45, 80)    1024        block_3d_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_27 (Add)                    (32, 256, 45, 80)    0           block_3d_bn_2[0][0]              
                                                                 block_3d_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3d_relu (Activation)      (32, 256, 45, 80)    0           add_27[0][0]                     
__________________________________________________________________________________________________
block_3e_conv_1 (Conv2D)        (32, 256, 45, 80)    589824      block_3d_relu[0][0]              
__________________________________________________________________________________________________
block_3e_bn_1 (BatchNormalizati (32, 256, 45, 80)    1024        block_3e_conv_1[0][0]            
__________________________________________________________________________________________________
block_3e_relu_1 (Activation)    (32, 256, 45, 80)    0           block_3e_bn_1[0][0]              
__________________________________________________________________________________________________
block_3e_conv_2 (Conv2D)        (32, 256, 45, 80)    589824      block_3e_relu_1[0][0]            
__________________________________________________________________________________________________
block_3e_conv_shortcut (Conv2D) (32, 256, 45, 80)    65536       block_3d_relu[0][0]              
__________________________________________________________________________________________________
block_3e_bn_2 (BatchNormalizati (32, 256, 45, 80)    1024        block_3e_conv_2[0][0]            
__________________________________________________________________________________________________
block_3e_bn_shortcut (BatchNorm (32, 256, 45, 80)    1024        block_3e_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_28 (Add)                    (32, 256, 45, 80)    0           block_3e_bn_2[0][0]              
                                                                 block_3e_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3e_relu (Activation)      (32, 256, 45, 80)    0           add_28[0][0]                     
__________________________________________________________________________________________________
block_3f_conv_1 (Conv2D)        (32, 256, 45, 80)    589824      block_3e_relu[0][0]              
__________________________________________________________________________________________________
block_3f_bn_1 (BatchNormalizati (32, 256, 45, 80)    1024        block_3f_conv_1[0][0]            
__________________________________________________________________________________________________
block_3f_relu_1 (Activation)    (32, 256, 45, 80)    0           block_3f_bn_1[0][0]              
__________________________________________________________________________________________________
block_3f_conv_2 (Conv2D)        (32, 256, 45, 80)    589824      block_3f_relu_1[0][0]            
__________________________________________________________________________________________________
block_3f_conv_shortcut (Conv2D) (32, 256, 45, 80)    65536       block_3e_relu[0][0]              
__________________________________________________________________________________________________
block_3f_bn_2 (BatchNormalizati (32, 256, 45, 80)    1024        block_3f_conv_2[0][0]            
__________________________________________________________________________________________________
block_3f_bn_shortcut (BatchNorm (32, 256, 45, 80)    1024        block_3f_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_29 (Add)                    (32, 256, 45, 80)    0           block_3f_bn_2[0][0]              
                                                                 block_3f_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3f_relu (Activation)      (32, 256, 45, 80)    0           add_29[0][0]                     
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (32, 512, 45, 80)    1179648     block_3f_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (32, 512, 45, 80)    2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (32, 512, 45, 80)    0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (32, 512, 45, 80)    2359296     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (32, 512, 45, 80)    131072      block_3f_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (32, 512, 45, 80)    2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (32, 512, 45, 80)    2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_30 (Add)                    (32, 512, 45, 80)    0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4a_relu (Activation)      (32, 512, 45, 80)    0           add_30[0][0]                     
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (32, 512, 45, 80)    2359296     block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (32, 512, 45, 80)    2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (32, 512, 45, 80)    0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (32, 512, 45, 80)    2359296     block_4b_relu_1[0][0]            
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (32, 512, 45, 80)    262144      block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (32, 512, 45, 80)    2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
block_4b_bn_shortcut (BatchNorm (32, 512, 45, 80)    2048        block_4b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_31 (Add)                    (32, 512, 45, 80)    0           block_4b_bn_2[0][0]              
                                                                 block_4b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4b_relu (Activation)      (32, 512, 45, 80)    0           add_31[0][0]                     
__________________________________________________________________________________________________
block_4c_conv_1 (Conv2D)        (32, 512, 45, 80)    2359296     block_4b_relu[0][0]              
__________________________________________________________________________________________________
block_4c_bn_1 (BatchNormalizati (32, 512, 45, 80)    2048        block_4c_conv_1[0][0]            
__________________________________________________________________________________________________
block_4c_relu_1 (Activation)    (32, 512, 45, 80)    0           block_4c_bn_1[0][0]              
__________________________________________________________________________________________________
block_4c_conv_2 (Conv2D)        (32, 512, 45, 80)    2359296     block_4c_relu_1[0][0]            
__________________________________________________________________________________________________
block_4c_conv_shortcut (Conv2D) (32, 512, 45, 80)    262144      block_4b_relu[0][0]              
__________________________________________________________________________________________________
block_4c_bn_2 (BatchNormalizati (32, 512, 45, 80)    2048        block_4c_conv_2[0][0]            
__________________________________________________________________________________________________
block_4c_bn_shortcut (BatchNorm (32, 512, 45, 80)    2048        block_4c_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_32 (Add)                    (32, 512, 45, 80)    0           block_4c_bn_2[0][0]              
                                                                 block_4c_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4c_relu (Activation)      (32, 512, 45, 80)    0           add_32[0][0]                     
__________________________________________________________________________________________________
ssd_expand_block_0_conv_0 (Conv (32, 256, 45, 80)    131328      block_4c_relu[0][0]              
__________________________________________________________________________________________________
ssd_expand_block_0_relu_0 (ReLU (32, 256, 45, 80)    0           ssd_expand_block_0_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_0_conv_1 (Conv (32, 256, 45, 80)    589824      ssd_expand_block_0_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_0_bn_1 (BatchN (32, 256, 45, 80)    1024        ssd_expand_block_0_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_0_relu_1 (ReLU (32, 256, 45, 80)    0           ssd_expand_block_0_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_1_conv_0 (Conv (32, 128, 45, 80)    32896       ssd_expand_block_0_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_relu_0 (ReLU (32, 128, 45, 80)    0           ssd_expand_block_1_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_conv_1 (Conv (32, 256, 23, 40)    294912      ssd_expand_block_1_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_bn_1 (BatchN (32, 256, 23, 40)    1024        ssd_expand_block_1_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_relu_1 (ReLU (32, 256, 23, 40)    0           ssd_expand_block_1_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_2_conv_0 (Conv (32, 64, 23, 40)     16448       ssd_expand_block_1_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_relu_0 (ReLU (32, 64, 23, 40)     0           ssd_expand_block_2_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_conv_1 (Conv (32, 128, 12, 20)    73728       ssd_expand_block_2_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_bn_1 (BatchN (32, 128, 12, 20)    512         ssd_expand_block_2_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_relu_1 (ReLU (32, 128, 12, 20)    0           ssd_expand_block_2_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_3_conv_0 (Conv (32, 64, 12, 20)     8256        ssd_expand_block_2_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_relu_0 (ReLU (32, 64, 12, 20)     0           ssd_expand_block_3_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_conv_1 (Conv (32, 128, 6, 10)     73728       ssd_expand_block_3_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_bn_1 (BatchN (32, 128, 6, 10)     512         ssd_expand_block_3_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_relu_1 (ReLU (32, 128, 6, 10)     0           ssd_expand_block_3_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_4_conv_0 (Conv (32, 64, 6, 10)      8256        ssd_expand_block_3_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_relu_0 (ReLU (32, 64, 6, 10)      0           ssd_expand_block_4_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_conv_1 (Conv (32, 128, 3, 5)      73728       ssd_expand_block_4_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_bn_1 (BatchN (32, 128, 3, 5)      512         ssd_expand_block_4_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_relu_1 (ReLU (32, 128, 3, 5)      0           ssd_expand_block_4_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_conf_0 (Conv2D)             (32, 12, 90, 160)    13836       block_2d_relu[0][0]              
__________________________________________________________________________________________________
ssd_conf_1 (Conv2D)             (32, 12, 45, 80)     27660       ssd_expand_block_0_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_2 (Conv2D)             (32, 12, 23, 40)     27660       ssd_expand_block_1_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_3 (Conv2D)             (32, 12, 12, 20)     13836       ssd_expand_block_2_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_4 (Conv2D)             (32, 12, 6, 10)      13836       ssd_expand_block_3_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_5 (Conv2D)             (32, 12, 3, 5)       13836       ssd_expand_block_4_relu_1[0][0]  
__________________________________________________________________________________________________
permute_13 (Permute)            (32, 90, 160, 12)    0           ssd_conf_0[0][0]                 
__________________________________________________________________________________________________
permute_15 (Permute)            (32, 45, 80, 12)     0           ssd_conf_1[0][0]                 
__________________________________________________________________________________________________
permute_17 (Permute)            (32, 23, 40, 12)     0           ssd_conf_2[0][0]                 
__________________________________________________________________________________________________
permute_19 (Permute)            (32, 12, 20, 12)     0           ssd_conf_3[0][0]                 
__________________________________________________________________________________________________
permute_21 (Permute)            (32, 6, 10, 12)      0           ssd_conf_4[0][0]                 
__________________________________________________________________________________________________
permute_23 (Permute)            (32, 3, 5, 12)       0           ssd_conf_5[0][0]                 
__________________________________________________________________________________________________
ssd_loc_0 (Conv2D)              (32, 24, 90, 160)    27672       block_2d_relu[0][0]              
__________________________________________________________________________________________________
ssd_loc_1 (Conv2D)              (32, 24, 45, 80)     55320       ssd_expand_block_0_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_2 (Conv2D)              (32, 24, 23, 40)     55320       ssd_expand_block_1_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_3 (Conv2D)              (32, 24, 12, 20)     27672       ssd_expand_block_2_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_4 (Conv2D)              (32, 24, 6, 10)      27672       ssd_expand_block_3_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_5 (Conv2D)              (32, 24, 3, 5)       27672       ssd_expand_block_4_relu_1[0][0]  
__________________________________________________________________________________________________
conf_reshape_0 (Reshape)        (32, 86400, 1, 2)    0           permute_13[0][0]                 
__________________________________________________________________________________________________
conf_reshape_1 (Reshape)        (32, 21600, 1, 2)    0           permute_15[0][0]                 
__________________________________________________________________________________________________
conf_reshape_2 (Reshape)        (32, 5520, 1, 2)     0           permute_17[0][0]                 
__________________________________________________________________________________________________
conf_reshape_3 (Reshape)        (32, 1440, 1, 2)     0           permute_19[0][0]                 
__________________________________________________________________________________________________
conf_reshape_4 (Reshape)        (32, 360, 1, 2)      0           permute_21[0][0]                 
__________________________________________________________________________________________________
conf_reshape_5 (Reshape)        (32, 90, 1, 2)       0           permute_23[0][0]                 
__________________________________________________________________________________________________
permute_14 (Permute)            (32, 90, 160, 24)    0           ssd_loc_0[0][0]                  
__________________________________________________________________________________________________
permute_16 (Permute)            (32, 45, 80, 24)     0           ssd_loc_1[0][0]                  
__________________________________________________________________________________________________
permute_18 (Permute)            (32, 23, 40, 24)     0           ssd_loc_2[0][0]                  
__________________________________________________________________________________________________
permute_20 (Permute)            (32, 12, 20, 24)     0           ssd_loc_3[0][0]                  
__________________________________________________________________________________________________
permute_22 (Permute)            (32, 6, 10, 24)      0           ssd_loc_4[0][0]                  
__________________________________________________________________________________________________
permute_24 (Permute)            (32, 3, 5, 24)       0           ssd_loc_5[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_0 (AnchorBoxes)      (32, 14400, 6, 8)    0           ssd_loc_0[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_1 (AnchorBoxes)      (32, 3600, 6, 8)     0           ssd_loc_1[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_2 (AnchorBoxes)      (32, 920, 6, 8)      0           ssd_loc_2[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_3 (AnchorBoxes)      (32, 240, 6, 8)      0           ssd_loc_3[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_4 (AnchorBoxes)      (32, 60, 6, 8)       0           ssd_loc_4[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_5 (AnchorBoxes)      (32, 15, 6, 8)       0           ssd_loc_5[0][0]                  
__________________________________________________________________________________________________
mbox_conf (Concatenate)         (32, 115410, 1, 2)   0           conf_reshape_0[0][0]             
                                                                 conf_reshape_1[0][0]             
                                                                 conf_reshape_2[0][0]             
                                                                 conf_reshape_3[0][0]             
                                                                 conf_reshape_4[0][0]             
                                                                 conf_reshape_5[0][0]             
__________________________________________________________________________________________________
loc_reshape_0 (Reshape)         (32, 86400, 1, 4)    0           permute_14[0][0]                 
__________________________________________________________________________________________________
loc_reshape_1 (Reshape)         (32, 21600, 1, 4)    0           permute_16[0][0]                 
__________________________________________________________________________________________________
loc_reshape_2 (Reshape)         (32, 5520, 1, 4)     0           permute_18[0][0]                 
__________________________________________________________________________________________________
loc_reshape_3 (Reshape)         (32, 1440, 1, 4)     0           permute_20[0][0]                 
__________________________________________________________________________________________________
loc_reshape_4 (Reshape)         (32, 360, 1, 4)      0           permute_22[0][0]                 
__________________________________________________________________________________________________
loc_reshape_5 (Reshape)         (32, 90, 1, 4)       0           permute_24[0][0]                 
__________________________________________________________________________________________________
anchor_reshape_0 (Reshape)      (32, 86400, 1, 8)    0           ssd_anchor_0[0][0]               
__________________________________________________________________________________________________
anchor_reshape_1 (Reshape)      (32, 21600, 1, 8)    0           ssd_anchor_1[0][0]               
__________________________________________________________________________________________________
anchor_reshape_2 (Reshape)      (32, 5520, 1, 8)     0           ssd_anchor_2[0][0]               
__________________________________________________________________________________________________
anchor_reshape_3 (Reshape)      (32, 1440, 1, 8)     0           ssd_anchor_3[0][0]               
__________________________________________________________________________________________________
anchor_reshape_4 (Reshape)      (32, 360, 1, 8)      0           ssd_anchor_4[0][0]               
__________________________________________________________________________________________________
anchor_reshape_5 (Reshape)      (32, 90, 1, 8)       0           ssd_anchor_5[0][0]               
__________________________________________________________________________________________________
mbox_conf_sigmoid (Activation)  (32, 115410, 1, 2)   0           mbox_conf[0][0]                  
__________________________________________________________________________________________________
mbox_loc (Concatenate)          (32, 115410, 1, 4)   0           loc_reshape_0[0][0]              
                                                                 loc_reshape_1[0][0]              
                                                                 loc_reshape_2[0][0]              
                                                                 loc_reshape_3[0][0]              
                                                                 loc_reshape_4[0][0]              
                                                                 loc_reshape_5[0][0]              
__________________________________________________________________________________________________
mbox_priorbox (Concatenate)     (32, 115410, 1, 8)   0           anchor_reshape_0[0][0]           
                                                                 anchor_reshape_1[0][0]           
                                                                 anchor_reshape_2[0][0]           
                                                                 anchor_reshape_3[0][0]           
                                                                 anchor_reshape_4[0][0]           
                                                                 anchor_reshape_5[0][0]           
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (32, 115410, 1, 14)  0           mbox_conf_sigmoid[0][0]          
                                                                 mbox_loc[0][0]                   
                                                                 mbox_priorbox[0][0]              
__________________________________________________________________________________________________
ssd_predictions (Reshape)       (32, 115410, 14)     0           concatenate_2[0][0]              
==================================================================================================
Total params: 23,865,304
Trainable params: 23,840,728
Non-trainable params: 24,576
__________________________________________________________________________________________________
2020-05-21 08:57:49,560 [INFO] iva.ssd.scripts.train: Number of images in the training dataset:	 18810
2020-05-21 08:57:49,561 [INFO] iva.ssd.scripts.train: Number of images in the validation dataset:	  3062
Epoch 1/120
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 182, in run_experiment
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,512,45,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node block_4b_bn_2_1/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node loss_1/add_70}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Please help. What exactly is the problem?

Thanks

This is an OOM (out-of-memory) issue: the GPU ran out of memory while allocating a tensor. Please consider one of the following:

  1. Lower the batch size.
  2. Resize the images/labels to a lower resolution.
  3. Run with multiple GPUs.
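
To give a rough sense of why the batch size matters, here is a small sketch that computes the size of the single tensor named in the OOM message, shape [32,512,45,80] of type float (assumed here to be 4-byte float32). The helper name `tensor_bytes` and the alternative batch size of 8 are only illustrative assumptions, not TLT settings or recommendations. Training keeps many activations of this shape alive at once (the summary above lists several block_4* layers with it), plus gradients, so the real footprint is much larger than this one number.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of one dense tensor (float32 by default)."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


MIB = 1024 ** 2

# Shape from the OOM message: (batch, channels, height, width).
oom_shape = (32, 512, 45, 80)
print(tensor_bytes(oom_shape) / MIB)      # 225.0 MiB for this one tensor

# Hypothetical smaller batch: only the first dimension changes,
# so the size of this tensor scales linearly with the batch size.
smaller_batch = (8, 512, 45, 80)
print(tensor_bytes(smaller_batch) / MIB)  # 56.25 MiB
```

Lowering the input resolution shrinks the last two dimensions in the same way, which is why options 1 and 2 both reduce memory use.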

I tried a lower batch size and also reduced the image size.

Error:

Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
2020-05-21 09:52:34,733 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
2020-05-21 09:52:34,743 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
2020-05-21 09:52:34,756 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
2020-05-21 09:52:34,807 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
2020-05-21 09:52:34,807 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
2020-05-21 09:52:34,811 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
2020-05-21 09:52:34,839 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
  File "./ssd/scripts/train.py", line 245, in main
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
: truncated record at 1223947
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
2020-05-21 09:52:34,852 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/train.txt
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "./ssd/scripts/train.py", line 245, in main
  File "./ssd/scripts/train.py", line 95, in run_experiment
  File "./ssd/builders/model_builder.py", line 79, in build
  File "./ssd/builders/inputs_builder.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 187, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 138, in get_tfrecords_iterator
  File "./detectnet_v2/dataloader/utilities.py", line 114, in get_num_samples
  File "./detectnet_v2/dataloader/utilities.py", line 113, in <genexpr>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 489, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 1223947
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[47740,1],7]
  Exit code:    1
--------------------------------------------------------------------------
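
Every rank in the log above fails inside `get_num_samples` with `DataLossError: truncated record`, which suggests that at least one file matched by `tfrecords_path` ends partway through a record. A minimal sketch for locating such a file without TensorFlow follows. It assumes the standard TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) and only checks that each record fits in the file; it does not verify the CRCs. The function name `scan_tfrecord` is hypothetical and is not part of TLT.

```python
import os
import struct


def scan_tfrecord(path):
    """Walk the record framing of one TFRecord file.

    Returns (complete_records, bad_offset). bad_offset is None when the file
    ends cleanly on a record boundary, otherwise it is the byte offset of the
    first record that does not fit in the file.
    Assumption: standard TFRecord framing; CRC values are not verified.
    """
    size = os.path.getsize(path)
    count = 0
    offset = 0
    with open(path, "rb") as f:
        while offset < size:
            f.seek(offset)
            # uint64 payload length followed by the uint32 CRC of that length
            header = f.read(12)
            if len(header) < 12:
                return count, offset
            (length,) = struct.unpack("<Q", header[:8])
            # header + payload + uint32 CRC of the payload
            end = offset + 12 + length + 4
            if end > size:
                return count, offset
            count += 1
            offset = end
    return count, None
```

Calling `scan_tfrecord` on each file under the `tf_records` directory and printing the files whose second return value is not `None` should point to the shards that need to be regenerated.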

For “resize images/labels to lower resolution”, you need to write scripts to resize your current images and labels offline, and then set the width and height accordingly in the spec.
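
The label half of that offline resize can be sketched as below. This assumes the labels are KITTI-style text files in which fields 5 to 8 of each line hold `xmin ymin xmax ymax` in pixels; the helper name `scale_kitti_line` and the scale factors are illustrative only.

```python
def scale_kitti_line(line, sx, sy):
    """Scale the bounding box of one KITTI-style label line.

    sx and sy are new_width / old_width and new_height / old_height.
    Assumption: fields[4:8] are xmin, ymin, xmax, ymax in pixels.
    All other fields are passed through unchanged.
    """
    fields = line.split()
    x1, y1, x2, y2 = (float(v) for v in fields[4:8])
    fields[4:8] = [
        "%.2f" % (x1 * sx),
        "%.2f" % (y1 * sy),
        "%.2f" % (x2 * sx),
        "%.2f" % (y2 * sy),
    ]
    return " ".join(fields)


def scale_kitti_file(src_path, dst_path, sx, sy):
    """Rewrite every non-empty line of a label file with scaled boxes."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(scale_kitti_line(line, sx, sy) + "\n")
```

The images themselves have to be resized by the same factors with an image library, and `output_image_width` and `output_image_height` in the spec then need to match the new size.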

Yes, that is already taken care of.

What is the current resolution of your training images?

Morgan,

The images are resized to 300x300, with a batch size of 32 on a DGX-1, using ResNet-34. It froze; we will try to rerun the experiment and keep you posted.

Thanks for the info, Ravik. Please let me know if you need help with any issue.

Hi Morgan,

The issue is solved.