NaN loss error

Hello,

Since the release of TLT 2, we have been struggling with a NaN loss error during training.

Tested architectures: SSD, RetinaNet

We have run several tests:

  • Changing the learning rate schedule
  • Changing the batch size
  • Training the same architecture on different data (with different image sizes)

We observe that most NaN loss errors appear when the batch size is set to 1 (but for some image sizes we have to set it to 1 to avoid an OOM error).

When we train a model with a batch size of 16, the training completes. If we change only the batch size to 1, the NaN loss error appears.

Thanks

Example of a spec file that produces the NaN loss error (the same spec works with a batch size of 16):

random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  loss_loc_weight: 0.8
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "mobilenet_v2"
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 1
  num_epochs: 100
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.01
      annealing: 0.1
    }
  }
  regularizer {
    type: L1
    weight: 3e-06
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 1
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 1200
    output_image_height: 400
    output_image_channel: 3
    crop_right: 1200
    crop_bottom: 400
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/training/data/tfrecords/kitti_trainval*"
    image_directory_path: "/workspace/training/data/train"
  }
  image_extension: "png"
  target_class_mapping {
    key: "myclass"
    value: "myclass"
  }
  validation_fold: 0
}

I’m a little confused. You mention that “train a batch size of 16, the training is completed”, so that training runs smoothly without OOM. Why do you say you “have to set bs to 1 to avoid OOM error”?
By the way, could you run “nvidia-smi” and paste the output? Thanks.

We have run a lot of tests:

  • Train with image size 1280x1024, we have to set the batch size to 1: NaN loss error

  • Train with image size 1200x400, we can set the batch size to 16: training completed

  • Train with same image size 1200x400, we just change the batch size to 1: NaN loss error

Output of nvidia-smi:

Mon May 25 11:00:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 21%   45C    P5    23W / 250W |    610MiB / 10986MiB |     30%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1235      G   /usr/lib/xorg/Xorg                           184MiB |
|    0      1413      G   /usr/bin/gnome-shell                         190MiB |
|    0      1865      G   /opt/----------------------------             16MiB |
|    0     19290      G   /usr/lib/firefox/firefox                       6MiB |
|    0     21083      G   /data/--------                               200MiB |
|    0     27856      G   /usr/lib/firefox/firefox                       6MiB |
+-----------------------------------------------------------------------------+

If possible, could you share the training specs of your three experiments? Or tell me how they differ: only the batch size, or the image size as well?
Please also share the full log that contains the NaN error. Thanks.

Also, your "annealing: 0.1" seems too small. I suggest increasing it to 0.5 or 0.7.

We have run a lot of tests, changing only the batch size, the image size, and/or the learning rate schedule.

We have tested with a larger annealing value; it does not work any better.
We think the problem is not related to annealing, because the NaN loss appears mainly during the first 10 epochs.
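
To make the schedule parameters under discussion concrete, here is a minimal Python sketch of a soft-start annealing schedule. It is not taken from TLT: the function name `soft_start_annealing_lr` is hypothetical, and the log-space interpolation between `min_learning_rate` and `max_learning_rate` is an assumption about how such a schedule is commonly shaped. It only illustrates where the warm-up, plateau, and annealing phases fall for given `soft_start` and `annealing` fractions.

```python
import math


def soft_start_annealing_lr(progress, min_lr, max_lr, soft_start, annealing):
    """Learning rate at a given training progress in [0, 1].

    Hypothetical helper: assumes log-space interpolation between
    min_lr and max_lr, which may differ from the actual TLT code.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    log_min, log_max = math.log(min_lr), math.log(max_lr)
    if soft_start > 0.0 and progress < soft_start:
        # Warm-up phase: ramp from min_lr up to max_lr.
        t = progress / soft_start
    elif progress < annealing or annealing >= 1.0:
        # Plateau phase: hold max_lr.
        t = 1.0
    else:
        # Annealing phase: decay from max_lr back down to min_lr.
        t = (1.0 - progress) / (1.0 - annealing)
    return math.exp(log_min + t * (log_max - log_min))


if __name__ == "__main__":
    # Values from the spec file in this thread (num_epochs: 100).
    spec = dict(min_lr=5e-5, max_lr=2e-2, soft_start=0.01, annealing=0.1)
    for epoch in (0, 1, 5, 10, 25, 50, 100):
        lr = soft_start_annealing_lr(epoch / 100, **spec)
        print(f"epoch {epoch:3d}: lr = {lr:.2e}")
```

Under this assumed form, with soft_start: 0.01 and annealing: 0.1 over 100 epochs, the learning rate reaches its maximum after roughly the first epoch and stays there until about epoch 10, after which it decays. Passing annealing=0.5 or annealing=0.7 to the same function shows how the suggested values would lengthen the plateau.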

One of the logs with NaN loss:

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
Using TensorFlow backend.
2020-05-05 12:58:06,232 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from /workspace/training/specs/ssd_train_mobilenetv2_kitti.txt
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-05-05 12:58:08,910 [INFO] iva.ssd.scripts.train: Loading pretrained weights. This may take a while...
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Input (InputLayer)              (1, 3, 384, 1184)    0                                            
__________________________________________________________________________________________________
conv1_pad (ZeroPadding2D)       (1, 3, 386, 1186)    0           Input[0][0]                      
__________________________________________________________________________________________________
conv1 (Conv2D)                  (1, 32, 192, 592)    864         conv1_pad[0][0]                  
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (1, 32, 192, 592)    128         conv1[0][0]                      
__________________________________________________________________________________________________
re_lu_1 (ReLU)                  (1, 32, 192, 592)    0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
expanded_conv_depthwise_pad (Ze (1, 32, 194, 594)    0           re_lu_1[0][0]                    
__________________________________________________________________________________________________
expanded_conv_depthwise (Depthw (1, 32, 192, 592)    288         expanded_conv_depthwise_pad[0][0]
__________________________________________________________________________________________________
expanded_conv_depthwise_bn (Bat (1, 32, 192, 592)    128         expanded_conv_depthwise[0][0]    
__________________________________________________________________________________________________
expanded_conv_relu (ReLU)       (1, 32, 192, 592)    0           expanded_conv_depthwise_bn[0][0] 
__________________________________________________________________________________________________
expanded_conv_project (Conv2D)  (1, 16, 192, 592)    512         expanded_conv_relu[0][0]         
__________________________________________________________________________________________________
expanded_conv_project_bn (Batch (1, 16, 192, 592)    64          expanded_conv_project[0][0]      
__________________________________________________________________________________________________
block_1_expand (Conv2D)         (1, 96, 192, 592)    1536        expanded_conv_project_bn[0][0]   
__________________________________________________________________________________________________
block_1_expand_bn (BatchNormali (1, 96, 192, 592)    384         block_1_expand[0][0]             
__________________________________________________________________________________________________
re_lu_2 (ReLU)                  (1, 96, 192, 592)    0           block_1_expand_bn[0][0]          
__________________________________________________________________________________________________
block_1_depthwise_pad (ZeroPadd (1, 96, 194, 594)    0           re_lu_2[0][0]                    
__________________________________________________________________________________________________
block_1_depthwise (DepthwiseCon (1, 96, 96, 296)     864         block_1_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_1_depthwise_bn (BatchNorm (1, 96, 96, 296)     384         block_1_depthwise[0][0]          
__________________________________________________________________________________________________
block_1_relu (ReLU)             (1, 96, 96, 296)     0           block_1_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_1_project (Conv2D)        (1, 24, 96, 296)     2304        block_1_relu[0][0]               
__________________________________________________________________________________________________
block_1_project_bn (BatchNormal (1, 24, 96, 296)     96          block_1_project[0][0]            
__________________________________________________________________________________________________
block_2_expand (Conv2D)         (1, 144, 96, 296)    3456        block_1_project_bn[0][0]         
__________________________________________________________________________________________________
block_2_expand_bn (BatchNormali (1, 144, 96, 296)    576         block_2_expand[0][0]             
__________________________________________________________________________________________________
re_lu_3 (ReLU)                  (1, 144, 96, 296)    0           block_2_expand_bn[0][0]          
__________________________________________________________________________________________________
block_2_depthwise_pad (ZeroPadd (1, 144, 98, 298)    0           re_lu_3[0][0]                    
__________________________________________________________________________________________________
block_2_depthwise (DepthwiseCon (1, 144, 96, 296)    1296        block_2_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_2_depthwise_bn (BatchNorm (1, 144, 96, 296)    576         block_2_depthwise[0][0]          
__________________________________________________________________________________________________
block_2_relu (ReLU)             (1, 144, 96, 296)    0           block_2_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_2_project (Conv2D)        (1, 24, 96, 296)     3456        block_2_relu[0][0]               
__________________________________________________________________________________________________
block_2_projected_inputs (Conv2 (1, 24, 96, 296)     576         block_1_project_bn[0][0]         
__________________________________________________________________________________________________
block_2_project_bn (BatchNormal (1, 24, 96, 296)     96          block_2_project[0][0]            
__________________________________________________________________________________________________
block_2_add (Add)               (1, 24, 96, 296)     0           block_2_projected_inputs[0][0]   
                                                                 block_2_project_bn[0][0]         
__________________________________________________________________________________________________
block_3_expand (Conv2D)         (1, 144, 96, 296)    3456        block_2_add[0][0]                
__________________________________________________________________________________________________
block_3_expand_bn (BatchNormali (1, 144, 96, 296)    576         block_3_expand[0][0]             
__________________________________________________________________________________________________
re_lu_4 (ReLU)                  (1, 144, 96, 296)    0           block_3_expand_bn[0][0]          
__________________________________________________________________________________________________
block_3_depthwise_pad (ZeroPadd (1, 144, 98, 298)    0           re_lu_4[0][0]                    
__________________________________________________________________________________________________
block_3_depthwise (DepthwiseCon (1, 144, 48, 148)    1296        block_3_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_3_depthwise_bn (BatchNorm (1, 144, 48, 148)    576         block_3_depthwise[0][0]          
__________________________________________________________________________________________________
block_3_relu (ReLU)             (1, 144, 48, 148)    0           block_3_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_3_project (Conv2D)        (1, 32, 48, 148)     4608        block_3_relu[0][0]               
__________________________________________________________________________________________________
block_3_project_bn (BatchNormal (1, 32, 48, 148)     128         block_3_project[0][0]            
__________________________________________________________________________________________________
block_4_expand (Conv2D)         (1, 192, 48, 148)    6144        block_3_project_bn[0][0]         
__________________________________________________________________________________________________
block_4_expand_bn (BatchNormali (1, 192, 48, 148)    768         block_4_expand[0][0]             
__________________________________________________________________________________________________
re_lu_5 (ReLU)                  (1, 192, 48, 148)    0           block_4_expand_bn[0][0]          
__________________________________________________________________________________________________
block_4_depthwise_pad (ZeroPadd (1, 192, 50, 150)    0           re_lu_5[0][0]                    
__________________________________________________________________________________________________
block_4_depthwise (DepthwiseCon (1, 192, 48, 148)    1728        block_4_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_4_depthwise_bn (BatchNorm (1, 192, 48, 148)    768         block_4_depthwise[0][0]          
__________________________________________________________________________________________________
block_4_relu (ReLU)             (1, 192, 48, 148)    0           block_4_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_4_project (Conv2D)        (1, 32, 48, 148)     6144        block_4_relu[0][0]               
__________________________________________________________________________________________________
block_4_projected_inputs (Conv2 (1, 32, 48, 148)     1024        block_3_project_bn[0][0]         
__________________________________________________________________________________________________
block_4_project_bn (BatchNormal (1, 32, 48, 148)     128         block_4_project[0][0]            
__________________________________________________________________________________________________
block_4_add (Add)               (1, 32, 48, 148)     0           block_4_projected_inputs[0][0]   
                                                                 block_4_project_bn[0][0]         
__________________________________________________________________________________________________
block_5_expand (Conv2D)         (1, 192, 48, 148)    6144        block_4_add[0][0]                
__________________________________________________________________________________________________
block_5_expand_bn (BatchNormali (1, 192, 48, 148)    768         block_5_expand[0][0]             
__________________________________________________________________________________________________
re_lu_6 (ReLU)                  (1, 192, 48, 148)    0           block_5_expand_bn[0][0]          
__________________________________________________________________________________________________
block_5_depthwise_pad (ZeroPadd (1, 192, 50, 150)    0           re_lu_6[0][0]                    
__________________________________________________________________________________________________
block_5_depthwise (DepthwiseCon (1, 192, 48, 148)    1728        block_5_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_5_depthwise_bn (BatchNorm (1, 192, 48, 148)    768         block_5_depthwise[0][0]          
__________________________________________________________________________________________________
block_5_relu (ReLU)             (1, 192, 48, 148)    0           block_5_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_5_project (Conv2D)        (1, 32, 48, 148)     6144        block_5_relu[0][0]               
__________________________________________________________________________________________________
block_5_projected_inputs (Conv2 (1, 32, 48, 148)     1024        block_4_add[0][0]                
__________________________________________________________________________________________________
block_5_project_bn (BatchNormal (1, 32, 48, 148)     128         block_5_project[0][0]            
__________________________________________________________________________________________________
block_5_add (Add)               (1, 32, 48, 148)     0           block_5_projected_inputs[0][0]   
                                                                 block_5_project_bn[0][0]         
__________________________________________________________________________________________________
block_6_expand (Conv2D)         (1, 192, 48, 148)    6144        block_5_add[0][0]                
__________________________________________________________________________________________________
block_6_expand_bn (BatchNormali (1, 192, 48, 148)    768         block_6_expand[0][0]             
__________________________________________________________________________________________________
re_lu_7 (ReLU)                  (1, 192, 48, 148)    0           block_6_expand_bn[0][0]          
__________________________________________________________________________________________________
block_6_depthwise_pad (ZeroPadd (1, 192, 50, 150)    0           re_lu_7[0][0]                    
__________________________________________________________________________________________________
block_6_depthwise (DepthwiseCon (1, 192, 24, 74)     1728        block_6_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_6_depthwise_bn (BatchNorm (1, 192, 24, 74)     768         block_6_depthwise[0][0]          
__________________________________________________________________________________________________
block_6_relu (ReLU)             (1, 192, 24, 74)     0           block_6_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_6_project (Conv2D)        (1, 64, 24, 74)      12288       block_6_relu[0][0]               
__________________________________________________________________________________________________
block_6_project_bn (BatchNormal (1, 64, 24, 74)      256         block_6_project[0][0]            
__________________________________________________________________________________________________
block_7_expand (Conv2D)         (1, 384, 24, 74)     24576       block_6_project_bn[0][0]         
__________________________________________________________________________________________________
block_7_expand_bn (BatchNormali (1, 384, 24, 74)     1536        block_7_expand[0][0]             
__________________________________________________________________________________________________
re_lu_8 (ReLU)                  (1, 384, 24, 74)     0           block_7_expand_bn[0][0]          
__________________________________________________________________________________________________
block_7_depthwise_pad (ZeroPadd (1, 384, 26, 76)     0           re_lu_8[0][0]                    
__________________________________________________________________________________________________
block_7_depthwise (DepthwiseCon (1, 384, 24, 74)     3456        block_7_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_7_depthwise_bn (BatchNorm (1, 384, 24, 74)     1536        block_7_depthwise[0][0]          
__________________________________________________________________________________________________
block_7_relu (ReLU)             (1, 384, 24, 74)     0           block_7_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_7_project (Conv2D)        (1, 64, 24, 74)      24576       block_7_relu[0][0]               
__________________________________________________________________________________________________
block_7_projected_inputs (Conv2 (1, 64, 24, 74)      4096        block_6_project_bn[0][0]         
__________________________________________________________________________________________________
block_7_project_bn (BatchNormal (1, 64, 24, 74)      256         block_7_project[0][0]            
__________________________________________________________________________________________________
block_7_add (Add)               (1, 64, 24, 74)      0           block_7_projected_inputs[0][0]   
                                                                 block_7_project_bn[0][0]         
__________________________________________________________________________________________________
block_8_expand (Conv2D)         (1, 384, 24, 74)     24576       block_7_add[0][0]                
__________________________________________________________________________________________________
block_8_expand_bn (BatchNormali (1, 384, 24, 74)     1536        block_8_expand[0][0]             
__________________________________________________________________________________________________
re_lu_9 (ReLU)                  (1, 384, 24, 74)     0           block_8_expand_bn[0][0]          
__________________________________________________________________________________________________
block_8_depthwise_pad (ZeroPadd (1, 384, 26, 76)     0           re_lu_9[0][0]                    
__________________________________________________________________________________________________
block_8_depthwise (DepthwiseCon (1, 384, 24, 74)     3456        block_8_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_8_depthwise_bn (BatchNorm (1, 384, 24, 74)     1536        block_8_depthwise[0][0]          
__________________________________________________________________________________________________
block_8_relu (ReLU)             (1, 384, 24, 74)     0           block_8_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_8_project (Conv2D)        (1, 64, 24, 74)      24576       block_8_relu[0][0]               
__________________________________________________________________________________________________
block_8_projected_inputs (Conv2 (1, 64, 24, 74)      4096        block_7_add[0][0]                
__________________________________________________________________________________________________
block_8_project_bn (BatchNormal (1, 64, 24, 74)      256         block_8_project[0][0]            
__________________________________________________________________________________________________
block_8_add (Add)               (1, 64, 24, 74)      0           block_8_projected_inputs[0][0]   
                                                                 block_8_project_bn[0][0]         
__________________________________________________________________________________________________
block_9_expand (Conv2D)         (1, 384, 24, 74)     24576       block_8_add[0][0]                
__________________________________________________________________________________________________
block_9_expand_bn (BatchNormali (1, 384, 24, 74)     1536        block_9_expand[0][0]             
__________________________________________________________________________________________________
re_lu_10 (ReLU)                 (1, 384, 24, 74)     0           block_9_expand_bn[0][0]          
__________________________________________________________________________________________________
block_9_depthwise_pad (ZeroPadd (1, 384, 26, 76)     0           re_lu_10[0][0]                   
__________________________________________________________________________________________________
block_9_depthwise (DepthwiseCon (1, 384, 24, 74)     3456        block_9_depthwise_pad[0][0]      
__________________________________________________________________________________________________
block_9_depthwise_bn (BatchNorm (1, 384, 24, 74)     1536        block_9_depthwise[0][0]          
__________________________________________________________________________________________________
block_9_relu (ReLU)             (1, 384, 24, 74)     0           block_9_depthwise_bn[0][0]       
__________________________________________________________________________________________________
block_9_project (Conv2D)        (1, 64, 24, 74)      24576       block_9_relu[0][0]               
__________________________________________________________________________________________________
block_9_projected_inputs (Conv2 (1, 64, 24, 74)      4096        block_8_add[0][0]                
__________________________________________________________________________________________________
block_9_project_bn (BatchNormal (1, 64, 24, 74)      256         block_9_project[0][0]            
__________________________________________________________________________________________________
block_9_add (Add)               (1, 64, 24, 74)      0           block_9_projected_inputs[0][0]   
                                                                 block_9_project_bn[0][0]         
__________________________________________________________________________________________________
block_10_expand (Conv2D)        (1, 384, 24, 74)     24576       block_9_add[0][0]                
__________________________________________________________________________________________________
block_10_expand_bn (BatchNormal (1, 384, 24, 74)     1536        block_10_expand[0][0]            
__________________________________________________________________________________________________
re_lu_11 (ReLU)                 (1, 384, 24, 74)     0           block_10_expand_bn[0][0]         
__________________________________________________________________________________________________
block_10_depthwise_pad (ZeroPad (1, 384, 26, 76)     0           re_lu_11[0][0]                   
__________________________________________________________________________________________________
block_10_depthwise (DepthwiseCo (1, 384, 24, 74)     3456        block_10_depthwise_pad[0][0]     
__________________________________________________________________________________________________
block_10_depthwise_bn (BatchNor (1, 384, 24, 74)     1536        block_10_depthwise[0][0]         
__________________________________________________________________________________________________
block_10_relu (ReLU)            (1, 384, 24, 74)     0           block_10_depthwise_bn[0][0]      
__________________________________________________________________________________________________
block_10_project (Conv2D)       (1, 96, 24, 74)      36864       block_10_relu[0][0]              
__________________________________________________________________________________________________
block_10_project_bn (BatchNorma (1, 96, 24, 74)      384         block_10_project[0][0]           
__________________________________________________________________________________________________
block_11_expand (Conv2D)        (1, 576, 24, 74)     55296       block_10_project_bn[0][0]        
__________________________________________________________________________________________________
block_11_expand_bn (BatchNormal (1, 576, 24, 74)     2304        block_11_expand[0][0]            
__________________________________________________________________________________________________
re_lu_12 (ReLU)                 (1, 576, 24, 74)     0           block_11_expand_bn[0][0]         
__________________________________________________________________________________________________
block_11_depthwise_pad (ZeroPad (1, 576, 26, 76)     0           re_lu_12[0][0]                   
__________________________________________________________________________________________________
block_11_depthwise (DepthwiseCo (1, 576, 24, 74)     5184        block_11_depthwise_pad[0][0]     
__________________________________________________________________________________________________
block_11_depthwise_bn (BatchNor (1, 576, 24, 74)     2304        block_11_depthwise[0][0]         
__________________________________________________________________________________________________
block_11_relu (ReLU)            (1, 576, 24, 74)     0           block_11_depthwise_bn[0][0]      
__________________________________________________________________________________________________
block_11_project (Conv2D)       (1, 96, 24, 74)      55296       block_11_relu[0][0]              
__________________________________________________________________________________________________
block_11_projected_inputs (Conv (1, 96, 24, 74)      9216        block_10_project_bn[0][0]        
__________________________________________________________________________________________________
block_11_project_bn (BatchNorma (1, 96, 24, 74)      384         block_11_project[0][0]           
__________________________________________________________________________________________________
block_11_add (Add)              (1, 96, 24, 74)      0           block_11_projected_inputs[0][0]  
                                                                 block_11_project_bn[0][0]        
__________________________________________________________________________________________________
block_12_expand (Conv2D)        (1, 576, 24, 74)     55296       block_11_add[0][0]               
__________________________________________________________________________________________________
block_12_expand_bn (BatchNormal (1, 576, 24, 74)     2304        block_12_expand[0][0]            
__________________________________________________________________________________________________
re_lu_13 (ReLU)                 (1, 576, 24, 74)     0           block_12_expand_bn[0][0]         
__________________________________________________________________________________________________
block_12_depthwise_pad (ZeroPad (1, 576, 26, 76)     0           re_lu_13[0][0]                   
__________________________________________________________________________________________________
block_12_depthwise (DepthwiseCo (1, 576, 24, 74)     5184        block_12_depthwise_pad[0][0]     
__________________________________________________________________________________________________
block_12_depthwise_bn (BatchNor (1, 576, 24, 74)     2304        block_12_depthwise[0][0]         
__________________________________________________________________________________________________
block_12_relu (ReLU)            (1, 576, 24, 74)     0           block_12_depthwise_bn[0][0]      
__________________________________________________________________________________________________
block_12_project (Conv2D)       (1, 96, 24, 74)      55296       block_12_relu[0][0]              
__________________________________________________________________________________________________
block_12_projected_inputs (Conv (1, 96, 24, 74)      9216        block_11_add[0][0]               
__________________________________________________________________________________________________
block_12_project_bn (BatchNorma (1, 96, 24, 74)      384         block_12_project[0][0]           
__________________________________________________________________________________________________
block_12_add (Add)              (1, 96, 24, 74)      0           block_12_projected_inputs[0][0]  
                                                                 block_12_project_bn[0][0]        
__________________________________________________________________________________________________
ssd_expand_block_1_conv_0 (Conv (1, 64, 24, 74)      6208        block_12_add[0][0]               
__________________________________________________________________________________________________
ssd_expand_block_1_relu_0 (ReLU (1, 64, 24, 74)      0           ssd_expand_block_1_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_conv_1 (Conv (1, 128, 12, 37)     73728       ssd_expand_block_1_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_bn_1 (BatchN (1, 128, 12, 37)     512         ssd_expand_block_1_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_1_relu_1 (ReLU (1, 128, 12, 37)     0           ssd_expand_block_1_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_2_conv_0 (Conv (1, 64, 12, 37)      8256        ssd_expand_block_1_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_relu_0 (ReLU (1, 64, 12, 37)      0           ssd_expand_block_2_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_conv_1 (Conv (1, 128, 6, 19)      73728       ssd_expand_block_2_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_bn_1 (BatchN (1, 128, 6, 19)      512         ssd_expand_block_2_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_2_relu_1 (ReLU (1, 128, 6, 19)      0           ssd_expand_block_2_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_3_conv_0 (Conv (1, 64, 6, 19)       8256        ssd_expand_block_2_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_relu_0 (ReLU (1, 64, 6, 19)       0           ssd_expand_block_3_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_conv_1 (Conv (1, 128, 3, 10)      73728       ssd_expand_block_3_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_bn_1 (BatchN (1, 128, 3, 10)      512         ssd_expand_block_3_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_3_relu_1 (ReLU (1, 128, 3, 10)      0           ssd_expand_block_3_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_expand_block_4_conv_0 (Conv (1, 64, 3, 10)       8256        ssd_expand_block_3_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_relu_0 (ReLU (1, 64, 3, 10)       0           ssd_expand_block_4_conv_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_conv_1 (Conv (1, 128, 2, 5)       73728       ssd_expand_block_4_relu_0[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_bn_1 (BatchN (1, 128, 2, 5)       512         ssd_expand_block_4_conv_1[0][0]  
__________________________________________________________________________________________________
ssd_expand_block_4_relu_1 (ReLU (1, 128, 2, 5)       0           ssd_expand_block_4_bn_1[0][0]    
__________________________________________________________________________________________________
ssd_conf_0 (Conv2D)             (1, 6, 48, 148)      10374       re_lu_7[0][0]                    
__________________________________________________________________________________________________
ssd_conf_1 (Conv2D)             (1, 6, 24, 74)       5190        block_12_add[0][0]               
__________________________________________________________________________________________________
ssd_conf_2 (Conv2D)             (1, 6, 12, 37)       6918        ssd_expand_block_1_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_3 (Conv2D)             (1, 6, 6, 19)        6918        ssd_expand_block_2_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_4 (Conv2D)             (1, 6, 3, 10)        6918        ssd_expand_block_3_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_conf_5 (Conv2D)             (1, 6, 2, 5)         6918        ssd_expand_block_4_relu_1[0][0]  
__________________________________________________________________________________________________
permute_13 (Permute)            (1, 48, 148, 6)      0           ssd_conf_0[0][0]                 
__________________________________________________________________________________________________
permute_15 (Permute)            (1, 24, 74, 6)       0           ssd_conf_1[0][0]                 
__________________________________________________________________________________________________
permute_17 (Permute)            (1, 12, 37, 6)       0           ssd_conf_2[0][0]                 
__________________________________________________________________________________________________
permute_19 (Permute)            (1, 6, 19, 6)        0           ssd_conf_3[0][0]                 
__________________________________________________________________________________________________
permute_21 (Permute)            (1, 3, 10, 6)        0           ssd_conf_4[0][0]                 
__________________________________________________________________________________________________
permute_23 (Permute)            (1, 2, 5, 6)         0           ssd_conf_5[0][0]                 
__________________________________________________________________________________________________
ssd_loc_0 (Conv2D)              (1, 24, 48, 148)     41496       re_lu_7[0][0]                    
__________________________________________________________________________________________________
ssd_loc_1 (Conv2D)              (1, 24, 24, 74)      20760       block_12_add[0][0]               
__________________________________________________________________________________________________
ssd_loc_2 (Conv2D)              (1, 24, 12, 37)      27672       ssd_expand_block_1_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_3 (Conv2D)              (1, 24, 6, 19)       27672       ssd_expand_block_2_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_4 (Conv2D)              (1, 24, 3, 10)       27672       ssd_expand_block_3_relu_1[0][0]  
__________________________________________________________________________________________________
ssd_loc_5 (Conv2D)              (1, 24, 2, 5)        27672       ssd_expand_block_4_relu_1[0][0]  
__________________________________________________________________________________________________
conf_reshape_0 (Reshape)        (1, 42624, 1, 1)     0           permute_13[0][0]                 
__________________________________________________________________________________________________
conf_reshape_1 (Reshape)        (1, 10656, 1, 1)     0           permute_15[0][0]                 
__________________________________________________________________________________________________
conf_reshape_2 (Reshape)        (1, 2664, 1, 1)      0           permute_17[0][0]                 
__________________________________________________________________________________________________
conf_reshape_3 (Reshape)        (1, 684, 1, 1)       0           permute_19[0][0]                 
__________________________________________________________________________________________________
conf_reshape_4 (Reshape)        (1, 180, 1, 1)       0           permute_21[0][0]                 
__________________________________________________________________________________________________
conf_reshape_5 (Reshape)        (1, 60, 1, 1)        0           permute_23[0][0]                 
__________________________________________________________________________________________________
permute_14 (Permute)            (1, 48, 148, 24)     0           ssd_loc_0[0][0]                  
__________________________________________________________________________________________________
permute_16 (Permute)            (1, 24, 74, 24)      0           ssd_loc_1[0][0]                  
__________________________________________________________________________________________________
permute_18 (Permute)            (1, 12, 37, 24)      0           ssd_loc_2[0][0]                  
__________________________________________________________________________________________________
permute_20 (Permute)            (1, 6, 19, 24)       0           ssd_loc_3[0][0]                  
__________________________________________________________________________________________________
permute_22 (Permute)            (1, 3, 10, 24)       0           ssd_loc_4[0][0]                  
__________________________________________________________________________________________________
permute_24 (Permute)            (1, 2, 5, 24)        0           ssd_loc_5[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_0 (AnchorBoxes)      (1, 7104, 6, 8)      0           ssd_loc_0[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_1 (AnchorBoxes)      (1, 1776, 6, 8)      0           ssd_loc_1[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_2 (AnchorBoxes)      (1, 444, 6, 8)       0           ssd_loc_2[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_3 (AnchorBoxes)      (1, 114, 6, 8)       0           ssd_loc_3[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_4 (AnchorBoxes)      (1, 30, 6, 8)        0           ssd_loc_4[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_5 (AnchorBoxes)      (1, 10, 6, 8)        0           ssd_loc_5[0][0]                  
__________________________________________________________________________________________________
mbox_conf (Concatenate)         (1, 56868, 1, 1)     0           conf_reshape_0[0][0]             
                                                                 conf_reshape_1[0][0]             
                                                                 conf_reshape_2[0][0]             
                                                                 conf_reshape_3[0][0]             
                                                                 conf_reshape_4[0][0]             
                                                                 conf_reshape_5[0][0]             
__________________________________________________________________________________________________
loc_reshape_0 (Reshape)         (1, 42624, 1, 4)     0           permute_14[0][0]                 
__________________________________________________________________________________________________
loc_reshape_1 (Reshape)         (1, 10656, 1, 4)     0           permute_16[0][0]                 
__________________________________________________________________________________________________
loc_reshape_2 (Reshape)         (1, 2664, 1, 4)      0           permute_18[0][0]                 
__________________________________________________________________________________________________
loc_reshape_3 (Reshape)         (1, 684, 1, 4)       0           permute_20[0][0]                 
__________________________________________________________________________________________________
loc_reshape_4 (Reshape)         (1, 180, 1, 4)       0           permute_22[0][0]                 
__________________________________________________________________________________________________
loc_reshape_5 (Reshape)         (1, 60, 1, 4)        0           permute_24[0][0]                 
__________________________________________________________________________________________________
anchor_reshape_0 (Reshape)      (1, 42624, 1, 8)     0           ssd_anchor_0[0][0]               
__________________________________________________________________________________________________
anchor_reshape_1 (Reshape)      (1, 10656, 1, 8)     0           ssd_anchor_1[0][0]               
__________________________________________________________________________________________________
anchor_reshape_2 (Reshape)      (1, 2664, 1, 8)      0           ssd_anchor_2[0][0]               
__________________________________________________________________________________________________
anchor_reshape_3 (Reshape)      (1, 684, 1, 8)       0           ssd_anchor_3[0][0]               
__________________________________________________________________________________________________
anchor_reshape_4 (Reshape)      (1, 180, 1, 8)       0           ssd_anchor_4[0][0]               
__________________________________________________________________________________________________
anchor_reshape_5 (Reshape)      (1, 60, 1, 8)        0           ssd_anchor_5[0][0]               
__________________________________________________________________________________________________
mbox_conf_sigmoid (Activation)  (1, 56868, 1, 1)     0           mbox_conf[0][0]                  
__________________________________________________________________________________________________
mbox_loc (Concatenate)          (1, 56868, 1, 4)     0           loc_reshape_0[0][0]              
                                                                 loc_reshape_1[0][0]              
                                                                 loc_reshape_2[0][0]              
                                                                 loc_reshape_3[0][0]              
                                                                 loc_reshape_4[0][0]              
                                                                 loc_reshape_5[0][0]              
__________________________________________________________________________________________________
mbox_priorbox (Concatenate)     (1, 56868, 1, 8)     0           anchor_reshape_0[0][0]           
                                                                 anchor_reshape_1[0][0]           
                                                                 anchor_reshape_2[0][0]           
                                                                 anchor_reshape_3[0][0]           
                                                                 anchor_reshape_4[0][0]           
                                                                 anchor_reshape_5[0][0]           
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (1, 56868, 1, 13)    0           mbox_conf_sigmoid[0][0]          
                                                                 mbox_loc[0][0]                   
                                                                 mbox_priorbox[0][0]              
__________________________________________________________________________________________________
ssd_predictions (Reshape)       (1, 56868, 13)       0           concatenate_2[0][0]              
==================================================================================================
Total params: 1,136,116
Trainable params: 1,118,964
Non-trainable params: 17,152
__________________________________________________________________________________________________
2020-05-05 12:59:49,180 [INFO] iva.ssd.scripts.train: Number of images in the training dataset:	  3398
2020-05-05 12:59:49,180 [INFO] iva.ssd.scripts.train: Number of images in the validation dataset:	   849
Epoch 1/80
3398/3398 [==============================] - 494s 145ms/step - loss: 10.2165

Epoch 00001: saving model to /workspace/training/exp/experiment_dir_unpruned/weights/ssd_mobilenet_v2_epoch_001.tlt
Epoch 2/80
3398/3398 [==============================] - 486s 143ms/step - loss: 2.1691

Epoch 00002: saving model to /workspace/training/exp/experiment_dir_unpruned/weights/ssd_mobilenet_v2_epoch_002.tlt
Epoch 3/80
3398/3398 [==============================] - 489s 144ms/step - loss: 1.7975

Epoch 00003: saving model to /workspace/training/exp/experiment_dir_unpruned/weights/ssd_mobilenet_v2_epoch_003.tlt
Epoch 4/80
3398/3398 [==============================] - 487s 143ms/step - loss: 1.5132

Epoch 00004: saving model to /workspace/training/exp/experiment_dir_unpruned/weights/ssd_mobilenet_v2_epoch_004.tlt
Epoch 5/80
3398/3398 [==============================] - 490s 144ms/step - loss: 1.7766

Epoch 00005: saving model to /workspace/training/exp/experiment_dir_unpruned/weights/ssd_mobilenet_v2_epoch_005.tlt
Epoch 6/80
1326/3398 [==========>...................] - ETA: 5:01 - loss: nan                Batch 1325: Invalid loss, terminating training

Epoch 00006: saving model to /workspace/training/exp/experiment_dir_unpruned/weights/ssd_mobilenet_v2_epoch_006.tlt
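
As a sanity check on the model summary above, the anchor count can be reproduced from the feature-map sizes and the spec. This is a minimal sketch: the feature-map shapes are read off the `ssd_conf_0` to `ssd_conf_5` rows, and six boxes per cell is assumed to come from the five `aspect_ratios_global` entries plus the extra box enabled by `two_boxes_for_ar1: true`. The variable names are illustrative only.

```python
# Feature-map sizes (height, width) of ssd_conf_0 .. ssd_conf_5 in the summary.
feature_maps = [(48, 148), (24, 74), (12, 37), (6, 19), (3, 10), (2, 5)]

# Five aspect ratios from the spec, plus one extra box for ratio 1.0
# (assumed effect of two_boxes_for_ar1: true).
aspect_ratios = [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0]
boxes_per_cell = len(aspect_ratios) + 1

# One anchor per box per feature-map cell.
anchors_per_map = [h * w * boxes_per_cell for h, w in feature_maps]
total_anchors = sum(anchors_per_map)

print(anchors_per_map)  # compare with conf_reshape_0 .. conf_reshape_5
print(total_anchors)    # compare with the second dimension of ssd_predictions
```

The per-map values match the `conf_reshape_*` rows and the total matches `ssd_predictions (1, 56868, 13)`, where 13 is 1 confidence value, 4 box offsets and 8 anchor values per box.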

The associated spec file:

random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  loss_loc_weight: 0.8
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "mobilenet_v2"
  freeze_bn: false
}
training_config {
  batch_size_per_gpu: 1
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.15
      annealing: 0.5
    }
  }
  regularizer {
    type: L1
    weight: 3e-06
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 1
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  preprocessing {
    output_image_width: 1184
    output_image_height: 384
    output_image_channel: 3
    crop_right: 1184
    crop_bottom: 384
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/training/data/tfrecords//kitti_trainval*"
    image_directory_path: "/workspace/training/data/train"
  }
  image_extension: "png"
  target_class_mapping {
    key: "myclass"
    value: "myclass"
  }
  validation_fold: 0
}
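
To relate the NaN in the log to this schedule, the sketch below estimates where in the learning rate schedule training stopped. It assumes that progress is the fraction of total iterations completed and that the ramps between `min_learning_rate` and `max_learning_rate` are log-linear; the actual TLT interpolation may differ, and the function name is hypothetical.

```python
import math


def soft_start_annealing_lr(progress, min_lr, max_lr, soft_start, annealing):
    """Learning rate at a training progress in [0, 1].

    Assumption: log-linear ramps between min_lr and max_lr.
    The real TLT scheduler may interpolate differently.
    """
    if progress < soft_start:
        t = progress / soft_start                 # warm-up: min_lr -> max_lr
    elif progress < annealing or annealing >= 1.0:
        t = 1.0                                   # plateau at max_lr
    else:
        t = (1.0 - progress) / (1.0 - annealing)  # anneal: max_lr -> min_lr
    log_min, log_max = math.log(min_lr), math.log(max_lr)
    return math.exp(log_min + t * (log_max - log_min))


# Values from the log and spec: 3398 steps per epoch, 80 epochs,
# NaN reported in epoch 6 at batch 1325.
steps_per_epoch = 3398
num_epochs = 80
nan_step = 5 * steps_per_epoch + 1325
progress = nan_step / (steps_per_epoch * num_epochs)

lr_at_nan = soft_start_annealing_lr(progress, 5e-5, 2e-2, 0.15, 0.5)
print(f"progress={progress:.4f}, estimated lr={lr_at_nan:.2e}")
```

Under these assumptions the failure occurs at roughly 6.7% of training, before `soft_start: 0.15` is reached, so the learning rate was still ramping up towards `max_learning_rate` when the loss became NaN.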

Hi,
Could you double-check your dataset?
If you resize your images to 1184x384, please make sure the labels are resized accordingly.
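
A minimal sketch of such a label check, assuming KITTI-format label lines where fields 4 to 7 (0-based) hold `xmin ymin xmax ymax`. The helper names, the sample label line and the 2368x768 source size are hypothetical and only serve as an illustration.

```python
def rescale_kitti_label(line, scale_x, scale_y):
    """Scale the bounding box of one KITTI label line by the resize factors."""
    fields = line.split()
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    fields[4:8] = [
        f"{xmin * scale_x:.2f}",
        f"{ymin * scale_y:.2f}",
        f"{xmax * scale_x:.2f}",
        f"{ymax * scale_y:.2f}",
    ]
    return " ".join(fields)


def box_is_valid(line, width, height, min_size=1.0):
    """Check that the box lies inside the image and is not degenerate."""
    xmin, ymin, xmax, ymax = (float(v) for v in line.split()[4:8])
    inside = 0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height
    large_enough = (xmax - xmin) >= min_size and (ymax - ymin) >= min_size
    return inside and large_enough


# Hypothetical label from a 2368x768 image that is resized to 1184x384.
original = "myclass 0.00 0 0.00 100.00 200.00 300.00 400.00 0 0 0 0 0 0 0"
resized = rescale_kitti_label(original, 1184 / 2368, 384 / 768)

print(resized)
print(box_is_valid(resized, width=1184, height=384))
```

Running `box_is_valid` over every label file after resizing flags boxes that fall outside the image or are smaller than the `min_bbox_width` / `min_bbox_height` of 1.0 set in the spec.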

You also mentioned that “Train with image size 1200x400, we can set the batch size to 16: training completed”. So does that setup work for you?

Hi,
We have tested with 3 different datasets, all of which work in TLT 1.

Sure, “Train with image size 1200x400, we can set the batch size to 16: training completed” is fine for us, but we have to set the batch size to 1 to train with bigger images (a different dataset), and in that case it doesn’t work.

Hi,
Could you please tell us the exact resolution of your bigger images?
Also, you mention that you “have to set the batch size to 1” to avoid OOM, right? Did you try other batch sizes, such as 2 or 4, that run without OOM and NaN?
I suspect that some hyper-parameters, such as max_lr and min_lr, need to be fine-tuned.
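
One common starting point for such tuning is the linear scaling heuristic: scale the learning rates in proportion to the batch size. The sketch below applies it to the values in the spec, taking the batch size of 16 that trained successfully as the reference. This is a general rule of thumb, not something TLT does automatically, and the function name is hypothetical.

```python
def scale_learning_rates(min_lr, max_lr, reference_batch_size, new_batch_size):
    """Scale both learning rate bounds linearly with the batch size."""
    factor = new_batch_size / reference_batch_size
    return min_lr * factor, max_lr * factor


# Spec values (min 5e-5, max 2e-2), assumed to be tuned for batch size 16.
for batch_size in (1, 2, 4):
    min_lr, max_lr = scale_learning_rates(5e-5, 2e-2, 16, batch_size)
    print(f"bs={batch_size}: min_learning_rate={min_lr:.3e}, "
          f"max_learning_rate={max_lr:.3e}")
```

The resulting values are only candidates to try in `soft_start_annealing_schedule`; whether they remove the NaN loss has to be verified by training.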

The resolution of the bigger images is 1280x1024.

We tried a batch size of 2 and still got the NaN loss, at a different epoch.

Did the optimizer for SSD change between TLT 1 and 2, from Adam to SGD for example?

It’s certainly only a matter of hyper-parameter tuning (finding the best learning rate schedule, batch size, etc.), but we didn’t have as many problems with TLT 1.

No, there is no change between the two versions. The optimizer is SGD.
You can also run the Jupyter notebook to compare the behavior of the two versions.

BTW, with your own dataset and the same spec, does training work in TLT 1 but not in TLT 2?