Invalid loss on YOLO v4 model with latest TAO release

I have recently upgraded to the latest TAO release (3.21.11) and attempted to train a YOLO v4 model. I initially used an experiment configuration that had already run successfully on the previous TAO release. When I ran it on the new release, I received an incredibly high initial loss value (~118000) followed by an immediate invalid loss and termination of the experiment. I then downgraded to the previous version to confirm it was not an experiment configuration issue, and training ran without problems on that release. Attached below is the experiment config used. Training was done on a GCP machine with 4 Tesla V100 GPUs, run in parallel across all of them.

yolov4_config {
      big_anchor_shape: "[(96, 96), (96, 167), (167, 96)]"
      mid_anchor_shape: "[(60, 60), (60, 104), (104, 60)]"
      small_anchor_shape: "[(27, 27), (27, 47), (47, 27)]"
      box_matching_iou: 0.5
      matching_neutral_box_iou: 0.5
      arch: "resnet"
      nlayers: 34
      arch_conv_blocks: 2
      loss_loc_weight: 5.0
      loss_neg_obj_weights: 50.0
      loss_class_weights: 1.0
      label_smoothing: 0.0
      big_grid_xy_extend: 0.05
      mid_grid_xy_extend: 0.1
      small_grid_xy_extend: 0.2
      freeze_bn: False
      freeze_blocks: 0
      force_relu: False
    }
    
    training_config {
      batch_size_per_gpu: 8
      num_epochs: 40
      enable_qat: False
      checkpoint_interval: 1
      learning_rate {
        soft_start_cosine_annealing_schedule {
          min_learning_rate: 2e-06
          max_learning_rate: 0.0001
          soft_start: 0.15
        }
      }
      regularizer {
        type: L2
        weight: 1e-05
      }
      optimizer {
        adam {
          epsilon: 1e-7
          beta1: 0.9
          beta2: 0.999
          amsgrad: false
        }
      }
      pretrain_model_path: "/workspace/TAO/pretrained_models/resnet_34.hdf5"
      
    }
    
    
    eval_config {
      
      average_precision_mode: SAMPLE
      batch_size: 8
      matching_iou_threshold: 0.5
    }
    
    nms_config {
      confidence_threshold: 0.001
      clustering_iou_threshold: 0.5
      top_k: 200
    }
    
    augmentation_config {
    
        hue: 0.1
        saturation: 1.5
        exposure: 1.5
        vertical_flip: 0.0
        horizontal_flip: 0.5
        jitter: 0.3
        output_width: 960
        output_height: 544
        output_channel: 3
        randomize_input_shape_period: 10
        mosaic_prob: 0.5
        mosaic_min_ratio: 0.2
        image_mean {
            key: 'b'
            value: 96.88478016757078
        }
        image_mean {
            key: 'g'
            value: 99.25680169936167
        }
        image_mean {
            key: 'r'
            value: 102.9482929187028
        }
        
    }
    
    dataset_config {
      data_sources: {
          label_directory_path: "/workspace/DAB/D_78/train/labels"
          image_directory_path: "/workspace/DAB/D_78/train/images"
      }
      include_difficult_in_training: false
      
            target_class_mapping {
                key: "p_1"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_2"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_3"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_4"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_5"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_6"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_7"
                value: "p"
            }
        
            target_class_mapping {
                key: "p_8"
                value: "p"
            }
        
            target_class_mapping {
                key: "r_1"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_2"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_3"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_4"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_5"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_6"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_7"
                value: "r"
            }
        
            target_class_mapping {
                key: "r_8"
                value: "r"
            }
        
      validation_data_sources: {
        label_directory_path: "/workspace/DAB/D_78/val/labels"
        image_directory_path: "/workspace/DAB/D_78/val/images"
      }
    }
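
For reference, my understanding of the soft_start_cosine_annealing_schedule in the spec above is roughly: the learning rate ramps from min_learning_rate up to max_learning_rate over the first soft_start fraction of training, then follows a cosine curve back down to min_learning_rate. A minimal sketch of that behaviour (assuming a linear warm-up; this is not TAO's exact implementation):

import math

def soft_start_cosine_annealing_lr(progress, min_lr=2e-06, max_lr=0.0001, soft_start=0.15):
    # progress: fraction of total training completed, in [0, 1]
    # Warm-up phase (assumed linear here): min_lr -> max_lr over the first
    # `soft_start` fraction of training.
    if progress < soft_start:
        return min_lr + (max_lr - min_lr) * (progress / soft_start)
    # Cosine annealing phase: max_lr -> min_lr over the remaining fraction.
    t = (progress - soft_start) / (1.0 - soft_start)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))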

Please continue the training. The loss will decrease.

As stated in the original post, the experiment reaches an invalid loss immediately and terminates, so training cannot be continued.

So it is a NaN loss during training.
Please try setting a smaller max_learning_rate or a smaller batch size.

Neither worked.

Please modify as below and retry.
loss_loc_weight: 1.0
loss_neg_obj_weights: 1.0
loss_class_weights: 1.0
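
For context on why the reported loss is so large with the original weights, here is a simplified sketch of how the weighted terms add up (an illustration only, not TAO's actual YOLOv4 loss code; the component values are made up):

def total_loss(loc_loss, neg_obj_loss, cls_loss,
               loss_loc_weight=1.0, loss_neg_obj_weights=1.0, loss_class_weights=1.0):
    # Weighted sum of the localization, negative-objectness and classification terms.
    return (loss_loc_weight * loc_loss
            + loss_neg_obj_weights * neg_obj_loss
            + loss_class_weights * cls_loss)

# Made-up component values, only to show the scaling effect of the weights:
raw = dict(loc_loss=2.0, neg_obj_loss=3.0, cls_loss=1.0)
print(total_loss(**raw, loss_loc_weight=5.0, loss_neg_obj_weights=50.0))  # 161.0 (original spec weights)
print(total_loss(**raw))                                                  # 6.0 (all weights set to 1.0)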

The suggested changes were made and the experiment was able to execute. However, the initial loss is still incredibly high compared to the identical experiment in the previous TAO version, and the time per epoch is also greatly affected:
TAO 3.0-21.11:
Initial loss ~90,000
Loss after first epoch ~30,000
Time per epoch ~23 minutes

TAO 3.0-21.08:
Initial loss ~50
Loss after first epoch ~30
Time per epoch ~12 minutes

There is also a loss in accuracy, though that may be due to a non-optimized loss weight factor.

For training speed and mAP, I will trigger experiments comparing 3.21.08 and 3.21.11 with the public KITTI dataset.
