Problem in training UNet

Hi. I trained UNet for 300 epochs, but the results for the foreground class are always NaN. These are the results of the evaluation task:

"{'background': {'precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0, 'iou': 1.0}, 'foreground': {'precision': nan, 'Recall': nan, 'F1 Score': nan, 'iou': nan}}"

And this is my spec file:

random_seed: 42
model_config {
  model_input_width: 640
  model_input_height: 640
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 2
  epochs: 300
  log_summary_steps: 10
  checkpoint_interval: 5
  loss: "cross_dice_sum"
  learning_rate:0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/images/train/"
  train_masks_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/masks/train"

  val_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/images/val"
  val_masks_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/masks/val"

  test_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/images/test"

  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 1
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 0
    }
  }
}
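Why NaN rather than 0? Per-class precision, recall, F1 and IoU are ratios whose denominators count pixels of that class. If no ground-truth pixel equals the foreground label_id and the model never predicts it, every foreground ratio is 0/0. The sketch below is a hypothetical helper (class_metrics is not a TLT function) that reproduces the shape of the evaluation output above under that assumption; TLT's own evaluation code may differ in details.

```python
import math


def class_metrics(gt, pred, label_id):
    """Pixel-level metrics for one class over flattened label maps.

    Hypothetical illustration, not the TLT implementation: any ratio whose
    denominator is zero is reported as NaN.
    """
    pairs = list(zip(gt, pred))
    tp = sum(1 for g, p in pairs if g == label_id and p == label_id)
    fp = sum(1 for g, p in pairs if g != label_id and p == label_id)
    fn = sum(1 for g, p in pairs if g == label_id and p != label_id)

    def ratio(num, den):
        return num / den if den else math.nan

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
    }


# Ground truth in which no pixel carries value 1, and a model that predicts
# background everywhere: background scores 1.0 and foreground is all NaN.
gt = [0, 0, 0, 0]
pred = [0, 0, 0, 0]
print("background", class_metrics(gt, pred, 0))
print("foreground", class_metrics(gt, pred, 1))
```

If the mask files store the foreground as some value other than 1 (for example 255), label_id: 1 would never match a ground-truth pixel, which is one plausible route to this exact pattern; checking the raw mask values is a cheap first step.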
    

Can you share the training log?
Also, have you tried the UNet Jupyter notebook?

Hi. Thanks for your help.
Yes, I did, following the notebook.
This is my spec file for 10 epochs:

random_seed: 42
model_config {
  model_input_width: 640
  model_input_height: 640
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 2
  epochs: 10
  log_summary_steps: 10
  checkpoint_interval: 5
  loss: "cross_dice_sum"
  learning_rate:0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/images/train/"
  train_masks_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/masks/train"

  val_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/images/val"
  val_masks_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/masks/val"

  test_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/images/test"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 1
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 0
    }
  }
}

And this is the training log:

Loading experiment spec at /workspace/tlt/local_dir/tlt_unet_corrosion_resnet18/unpruned_model/final_spec.txt.
Running for 10 Epochs
Epoch: 0/10:, Cur-Step: 0, loss(cross_dice_sum): 1.09276, Running average loss:1.09276, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 10, loss(cross_dice_sum): 0.48246, Running average loss:0.95171, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 20, loss(cross_dice_sum): 0.05996, Running average loss:0.54133, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 30, loss(cross_dice_sum): 0.05802, Running average loss:0.38571, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 40, loss(cross_dice_sum): 0.05610, Running average loss:0.30553, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 50, loss(cross_dice_sum): 0.05424, Running average loss:0.25642, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 60, loss(cross_dice_sum): 0.05247, Running average loss:0.22311, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 70, loss(cross_dice_sum): 0.05080, Running average loss:0.19895, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 80, loss(cross_dice_sum): 0.04922, Running average loss:0.18055, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 90, loss(cross_dice_sum): 0.04774, Running average loss:0.16603, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/10:, Cur-Step: 100, loss(cross_dice_sum): 0.04636, Running average loss:0.15424, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 1/10:, Cur-Step: 110, loss(cross_dice_sum): 0.04506, Running average loss:0.04550, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 120, loss(cross_dice_sum): 0.04384, Running average loss:0.04488, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 130, loss(cross_dice_sum): 0.04269, Running average loss:0.04428, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 140, loss(cross_dice_sum): 0.04163, Running average loss:0.04371, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 150, loss(cross_dice_sum): 0.04062, Running average loss:0.04316, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 160, loss(cross_dice_sum): 0.03968, Running average loss:0.04263, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 170, loss(cross_dice_sum): 0.03880, Running average loss:0.04212, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 180, loss(cross_dice_sum): 0.03798, Running average loss:0.04164, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 190, loss(cross_dice_sum): 0.03720, Running average loss:0.04118, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 1/10:, Cur-Step: 200, loss(cross_dice_sum): 0.03647, Running average loss:0.04073, Time taken: 0:00:00.351211 ETA: 0:00:03.160900
Epoch: 2/10:, Cur-Step: 210, loss(cross_dice_sum): 0.03579, Running average loss:0.03592, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 220, loss(cross_dice_sum): 0.03514, Running average loss:0.03559, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 230, loss(cross_dice_sum): 0.03453, Running average loss:0.03528, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 240, loss(cross_dice_sum): 0.03396, Running average loss:0.03497, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 250, loss(cross_dice_sum): 0.03341, Running average loss:0.03468, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 260, loss(cross_dice_sum): 0.03290, Running average loss:0.03440, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 270, loss(cross_dice_sum): 0.03241, Running average loss:0.03413, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 280, loss(cross_dice_sum): 0.03195, Running average loss:0.03386, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 290, loss(cross_dice_sum): 0.03151, Running average loss:0.03361, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 2/10:, Cur-Step: 300, loss(cross_dice_sum): 0.03110, Running average loss:0.03337, Time taken: 0:00:00.440181 ETA: 0:00:03.521450
Epoch: 3/10:, Cur-Step: 310, loss(cross_dice_sum): 0.03070, Running average loss:0.03072, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 320, loss(cross_dice_sum): 0.03032, Running average loss:0.03053, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 330, loss(cross_dice_sum): 0.02996, Running average loss:0.03034, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 340, loss(cross_dice_sum): 0.02962, Running average loss:0.03017, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 350, loss(cross_dice_sum): 0.02929, Running average loss:0.02999, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 360, loss(cross_dice_sum): 0.02897, Running average loss:0.02982, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 370, loss(cross_dice_sum): 0.02866, Running average loss:0.02966, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 380, loss(cross_dice_sum): 0.02837, Running average loss:0.02950, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 390, loss(cross_dice_sum): 0.02809, Running average loss:0.02934, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 400, loss(cross_dice_sum): 0.02782, Running average loss:0.02919, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 3/10:, Cur-Step: 410, loss(cross_dice_sum): 0.02756, Running average loss:0.02904, Time taken: 0:00:00.655493 ETA: 0:00:04.588454
Epoch: 4/10:, Cur-Step: 420, loss(cross_dice_sum): 0.02730, Running average loss:0.02740, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 430, loss(cross_dice_sum): 0.02706, Running average loss:0.02728, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 440, loss(cross_dice_sum): 0.02682, Running average loss:0.02716, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 450, loss(cross_dice_sum): 0.02659, Running average loss:0.02704, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 460, loss(cross_dice_sum): 0.02637, Running average loss:0.02692, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 470, loss(cross_dice_sum): 0.02616, Running average loss:0.02681, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 480, loss(cross_dice_sum): 0.02595, Running average loss:0.02670, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 490, loss(cross_dice_sum): 0.02574, Running average loss:0.02659, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 500, loss(cross_dice_sum): 0.02555, Running average loss:0.02648, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 4/10:, Cur-Step: 510, loss(cross_dice_sum): 0.02535, Running average loss:0.02638, Time taken: 0:00:00.150345 ETA: 0:00:00.902072
Epoch: 5/10:, Cur-Step: 520, loss(cross_dice_sum): 0.02517, Running average loss:0.02521, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 530, loss(cross_dice_sum): 0.02498, Running average loss:0.02512, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 540, loss(cross_dice_sum): 0.02481, Running average loss:0.02503, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 550, loss(cross_dice_sum): 0.02463, Running average loss:0.02494, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 560, loss(cross_dice_sum): 0.02446, Running average loss:0.02485, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 570, loss(cross_dice_sum): 0.02430, Running average loss:0.02477, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 580, loss(cross_dice_sum): 0.02413, Running average loss:0.02468, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 590, loss(cross_dice_sum): 0.02398, Running average loss:0.02460, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 600, loss(cross_dice_sum): 0.02382, Running average loss:0.02452, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 5/10:, Cur-Step: 610, loss(cross_dice_sum): 0.02367, Running average loss:0.02444, Time taken: 0:00:00.364646 ETA: 0:00:01.823229
Epoch: 6/10:, Cur-Step: 620, loss(cross_dice_sum): 0.02352, Running average loss:0.02353, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 630, loss(cross_dice_sum): 0.02337, Running average loss:0.02346, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 640, loss(cross_dice_sum): 0.02323, Running average loss:0.02339, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 650, loss(cross_dice_sum): 0.02309, Running average loss:0.02332, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 660, loss(cross_dice_sum): 0.02295, Running average loss:0.02325, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 670, loss(cross_dice_sum): 0.02282, Running average loss:0.02318, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 680, loss(cross_dice_sum): 0.02269, Running average loss:0.02311, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 690, loss(cross_dice_sum): 0.02256, Running average loss:0.02304, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 700, loss(cross_dice_sum): 0.02243, Running average loss:0.02297, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 710, loss(cross_dice_sum): 0.02230, Running average loss:0.02291, Time taken: 0:00:00.614153 ETA: 0:00:02.456614
Epoch: 6/10:, Cur-Step: 720, loss(cross_dice_sum): 0.02218, Running average loss:0.02284, Time taken: 0:00:01.165617 ETA: 0:00:04.662469
Epoch: 7/10:, Cur-Step: 730, loss(cross_dice_sum): 0.02206, Running average loss:0.02211, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 740, loss(cross_dice_sum): 0.02194, Running average loss:0.02205, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 750, loss(cross_dice_sum): 0.02182, Running average loss:0.02199, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 760, loss(cross_dice_sum): 0.02170, Running average loss:0.02193, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 770, loss(cross_dice_sum): 0.02159, Running average loss:0.02187, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 780, loss(cross_dice_sum): 0.02148, Running average loss:0.02182, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 790, loss(cross_dice_sum): 0.02137, Running average loss:0.02176, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 800, loss(cross_dice_sum): 0.02126, Running average loss:0.02170, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 810, loss(cross_dice_sum): 0.02115, Running average loss:0.02165, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 7/10:, Cur-Step: 820, loss(cross_dice_sum): 0.02105, Running average loss:0.02159, Time taken: 0:00:01.165617 ETA: 0:00:03.496852
Epoch: 8/10:, Cur-Step: 830, loss(cross_dice_sum): 0.02094, Running average loss:0.02097, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 840, loss(cross_dice_sum): 0.02084, Running average loss:0.02092, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 850, loss(cross_dice_sum): 0.02074, Running average loss:0.02087, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 860, loss(cross_dice_sum): 0.02064, Running average loss:0.02082, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 870, loss(cross_dice_sum): 0.02054, Running average loss:0.02077, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 880, loss(cross_dice_sum): 0.02044, Running average loss:0.02072, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 890, loss(cross_dice_sum): 0.02034, Running average loss:0.02067, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 900, loss(cross_dice_sum): 0.02025, Running average loss:0.02062, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 910, loss(cross_dice_sum): 0.02015, Running average loss:0.02057, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 8/10:, Cur-Step: 920, loss(cross_dice_sum): 0.02006, Running average loss:0.02052, Time taken: 0:00:00.285859 ETA: 0:00:00.571719
Epoch: 9/10:, Cur-Step: 930, loss(cross_dice_sum): 0.01997, Running average loss:0.01998, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 940, loss(cross_dice_sum): 0.01988, Running average loss:0.01994, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 950, loss(cross_dice_sum): 0.01979, Running average loss:0.01989, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 960, loss(cross_dice_sum): 0.01970, Running average loss:0.01985, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 970, loss(cross_dice_sum): 0.01961, Running average loss:0.01980, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 980, loss(cross_dice_sum): 0.01952, Running average loss:0.01976, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 990, loss(cross_dice_sum): 0.01944, Running average loss:0.01971, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 1000, loss(cross_dice_sum): 0.01935, Running average loss:0.01967, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 1010, loss(cross_dice_sum): 0.01927, Running average loss:0.01963, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Epoch: 9/10:, Cur-Step: 1020, loss(cross_dice_sum): 0.01919, Running average loss:0.01958, Time taken: 0:00:00.513137 ETA: 0:00:00.513137
Saving the final step model to /workspace/tlt/local_dir/tlt_unet_corrosion_resnet18/unpruned_model/weights/unet_model.tlt

This is an example of an input image with its mask image:

But all the inference results look like this image:

Have you ever run the UNet Jupyter notebook successfully? If so, can you use it as a reference?

I tried, and I registered on the ISBI challenge site, but they did not email me.
I think my problem is with the mask images and their labels.
In my masks, the foreground is white with label 1 and the background is black with label 0.
So, in the spec file, I set the foreground class label to 1 and the background to 0.
But in the default spec file in the notebook, the foreground class label is 0 and the background is 1.
Also, when I use vis_annotation.py in the Jupyter notebook to overlay the masks on the images, it masks the background.
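Before swapping label ids around, it helps to confirm which raw values the mask files actually contain, since data_class_config is assumed here to match label_id against raw mask pixel values. The helper below is a hypothetical sketch using Pillow and NumPy (find_unexpected_mask_values is not part of TLT); it lists, per file, any value that is not among the label ids passed in.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def find_unexpected_mask_values(mask_dir, label_ids):
    """Return {filename: sorted unexpected values} for PNG masks in mask_dir.

    Hypothetical helper: assumes each mask's raw pixel values should equal
    the label_id values listed in data_class_config.
    """
    allowed = set(label_ids)
    problems = {}
    for path in sorted(Path(mask_dir).glob("*.png")):
        # For palette ("P") images, np.array returns palette indices, not colors
        values = set(np.unique(np.array(Image.open(path))).tolist())
        extra = sorted(values - allowed)
        if extra:
            problems[path.name] = extra
    return problems
```

For example, find_unexpected_mask_values("/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp/masks/train", [0, 1]) returns an empty dict only if every mask contains nothing but 0 and 1.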

Also, please try a lower learning rate, for example one tenth of the previous value (1e-5 instead of 1e-4).

I tried, but the result did not change.
I noticed that when the background is white with label 0 in the spec file and the foreground is black with label 255 in the spec file, the algorithm does not return NaN, but the result is still not good. I have no idea what is happening. I suspect it is related to UNet using OpenCV internally to read the training images.
These are the results after 1500 epochs:
[Screenshot from 2021-08-12 07-56-17]

"{'foreground': {'precision': 0.6823535, 'Recall': 0.6018939, 'F1 Score': 0.6396032555448399, 'iou': 0.47015935}, 'background': {'precision': 0.5127079, 'Recall': 0.59918934, 'F1 Score': 0.5525854258624204, 'iou': 0.3817741}}"

And these are the inference results.
Overlay image:

But all the mask label images are black.

Is the segmented part the red area or the blue area?

I still have an issue with training UNet.
Please help me.

Is your training dataset public? If so, please share the link.
Did you get the ISBI dataset mentioned in the Jupyter notebook, and have you tried it yet?
Also, could you search for UNet in the TLT forum to find public datasets mentioned by other users and try one of them?

Hi Morganh. Thanks for your help.
I did not receive an email to download the ISBI dataset.
This is my dataset link:

The mask images are black and red, and I converted them to black and white. The background is black with pixel value 0 and the foreground is white with pixel value 1. In fact, I used PIL putpalette.
I think that when the images are converted to a NumPy array or read with OpenCV, the color palette is lost, so pixel value 1 becomes 255. But I saw this link, where the labels are integers and do not have my problem.
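The palette point can be checked on a tiny synthetic mask. This sketch uses Pillow and NumPy only (OpenCV is not exercised here): np.array on a palette ("P") image returns the stored indices, while convert("L") first maps each index through the palette to its color. Which of the two behaviours TLT's data loader matches is not confirmed in this thread.

```python
from io import BytesIO

import numpy as np
from PIL import Image

# Tiny 2-color palette mask: index 0 = black, index 1 = white
mask = Image.new("P", (4, 4), 0)
mask.putpalette([0, 0, 0, 255, 255, 255])
for y in range(1, 3):
    for x in range(1, 3):
        mask.putpixel((x, y), 1)  # 2x2 foreground square stored as index 1

# Round-trip through a PNG file in memory, as putpalette-based masks would be saved
buf = BytesIO()
mask.save(buf, format="PNG")
buf.seek(0)
reloaded = Image.open(buf)

indices = np.array(reloaded)             # stored palette indices: 0 and 1
gray = np.array(reloaded.convert("L"))   # palette applied first: 0 and 255
print(reloaded.mode, np.unique(indices).tolist(), np.unique(gray).tolist())
```

So the value 1 is not lost from the file itself; it only turns into 255 when a reader applies the palette before handing over pixel values.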

Could you keep the original images and run training without converting them to black and white?

I tried. But in this case, I have to set the label to 38, because the red pixel value in the mask is 38.
If I set the label to 1, the result is NaN again.
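One possible explanation for the value 38, assuming the red in the original masks is (128, 0, 0), the dark red used by PASCAL VOC-style palettes (a guess, not confirmed for this dataset), and that the mask is read as grayscale: the ITU-R 601-2 luma weights that Pillow documents for convert("L"), and that OpenCV documents for cvtColor RGB-to-gray, map that color to 38. The arithmetic in integer form:

```python
def luma_601(r, g, b):
    """Rounded ITU-R 601-2 luma: 0.299*R + 0.587*G + 0.114*B, in integer math."""
    return (r * 299 + g * 587 + b * 114 + 500) // 1000


# dark red -> 38, pure red -> 76, white -> 255
for name, rgb in [("dark red", (128, 0, 0)), ("pure red", (255, 0, 0)), ("white", (255, 255, 255))]:
    print(name, rgb, "->", luma_601(*rgb))
```

Pure red (255, 0, 0) would give 76, so a mask value of 38 is consistent with the darker red under these assumptions.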

You can just focus on the overlay image.
For "all mask labels are black", you can use the script below to generate a new PNG file for visualization.

Download your label image:
$ wget https://aws1.discourse-cdn.com/nvidia/original/3X/0/6/06218499d473df71d0ac8627c2f9721f2796ef8b.png

Then run the following:

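import cv2
from PIL import Image
import numpy as np

# Label image downloaded above; its pixel values are the label_id values (0 or 1 here)
png_file = './06218499d473df71d0ac8627c2f9721f2796ef8b.png'
img = Image.open(png_file)
arr = np.array(img)
# Multiply by 255 so label 1 becomes white (255) and label 0 stays black
cv2.imwrite("./label_image.png", arr.astype(np.uint8) * 255)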

Thank you.
Do I still have to set the label to 255?

If you are confused by the black images in the “mask_labels_tlt” folder, please use the script above to generate viewable label images.
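The script above handles one file. For a whole folder such as mask_labels_tlt, a batch version might look like the hypothetical sketch below. It writes with Pillow instead of cv2.imwrite, and both make_visible and any output folder name you pick (for example "mask_labels_vis") are illustrative, not part of TLT.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def make_visible(label_dir, out_dir, scale=255):
    """Save a viewable copy of every label PNG in label_dir into out_dir.

    Hypothetical batch version of the single-file script: label ids are
    multiplied by `scale` (255 suits a 0/1 binary mask) and clipped to 0..255.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(label_dir).glob("*.png")):
        # Widen to uint16 so the multiplication cannot wrap around in uint8
        arr = np.array(Image.open(path)).astype(np.uint16)
        vis = np.clip(arr * scale, 0, 255).astype(np.uint8)
        Image.fromarray(vis).save(out / path.name)
```

Usage: make_visible("mask_labels_tlt", "mask_labels_vis"). With more than two classes, a smaller scale such as 255 // max_label_id keeps the classes distinguishable.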


Yes, I understand. But my question is about the labels in the spec file: do I have to set the foreground label_id to 255?

It is not related to the foreground. The NumPy values in the resulting image correspond to the label_id values.
Since you set

label_id: 1

label_id: 0

and train only these two classes, the resulting PNG file contains only the values 0 and 1.

Yes, you are right. But in this case, as in the first problem I reported, training does not work properly: the validation results for the foreground class are always NaN, and the model does not train at all.
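If the goal is to keep label_id: 1 for the foreground, one option is to rewrite the masks so their raw values are exactly 0 and 1 in a plain grayscale PNG with no palette. The hypothetical helper below (remap_masks is not a TLT tool) does that with a NumPy lookup table and raises instead of guessing when it meets a value outside the mapping. Whether this removes the NaN depends on TLT reading masks the way assumed here, so treat it as a diagnostic step rather than a confirmed fix.

```python
from pathlib import Path

import numpy as np
from PIL import Image


def remap_masks(src_dir, dst_dir, mapping):
    """Rewrite every PNG mask in src_dir so pixel values follow `mapping`.

    Hypothetical helper: e.g. mapping={0: 0, 255: 1} turns 0/255 masks into
    0/1 masks matching label_id: 0 / label_id: 1 in data_class_config.
    """
    lut = np.zeros(256, dtype=np.uint8)
    for src, dst in mapping.items():
        lut[src] = dst
    out_dir = Path(dst_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.png")):
        # convert("L") applies any palette, so results are gray levels, not indices
        arr = np.array(Image.open(path).convert("L"))
        unknown = np.setdiff1d(np.unique(arr), list(mapping))
        if unknown.size:
            raise ValueError(f"{path.name}: unmapped values {unknown.tolist()}")
        Image.fromarray(lut[arr]).save(out_dir / path.name)
```

For the black-and-white palette masks, {0: 0, 255: 1} would apply; for the original red masks, {0: 0, 38: 1} if they yield 38 as reported above. Write to a new folder (the destination path is your choice) and point train_masks_path and val_masks_path at it.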

Could you share your latest working training spec file?

This is my final spec file:

random_seed: 42
model_config {
  model_input_width: 640
  model_input_height: 640
  model_input_channels: 3
  num_layers: 101
  all_projections: true
  arch: "resnet"
  freeze_blocks: 0
  freeze_blocks: 1
  use_batch_norm: True
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 2
  epochs: 300
  log_summary_steps: 10
  checkpoint_interval: 5
  loss: "cross_entropy"
  learning_rate:0.0001
  regularizer {
    type: L2
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}
dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp_unet/images/train/"
  train_masks_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp_unet/masks/train"

  val_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp_unet/images/val"
  val_masks_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp_unet/masks/val"

  test_images_path:"/workspace/tlt/tlt-experiments/all_segmentation_approaches/corrosion_dataset_temp_unet/images/test"

  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 0
    }
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 255
    }  
  }
}