Problems encountered in UNet training and inference

Hi. I trained a UNet model. Below is an example input image with its mask image; their sizes are 3840*2160*3 and 3840*2160*1.

And this is my spec file:

random_seed: 42
model_config {
  model_input_width: 960
  model_input_height: 960
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 2
  epochs: 100
  log_summary_steps: 20
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate:0.00001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "grayscale"
  train_images_path:"/workspace/fire_detection/fire/images/train"
  train_masks_path:"/workspace/fire_detection/fire/masks/train"

  val_images_path:"/workspace/fire_detection/fire/images/val"
  val_masks_path:"/workspace/fire_detection/fire/masks/val"

  test_images_path:"/workspace/fire_detection/fire/images/val_resize"

  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}

Here are my questions:

  1. When I set input_image_type: "color", the evaluation result is as follows:
"{'foreground': {'precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0, 'iou': 1.0}, 'background': {'precision': nan, 'Recall': nan, 'F1 Score': nan, 'iou': nan}}"

and when I set input_image_type: "grayscale", the evaluation result is as follows:

"{'foreground': {'precision': 0.9991926, 'Recall': 0.99150133, 'F1 Score': 0.9953321053846278, 'iou': 0.9907076}, 'background': {'precision': 0.9933318, 'Recall': 0.99936754, 'F1 Score': 0.9963404918179835, 'iou': 0.9927078}}"

What causes this difference between the two settings?

  2. When I set input_image_type: "grayscale", the evaluation result is as follows:
"{'foreground': {'precision': 0.9991926, 'Recall': 0.99150133, 'F1 Score': 0.9953321053846278, 'iou': 0.9907076}, 'background': {'precision': 0.9933318, 'Recall': 0.99936754, 'F1 Score': 0.9963404918179835, 'iou': 0.9927078}}"

I then ran inference with the same spec file as above. For a 3840*2160*3 input image, the result is as follows; its size is 960*960*1.

The result is very poor. I am confused and would appreciate your help.

This is an example input image; its size is 3840*2160*3.

This is the mask corresponding to the image above; its size is 3840*2160*1.

Will check it. Is your dataset a public one?

Yes, the dataset link is https://ieee-dataport.org/open-access/flame-dataset-aerial-imagery-pile-burn-detection-using-drones-uavs/embed.
The images of the dataset are in zip file 9.
In zip file 10, the background pixels have value 0 and the foreground pixels have value 1. I changed the background to white (255) and the foreground to black (0) to get the masks for my dataset.

The above is how I constructed my dataset.
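
For reference, the conversion described above amounts to a per-pixel value remapping. Here is a minimal sketch, assuming 8-bit single-channel masks passed in as raw bytes. `remap_mask` is a hypothetical helper, not part of TLT/TAO, and reading or writing the image files is left out.

```python
def remap_mask(pixels: bytes) -> bytes:
    """Map original mask values (0 = background, 1 = foreground) to 255 and 0."""
    # Fail loudly if the mask holds anything other than the two expected values.
    unexpected = set(pixels) - {0, 1}
    if unexpected:
        raise ValueError(f"unexpected mask values: {sorted(unexpected)}")
    # 256-entry lookup table: 0 -> 255, 1 -> 0, every other value unchanged.
    table = bytes([255, 0]) + bytes(range(2, 256))
    return pixels.translate(table)


original = bytes([0, 0, 1, 1, 0])
converted = remap_mask(original)
print(list(converted))         # [255, 255, 0, 0, 255]
print(sorted(set(converted)))  # [0, 255]
```

Printing the set of distinct values in a mask is also a quick way to compare what is actually on disk with the label_id entries in the spec file.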

Thanks for the info. I will try it.

Can you train with the original mask labels, without any modification?

Yes, the result of training with the original mask labels, without any modification, is as follows:

"{'no_fire': {'precision': 0.9961964, 'Recall': 1.0, 'F1 Score': 0.9980945708436912, 'iou': 0.9961964}, 'fire': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

Did you train with TLT 3.0-dp-py3 or TLT 3.0-py3?
Is there a training log you can share?

My tlt info is:

ubuntu_1804:~$ tlt info
Configuration of the TLT Instance
dockers: ['nvidia/tlt-streamanalytics', 'nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

The training log for input_image_type: "grayscale" is as follows:
output.log (37.9 KB)

Please

  1. Update to the latest docker, which was released yesterday. See "Migrating to TAO Toolkit" in the TAO Toolkit 3.0 documentation, and NVIDIA NGC.
  2. Use input_image_type: "color", since the fire images are color images.
  3. Do not modify the mask images. Just use the default ones from the public dataset.
  4. Train with the spec below.

random_seed: 42
model_config {
  model_input_width: 960
  model_input_height: 544
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}
training_config {
  batch_size: 2
  epochs: 500
  log_summary_steps: 20
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate:0.00001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}
dataset_config {
  dataset: "custom"
  augment: true
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path:"/workspace/fire_detection/fire/images/train"
  train_masks_path:"/workspace/fire_detection/fire/masks/train"

  val_images_path:"/workspace/fire_detection/fire/images/val"
  val_masks_path:"/workspace/fire_detection/fire/masks/val"

  test_images_path:"/workspace/fire_detection/fire/images/val_resize"
  data_class_config {
    target_classes {
      name: "fire"
      mapping_class: "fire"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}
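
A note on the 960x544 input size in the spec above: the source images are 3840*2160 (16:9), so a 960-wide input keeps the aspect ratio at a height of 540, and 544 is the next multiple of 16. Below is a minimal sketch of that arithmetic. The divisible-by-16 requirement is an assumption here (check the UNet documentation for the exact constraint), and `aligned_height` is a hypothetical helper.

```python
import math


def aligned_height(src_w: int, src_h: int, dst_w: int, multiple: int = 16) -> int:
    """Height for dst_w that keeps the source aspect ratio, rounded up to a multiple."""
    exact = dst_w * src_h / src_w  # 960 * 2160 / 3840 = 540.0
    return math.ceil(exact / multiple) * multiple


print(aligned_height(3840, 2160, 960))  # 544
```

By comparison, a 960x960 input squeezes a 16:9 frame into a square, which compresses the image horizontally by a factor of about 1.78 relative to the vertical axis.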

Thanks, but when I train the model as you described, I get this error:

tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

The training log is output.log (5.4 KB)

I did not hit this issue. It seems you get NaN even in the first epoch.
Is this the first time you have seen it during training?

Yes, this is the first time I have seen this issue, and all the settings are the same as you described.

ubuntu_1804:~$ tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

Can you try again? Please use a new result folder.

Yes, I tried again with a new result folder, and I got the following results_tlt.json and output.log:

"{'fire': {'precision': 0.99351597, 'Recall': 1.0, 'F1 Score': 0.9967474393080977, 'iou': 0.99351597}, 'background': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

output.log (100.2 KB)

The results obtained are not satisfactory.

Can you try to run inference via "tlt unet inference"?

I uninstalled nvidia-tlt before installing nvidia-tao, so the tlt command is no longer available:

ubuntu_1804:~$ tlt unet inference --gpu_index=0 -e /workspace/fire_detection/fire/specs/unet_train_resnet_unet_isbi.txt -m /workspace/fire_detection/result/model.step-13300.tlt -o /workspace/fire_detection/result/ -k nvidia_tlt830

Command 'tlt' not found, did you mean:

  command 'tt' from deb treetop
  command 'tlf' from deb tlf
  command 'lt' from deb looptools
  command 'tgt' from deb tcm
  command 'llt' from deb storebackup
  command 'tla' from deb tla
  command 'tst' from deb pvm-examples
  command 'slt' from deb slt
  command 'tlp' from deb tlp
  command 'tilt' from deb ruby-tilt

Try: apt install <deb name>

If you have already updated to tao, you can run "tao unet inference".

I ran tao unet inference, but the result is poor.

I also ran tao unet evaluate, and the resulting results_tlt.json is:

"{'fire': {'precision': 0.99351597, 'Recall': 1.0, 'F1 Score': 0.9967474393080977, 'iou': 0.99351597}, 'background': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

For the background class, the precision is nan and the iou is 0.
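
For what it's worth, this pattern (precision nan, recall 0 and iou 0 for one class, recall 1.0 for the other) is what the textbook definitions produce when every pixel is assigned to a single class. Here is a minimal sketch with those definitions. Whether tao unet evaluate computes the metrics exactly this way is an assumption, and `class_metrics` is a hypothetical helper.

```python
import math


def class_metrics(y_true, y_pred, label):
    """Per-class precision, recall, F1 and IoU from flat sequences of label ids."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)

    def ratio(num, den):
        # A zero denominator means the metric is undefined; report nan.
        return num / den if den else math.nan

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    iou = ratio(tp, tp + fp + fn)
    return {"precision": precision, "Recall": recall, "F1 Score": f1, "iou": iou}


# 99 pixels of class 0, 1 pixel of class 1; the model predicts class 0 everywhere.
y_true = [0] * 99 + [1]
y_pred = [0] * 100
print(class_metrics(y_true, y_pred, 0))
print(class_metrics(y_true, y_pred, 1))
```

Class 1 is never predicted, so its precision is 0/0 (nan) while its recall and iou are 0. The precision of class 0 then simply equals the fraction of pixels that belong to class 0, so a value above 0.99 here reflects class imbalance rather than a good segmentation.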