Problems encountered in training unet and inference unet

tiaotiaowang226 · August 23, 2021, 8:58am

Hi. I trained a unet model. Below is an example of an input image and its mask image; their sizes are 3840*2160*3 and 3840*2160*1.

And this is my spec file:

random_seed: 42
model_config {
  model_input_width: 960
  model_input_height: 960
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}

training_config {
  batch_size: 2
  epochs: 100
  log_summary_steps: 20
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate:0.00001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "grayscale"
  train_images_path:"/workspace/fire_detection/fire/images/train"
  train_masks_path:"/workspace/fire_detection/fire/masks/train"

  val_images_path:"/workspace/fire_detection/fire/images/val"
  val_masks_path:"/workspace/fire_detection/fire/masks/val"

  test_images_path:"/workspace/fire_detection/fire/images/val_resize"

  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}
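
One thing worth checking in the spec above: it sets model_input_channels: 3 together with input_image_type: "grayscale". Below is a minimal sketch of a consistency check, assuming that color inputs pair with 3 channels and grayscale inputs with 1. That pairing is an assumption and is not stated in this thread. `check_channels` is a hypothetical helper, not part of TAO.

```python
# Assumed pairing of image type to channel count (not confirmed in this thread).
EXPECTED_CHANNELS = {"color": 3, "grayscale": 1}


def check_channels(input_image_type, model_input_channels):
    """Return True when the channel count matches the assumed pairing."""
    return EXPECTED_CHANNELS[input_image_type] == model_input_channels


# Values taken from the spec file above.
print(check_channels("grayscale", 3))  # mismatch under the assumed pairing
print(check_channels("color", 3))      # consistent under the assumed pairing
```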

Here are my questions:

When I set input_image_type: "color", the evaluation result is as follows:

"{'foreground': {'precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0, 'iou': 1.0}, 'background': {'precision': nan, 'Recall': nan, 'F1 Score': nan, 'iou': nan}}"

When I set input_image_type: "grayscale", the evaluation result is as follows:

"{'foreground': {'precision': 0.9991926, 'Recall': 0.99150133, 'F1 Score': 0.9953321053846278, 'iou': 0.9907076}, 'background': {'precision': 0.9933318, 'Recall': 0.99936754, 'F1 Score': 0.9963404918179835, 'iou': 0.9927078}}"

What causes this difference?

With input_image_type: "grayscale" (evaluation result shown above), I then ran inference using the same spec file. For a 3840*2160*3 input image, the result has size 960*960*1.

The result is very poor. I am very confused and hope to get your help.

tiaotiaowang226 · August 23, 2021, 9:02am

This is an example of an input image; its size is 3840*2160*3.

tiaotiaowang226 · August 23, 2021, 9:05am

This is the mask corresponding to the image above; its size is 3840*2160*1.

Morganh · August 23, 2021, 2:42pm

Will check it. Is your dataset a public one?

tiaotiaowang226 · August 24, 2021, 1:49am

Yes, the dataset link is https://ieee-dataport.org/open-access/flame-dataset-aerial-imagery-pile-burn-detection-using-drones-uavs/embed.
The images of the dataset are in zip file 9.
In zip file 10, the background has pixel value 0 and the foreground has pixel value 1. I changed the background to white (pixel value 255) and the foreground to black (pixel value 0) to get the masks for my dataset.

The above is the construction process of my dataset.
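
The mask remapping described above can be sketched as follows. This is a minimal illustration, assuming the mask has already been loaded as a flat sequence of 8-bit pixel values; reading and writing the image files is left out, and `remap_mask` is a hypothetical helper.

```python
def remap_mask(pixels, mapping):
    """Remap 8-bit mask values in one pass using a 256-entry lookup table.

    Values not listed in `mapping` are left unchanged.
    """
    table = bytes(mapping.get(value, value) for value in range(256))
    return pixels.translate(table)


# Toy mask in the original convention: background 0, foreground 1.
original = bytes([0, 0, 1, 1, 0])

# Modified convention from the post: background 255, foreground 0.
remapped = remap_mask(original, {0: 255, 1: 0})

print(sorted(set(original)))  # distinct values before remapping
print(sorted(set(remapped)))  # distinct values after remapping
```

Listing the distinct pixel values of a mask this way is a quick method for seeing which labelling convention a given file actually uses.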

Morganh · August 24, 2021, 4:58am

Thanks for the info. I will try it.

Morganh · August 25, 2021, 2:34am

Can you train with original mask label without any modification?

tiaotiaowang226 · August 25, 2021, 5:51am

Yes, the result of training with the original mask labels without any modification is as follows:

"{'no_fire': {'precision': 0.9961964, 'Recall': 1.0, 'F1 Score': 0.9980945708436912, 'iou': 0.9961964}, 'fire': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

Morganh · August 26, 2021, 7:29am

Did you train with TLT 3.0-dp-py3 or TLT 3.0-py3?
Is there a training log you can share?

tiaotiaowang226 · August 26, 2021, 7:45am

My tlt info is:

ubuntu_1804:~$ tlt info
Configuration of the TLT Instance
dockers: ['nvidia/tlt-streamanalytics', 'nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

The training log of input_image_type: “grayscale” is as follows:
output.log (37.9 KB)

Morganh · August 26, 2021, 8:51am

Please:

1. Update to the latest docker, which was released yesterday. See "Migrating from TAO Toolkit 3.x to TAO Toolkit 4.0 - NVIDIA Docs" and "Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC".
2. Use input_image_type: "color", since the fire images are color.
3. Do not modify the mask images. Just use the default ones in the public dataset.
4. Train with the spec below.

random_seed: 42
model_config {
  model_input_width: 960
  model_input_height: 544
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}
training_config {
  batch_size: 2
  epochs: 500
  log_summary_steps: 20
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate:0.00001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}
dataset_config {
  dataset: "custom"
  augment: true
  augmentation_config {
    spatial_augmentation {
      hflip_probability : 0.5
      vflip_probability : 0.5
      crop_and_resize_prob : 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path:"/workspace/fire_detection/fire/images/train"
  train_masks_path:"/workspace/fire_detection/fire/masks/train"

  val_images_path:"/workspace/fire_detection/fire/images/val"
  val_masks_path:"/workspace/fire_detection/fire/masks/val"

  test_images_path:"/workspace/fire_detection/fire/images/val_resize"
  data_class_config {
    target_classes {
      name: "fire"
      mapping_class: "fire"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}
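
The spec above changes the model input from 960x960 to 960x544, which is close to the 16:9 aspect ratio of the 3840*2160 source images. Below is a minimal sketch of that arithmetic, assuming the height is rounded up to a multiple of 16. The rounding rule is an assumption, not something stated in this thread, and `scaled_height` is a hypothetical helper.

```python
import math


def scaled_height(src_width, src_height, dst_width, multiple=16):
    """Height that keeps the source aspect ratio at dst_width,
    rounded up to the nearest multiple (assumed here to be 16)."""
    exact = src_height * dst_width / src_width
    return math.ceil(exact / multiple) * multiple


# 3840x2160 source scaled to a width of 960: exact height is 540.
print(scaled_height(3840, 2160, 960))  # 544
```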

tiaotiaowang226 · August 27, 2021, 6:24am

Thanks, but when I train the model just as you said, I get the error:

tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

The training log is output.log (5.4 KB)

Morganh · August 27, 2021, 7:40am

I did not hit this issue. It seems you get NaN even in the first epoch.
Is this the first time you have seen it during training?

tiaotiaowang226 · August 27, 2021, 9:29am

Yes, this is the first time I have hit this issue, and all the settings are the same as you suggested.

ubuntu_1804:~$ tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

Morganh · August 27, 2021, 4:07pm

Can you try again? Please use a new result folder.

tiaotiaowang226 · August 30, 2021, 2:51am

Yes, I tried again with a new result folder, and I got the following results_tlt.json and output.log:

"{'fire': {'precision': 0.99351597, 'Recall': 1.0, 'F1 Score': 0.9967474393080977, 'iou': 0.99351597}, 'background': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

output.log (100.2 KB)

The results obtained are not satisfactory.

Morganh · August 30, 2021, 6:33am

Can you try to run inference via "tlt unet inference"?

tiaotiaowang226 · August 30, 2021, 6:53am

I uninstalled nvidia-tlt before installing nvidia-tao.

ubuntu_1804:~$ tlt unet inference --gpu_index=0 -e /workspace/fire_detection/fire/specs/unet_train_resnet_unet_isbi.txt -m /workspace/fire_detection/result/model.step-13300.tlt -o /workspace/fire_detection/result/ -k nvidia_tlt830

Command 'tlt' not found, did you mean:

  command 'tt' from deb treetop
  command 'tlf' from deb tlf
  command 'lt' from deb looptools
  command 'tgt' from deb tcm
  command 'llt' from deb storebackup
  command 'tla' from deb tla
  command 'tst' from deb pvm-examples
  command 'slt' from deb slt
  command 'tlp' from deb tlp
  command 'tilt' from deb ruby-tilt

Try: apt install <deb name>

Morganh · August 30, 2021, 7:27am

If you have already updated to tao, you can run "tao unet inference".
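
Because the `tlt` launcher was uninstalled, the shell can no longer find it. Below is a minimal sketch for checking which launcher is on PATH before running a command. `find_launcher` is a hypothetical helper; only the command names `tao` and `tlt` come from this thread.

```python
import shutil


def find_launcher(candidates=("tao", "tlt")):
    """Return the first candidate command found on PATH, or None."""
    for name in candidates:
        if shutil.which(name) is not None:
            return name
    return None


launcher = find_launcher()
if launcher is None:
    print("Neither 'tao' nor 'tlt' was found on PATH")
else:
    print(f"Use: {launcher} unet inference ...")
```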

tiaotiaowang226 · August 30, 2021, 7:42am

I ran tao unet inference, but the result is poor.

I also ran tao unet evaluate, and the resulting results_tlt.json is:

"{'fire': {'precision': 0.99351597, 'Recall': 1.0, 'F1 Score': 0.9967474393080977, 'iou': 0.99351597}, 'background': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

The precision for background is nan, and its iou is 0.
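
The pattern in the result above (precision nan, Recall 0.0, iou 0.0 for one class) is what the standard per-class definitions give when no pixel is predicted as that class. Below is a minimal sketch using the textbook formulas. It is not TAO's own evaluation code, and `class_metrics` and the pixel counts are hypothetical.

```python
import math


def class_metrics(tp, fp, fn):
    """Per-class metrics from pixel counts; a zero denominator yields nan."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else math.nan

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    iou = ratio(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Class present in the masks but never predicted:
# precision and f1 are nan, recall and iou are 0.0.
never_predicted = class_metrics(tp=0, fp=0, fn=100)
print(never_predicted)

# Class absent from both the masks and the predictions: every metric is nan.
absent = class_metrics(tp=0, fp=0, fn=0)
print(absent)
```

Under these definitions, the all-nan row in the first "color" result would correspond to a class that appears in neither the labels nor the predictions, which points at how the mask values are being read rather than at the metric itself.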
