Problems encountered in UNet training and inference

Did you train with TLT 3.0-dp-py3 or TLT 3.0-py3?
Is there a training log you can share?

My tlt info output is:

ubuntu_1804:~$ tlt info
Configuration of the TLT Instance
dockers: ['nvidia/tlt-streamanalytics', 'nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

The training log with input_image_type: "grayscale" is as follows:
output.log (37.9 KB)

Please:

  1. Update to the latest docker, which was released yesterday. See "Migrating to TAO Toolkit" in the TAO Toolkit 3.0 documentation, and NVIDIA NGC.
  2. Use input_image_type: "color", since the fire images are color.
  3. Do not modify the mask images. Just use the default ones in the public dataset.
  4. Train with the spec below.

random_seed: 42
model_config {
  model_input_width: 960
  model_input_height: 544
  model_input_channels: 3
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}
training_config {
  batch_size: 2
  epochs: 500
  log_summary_steps: 20
  checkpoint_interval: 1
  loss: "cross_dice_sum"
  learning_rate: 0.00001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}
dataset_config {
  dataset: "custom"
  augment: true
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path: "/workspace/fire_detection/fire/images/train"
  train_masks_path: "/workspace/fire_detection/fire/masks/train"

  val_images_path: "/workspace/fire_detection/fire/images/val"
  val_masks_path: "/workspace/fire_detection/fire/masks/val"

  test_images_path: "/workspace/fire_detection/fire/images/val_resize"
  data_class_config {
    target_classes {
      name: "fire"
      mapping_class: "fire"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 1
    }
  }
}

Thanks, but when I train the model exactly as you described, I get this error:

tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

The training log is output.log (5.4 KB)

I did not hit this issue. It seems you get NaN even in the first epoch.
Is this the first time you have seen it during training?

Yes, this is the first time I have seen this issue, and all the settings are the same as you specified.

ubuntu_1804:~$ tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

Can you try again? Please use a new result folder.

Yes, I tried again with a new result folder, and got the following results_tlt.json and output.log:

"{'fire': {'precision': 0.99351597, 'Recall': 1.0, 'F1 Score': 0.9967474393080977, 'iou': 0.99351597}, 'background': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

output.log (100.2 KB)

The results obtained are not satisfactory.

Can you try to run inference via "tlt unet inference"?

I uninstalled nvidia-tlt before installing nvidia-tao.

ubuntu_1804:~$ tlt unet inference --gpu_index=0 -e /workspace/fire_detection/fire/specs/unet_train_resnet_unet_isbi.txt -m /workspace/fire_detection/result/model.step-13300.tlt -o /workspace/fire_detection/result/ -k nvidia_tlt830

Command 'tlt' not found, did you mean:

  command 'tt' from deb treetop
  command 'tlf' from deb tlf
  command 'lt' from deb looptools
  command 'tgt' from deb tcm
  command 'llt' from deb storebackup
  command 'tla' from deb tla
  command 'tst' from deb pvm-examples
  command 'slt' from deb slt
  command 'tlp' from deb tlp
  command 'tilt' from deb ruby-tilt

Try: apt install <deb name>

If you have already updated to TAO, you can run "tao unet inference".

I ran tao unet inference, but the result is poor.

I ran tao unet evaluate, and the resulting results_tlt.json is:

"{'fire': {'precision': 0.99351597, 'Recall': 1.0, 'F1 Score': 0.9967474393080977, 'iou': 0.99351597}, 'background': {'precision': nan, 'Recall': 0.0, 'F1 Score': nan, 'iou': 0.0}}"

For the background class, the precision is nan and the IoU is 0.
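The pattern in this result (recall of 1.0 for one class, nan precision and zero IoU for the other) is what per-class metrics look like when every pixel is predicted as a single class. Below is a minimal Python sketch, assuming the usual definitions of precision, recall, F1 and IoU from true-positive, false-positive and false-negative pixel counts. The pixel counts and the class_metrics helper are hypothetical, chosen only for illustration; this is not the TAO evaluation code.

```python
import math


def safe_div(num, den):
    """Return num / den, or nan when the denominator is zero."""
    return num / den if den != 0 else float("nan")


def class_metrics(tp, fp, fn):
    """Per-class metrics from pixel counts (standard definitions)."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    iou = safe_div(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Hypothetical ground truth: 99,352 pixels of one class, 648 of the other.
# Hypothetical prediction: the model labels every pixel as the first class.
fire = class_metrics(tp=99_352, fp=648, fn=0)
background = class_metrics(tp=0, fp=0, fn=648)

print("fire:", fire)
print("background:", background)

# No pixel was predicted as background, so its precision is 0 / 0.
print("background precision is nan:", math.isnan(background["precision"]))
```

If the metrics in results_tlt.json follow these definitions, a nan precision for one class suggests the model never predicts that class at all, which is consistent with inference showing no change between classes.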

Could you please check all the images? Is there any detection for the fire?

Using the same images and masks, a model trained in PyTorch reaches good accuracy, but the results from training with TLT or TAO are bad. I don't know what went wrong.

I checked all the images; no fire is detected in any of them.

Thanks for the info. Will check further.

Thanks, is there any progress now?

We are still working on this fire dataset and running experiments. With the spec file above, one of our internal engineers can train successfully and get the correct inference result, but strangely another one cannot. So we are still checking.

Please try the solution below, which works on my side.
Convert the training images from jpg to png.

$ for i in *.jpg ; do convert "$i" "${i%.*}.png" ; done

With this change, the fire is detected during inference, and there is no NaN issue in evaluation.
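Note that the convert loop writes .png files next to the originals and does not delete the .jpg files, so it may be worth checking the folders before retraining. Here is a minimal Python sketch, assuming images and masks are paired by file stem (an assumption about this dataset's layout). check_pairs is a hypothetical helper, and the directories are the ones from the spec earlier in this thread.

```python
from pathlib import Path


def check_pairs(images_dir, masks_dir):
    """Return (leftover .jpg image names, .png image names with no mask)."""
    images = [p for p in Path(images_dir).iterdir() if p.is_file()]
    mask_stems = {p.stem for p in Path(masks_dir).iterdir() if p.is_file()}

    # Images that still have a JPEG extension after the conversion.
    leftover_jpg = sorted(
        p.name for p in images if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    # Converted images whose stem has no matching mask file.
    missing_masks = sorted(
        p.name
        for p in images
        if p.suffix.lower() == ".png" and p.stem not in mask_stems
    )
    return leftover_jpg, missing_masks


if __name__ == "__main__":
    images_dir = "/workspace/fire_detection/fire/images/train"
    masks_dir = "/workspace/fire_detection/fire/masks/train"
    # Only run the check when both folders exist on this machine.
    if Path(images_dir).is_dir() and Path(masks_dir).is_dir():
        leftover, missing = check_pairs(images_dir, masks_dir)
        print("leftover .jpg images:", leftover)
        print(".png images without a mask:", missing)
```

Both lists being empty would indicate that the training folder contains only .png images and that each one has a mask.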

In addition, after converting the training .jpg files to .png files, you can also use the parameters below.

  • loss: "cross_entropy"
  • weight: 2e-06
  • crop_and_resize_prob : 0.01
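For anyone editing several spec files, the three values listed above could be applied with a small script. This is a minimal Python sketch, not a TAO tool: apply_overrides is a hypothetical helper, and it assumes each field appears exactly once in the spec, on its own line, as in the spec posted earlier in this thread (where weight is the regularizer weight).

```python
import re

# Values from the reply above; keys are field names from the posted spec.
OVERRIDES = {
    "loss": '"cross_entropy"',
    "weight": "2e-06",
    "crop_and_resize_prob": "0.01",
}


def apply_overrides(spec_text, overrides):
    """Replace the value of each 'key: value' line, keeping indentation."""
    for key, value in overrides.items():
        pattern = re.compile(
            rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE
        )
        spec_text, count = pattern.subn(
            lambda m, v=value: m.group(1) + v, spec_text
        )
        # Refuse to guess if the field is missing or ambiguous.
        if count != 1:
            raise ValueError(f"expected exactly one '{key}' line, found {count}")
    return spec_text


# Small fragment in the same style as the spec, for demonstration only.
fragment = (
    "training_config {\n"
    '  loss: "cross_dice_sum"\n'
    "  regularizer {\n"
    "    type: L2\n"
    "    weight: 2e-5\n"
    "  }\n"
    "}\n"
    "crop_and_resize_prob : 0.5\n"
)
print(apply_overrides(fragment, OVERRIDES))
```

The function raises an error instead of editing when a field is missing or appears more than once, so a spec with a different layout is left untouched.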

Thanks, it works now.