Training multi-class UNet does not converge

I am trying to train a multi-class semantic segmentation network using Transfer Learning Toolkit 3.0, UNet and the BDD100K dataset. The dataset contains 10000 jpeg images (7k/1k/2k train/val/test split). The image size is 1280x720, and the images (except the test set) are associated with 8-bit single-channel png images that contain ground truth segmentations. There are 19 classes plus “unknown”, as described here: Label Format — BDD100K documentation

My problem is that the training always stops prematurely as the loss becomes NaN:
ERROR:tensorflow:Model diverged with loss = NaN.

Below is by far the most accurate inference result that I have seen so far. Green is sky, blue is road, and the red pixels belong to the “unknown” class. (I’m not sure if I’m allowed to post the original image here, but you can probably imagine that the upper part should be sky, the lower part should be road, and on the sides there should be vegetation.) This result corresponds to step 4375 of training with the spec file below. The loss decreases slightly at the very beginning of the training, and after about step 5000 it starts to increase, eventually leading to the failure.

random_seed: 42
dataset_config {
  augment: true
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "/bdd100k/images/10k/train"
  train_masks_path: "/bdd100k/labels/sem_seg/masks/train"
  val_images_path: "/bdd100k/images/10k/val"
  val_masks_path: "/bdd100k/labels/sem_seg/masks/val"
  test_images_path: "/bdd100k/images/10k/val"
  data_class_config {
    target_classes {
      name: "road"
      mapping_class: "road"
    }
    target_classes {
      name: "sidewalk"
      label_id: 1
      mapping_class: "sidewalk"
    }
    target_classes {
      name: "building"
      label_id: 2
      mapping_class: "building"
    }
    target_classes {
      name: "wall"
      label_id: 3
      mapping_class: "wall"
    }
    target_classes {
      name: "fence"
      label_id: 4
      mapping_class: "fence"
    }
    target_classes {
      name: "pole"
      label_id: 5
      mapping_class: "pole"
    }
    target_classes {
      name: "traffic light"
      label_id: 6
      mapping_class: "traffic light"
    }
    target_classes {
      name: "traffic sign"
      label_id: 7
      mapping_class: "traffic sign"
    }
    target_classes {
      name: "vegetation"
      label_id: 8
      mapping_class: "vegetation"
    }
    target_classes {
      name: "terrain"
      label_id: 9
      mapping_class: "terrain"
    }
    target_classes {
      name: "sky"
      label_id: 10
      mapping_class: "sky"
    }
    target_classes {
      name: "person"
      label_id: 11
      mapping_class: "person"
    }
    target_classes {
      name: "rider"
      label_id: 12
      mapping_class: "rider"
    }
    target_classes {
      name: "car"
      label_id: 13
      mapping_class: "car"
    }
    target_classes {
      name: "truck"
      label_id: 14
      mapping_class: "truck"
    }
    target_classes {
      name: "bus"
      label_id: 15
      mapping_class: "bus"
    }
    target_classes {
      name: "train"
      label_id: 16
      mapping_class: "train"
    }
    target_classes {
      name: "motorcycle"
      label_id: 17
      mapping_class: "motorcycle"
    }
    target_classes {
      name: "bicycle"
      label_id: 18
      mapping_class: "bicycle"
    }
    target_classes {
      name: "unknown"
      label_id: 255
      mapping_class: "unknown"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.25
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 50
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  model_input_height: 288
  model_input_width: 512
  model_input_channels: 3
}
training_config {
  batch_size: 8
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 1.0000000116860974e-07
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 5
  log_summary_steps: 175
  learning_rate: 9.999999974752427e-07
  loss: "cross_entropy"
  epochs: 240
}
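
As a rough sanity check on where step 4375 falls in the training, the step count can be converted to epochs. This is only a sketch under assumptions: it uses the 7k training images and batch_size: 8 from this post, and it assumes single-GPU training where one step consumes one batch. The function name is illustrative.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of training steps needed to see every image once."""
    return math.ceil(num_images / batch_size)


# Values taken from this post: 7000 training images, batch_size 8
spe = steps_per_epoch(7000, 8)
# Step at which the inference result above was produced
epoch_of_step = 4375 / spe
print(spe, epoch_of_step)
```

Under these assumptions, step 4375 is the end of epoch 5, which is consistent with checkpoint_interval: 5. The result above would then come from the first saved checkpoint of a 240-epoch schedule.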

The initial weights are downloaded from NGC: NVIDIA NGC

The command line inside the container is:

unet train --use_amp \
  -e /workspace/spec_file.txt \
  -m /workspace/tlt_semantic_segmentation_vresnet50/resnet_50.hdf5 \
  -r /workspace/output \
  -k my_key

I have tried at least the following adjustments without achieving convergence:

  • Different architectures (resnet_18, resnet_50, efficientnet_b0_swish)
  • Different network input sizes between 224x224 and 1280x720
  • Different batch sizes between 1 and 8
  • Batch normalization on and off
  • Different learning rates between 1e-08 and 1e-04
  • Different regularization weights between 0 and 1e-03
  • Converting input images from jpeg to png before training
  • Manually resizing the input images and labels to match the network input
  • Shifting the integer labels so that “unknown” becomes 0 and all other classes become positive integers starting from 1

Moreover, I am unable to reproduce the results, because the results are different every time even though random_seed in the spec file is fixed. Is this normal?

I also don’t understand the purpose of mapping_class in target_classes in the spec file. PNG labels can only contain integers, not strings, but mapping_class is set to strings in the documentation.

Additional information:

From UNET — Transfer Learning Toolkit 3.0 documentation

  • mapping_class (string): The name of the mapping class for the target class. For example, “car” can be mapped to “vehicle”. If the class needs to be trained as is, then name and mapping_class should be the same.
  • label_id (int): The pixel that belongs to this target class is assigned this label_id value in the mask image.
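
One way to read these two fields: label_id is the integer stored in the PNG mask, while name and mapping_class are only human-readable names that decide which classes are trained together. The sketch below illustrates that reading. It is not TLT's implementation, and build_label_lookup and the example class list are hypothetical.

```python
def build_label_lookup(target_classes):
    """Map each mask pixel value (label_id) to a training class index.

    Classes that share a mapping_class get the same training index.
    """
    train_ids = {}  # mapping_class name -> training index
    lookup = {}     # label_id in the mask -> training index
    for cls in target_classes:
        # Assign the next free index the first time a mapping_class is seen
        train_id = train_ids.setdefault(cls["mapping_class"], len(train_ids))
        lookup[cls["label_id"]] = train_id
    return lookup, train_ids


# Hypothetical example: "car" and "truck" are both trained as "vehicle"
example_classes = [
    {"name": "road", "label_id": 0, "mapping_class": "road"},
    {"name": "car", "label_id": 13, "mapping_class": "vehicle"},
    {"name": "truck", "label_id": 14, "mapping_class": "vehicle"},
]
lookup, train_ids = build_label_lookup(example_classes)
print(lookup)
print(train_ids)
```

In this reading the mask files never contain strings. The strings only tell the trainer which integer label_ids belong to the same output class.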

Does UNet in TLT 3.0 support multiple (i.e., more than two) classes?

Yes, it can.
See this user's case: Different result between tlt-infer and trt engine unet segmentation model

  1. How many classes are there in the mask label images? You can use np.unique to check.
    What are the pixel values?
    If there are 19 classes, label_id should be in the range 0-18.
    Also, label_id should start from 0.
  2. Make sure the mask images are grayscale.
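
The two checks above can be sketched in Python as follows. This is a minimal sketch, not an official tool: check_mask is a hypothetical helper, and the synthetic array stands in for a mask that would normally be loaded from disk (for example with cv.imread(file, cv.IMREAD_UNCHANGED)).

```python
import numpy as np


def check_mask(mask, num_classes):
    """Return a list of problems found in one label mask (empty if none)."""
    problems = []
    # A grayscale mask loads as a 2-D array; a color mask has a third axis
    if mask.ndim != 2:
        problems.append(
            "mask has shape {}, expected a 2-D grayscale array".format(mask.shape)
        )
    # np.unique lists every distinct pixel value present in the mask
    values = np.unique(mask)
    bad = values[(values < 0) | (values >= num_classes)]
    if bad.size:
        problems.append(
            "label values outside 0..{}: {}".format(num_classes - 1, bad.tolist())
        )
    return problems


# Synthetic stand-in for a BDD100K mask that still contains the value 255
mask = np.zeros((4, 4), dtype=np.uint8)
mask[0, 0] = 255
print(check_mask(mask, num_classes=20))
```

Running this over every mask in the training and validation folders would show whether any file violates the contiguous 0..N-1 requirement.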

That looks good. I will try the Mapillary Vistas dataset a bit later, but meanwhile I used the Resnet18 spec file from that topic to train with the BDD100k dataset that I have. The loss became NaN already during the second epoch. Here is the spec file and the corresponding output:
spec.txt (2.8 KB)
output.log (2.1 KB)

There are 20 classes. Of course not every image has all 20 classes in it. Pixel values in the original masks are 0-18 as well as 255, as described in the documentation that I linked earlier. I changed all 255 values to 19 so that the numbering is contiguous. I used the following Python script to keep the PNG images single-channel:

# Warning: This will overwrite existing masks, so please have a backup
import cv2 as cv
for file in list_of_mask_filenames:
    image = cv.imread(file, cv.IMREAD_UNCHANGED)
    image[image==255] = 19
    cv.imwrite(file, image)

I used the file program in Linux to verify that the images are still gray. As an example:

/bdd100k/labels/sem_seg/masks/val_255to19/7ee5d536-808b2dd5.png: PNG image data, 1280 x 720, 8-bit grayscale, non-interlaced

After changing all 255 values to 19, the training still encounters NaN. Here is the spec file and the output:
spec_255to19.txt (2.8 KB)
output_255to19.log (43.7 KB)

As we can see, the training went much longer after the 255->19 adjustment, but I’m not sure if it is actually because of the adjustment. As I said earlier, the results are not reproducible even if I keep everything fixed.

I’m now re-running the previous experiment, this time with automatic mixed precision disabled, and it seems to work! At least the losses during the first ten epochs were much lower than ever before. Is this a known issue? Here are the full details of the system:

For the above experiment, how does the inference result look?

The inference result is not perfect, but at least it is very different. Next, I will re-run the experiment from the first post of this topic, but without automatic mixed precision.

The above image is the inference result with an intermediate epoch’s tlt model, right?
BTW, how did you set AMP in the first post of this topic? Can you share the command?

Please update to latest docker which is released yesterday. See Migrating to TAO Toolkit — TAO Toolkit 3.0 documentation

That’s correct. I didn’t let the training finish, because the result looked buggy. However, after some further analysis, I think the training was actually going in the right direction. The above image looks buggy because there seems to be a bug in the visualization process when the aspect ratio of the network input (and thus the output mask) differs from the aspect ratio of the test image. It is not well documented how the result masks should be interpreted in cases like this, but it seems that the mask always contains the relevant content in the middle, and the boundaries are padded with zeros if necessary. When the mask is overlaid on top of the test image, however, the mask is squeezed instead of cropped, so there are zero-padded areas near the boundaries of the overlay image, and the relevant content does not align with the corresponding content of the test image. In this case, since the network input was square but the test image was 16:9, we can see zero-padded bands at the top and at the bottom.
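
The interpretation described above (relevant content centered in the mask, zero padding around it) can be undone by cropping before resizing. The following is a minimal NumPy sketch under that assumption only. It is not how TLT visualizes masks, and crop_padding_and_resize is a hypothetical helper.

```python
import numpy as np


def crop_padding_and_resize(mask, image_h, image_w):
    """Crop the centered content out of a padded mask, then resize it
    to the test image size with nearest-neighbor sampling."""
    mask_h, mask_w = mask.shape
    if mask_w * image_h <= mask_h * image_w:
        # Image is relatively wider than the mask: bands at top and bottom
        content_w = mask_w
        content_h = round(mask_w * image_h / image_w)
    else:
        # Image is relatively taller than the mask: bands at left and right
        content_h = mask_h
        content_w = round(mask_h * image_w / image_h)
    top = (mask_h - content_h) // 2
    left = (mask_w - content_w) // 2
    content = mask[top:top + content_h, left:left + content_w]
    # Nearest-neighbor index maps from image pixels to content pixels
    rows = np.arange(image_h) * content_h // image_h
    cols = np.arange(image_w) * content_w // image_w
    return content[rows[:, None], cols[None, :]]


# Synthetic 512x512 mask for a 1280x720 image: class 7 in the middle
# 288 rows, zero-padded bands above and below
mask = np.zeros((512, 512), dtype=np.uint8)
mask[112:400, :] = 7
full = crop_padding_and_resize(mask, image_h=720, image_w=1280)
print(full.shape, np.unique(full))
```

Squeezing the whole 512x512 mask to 1280x720 would instead keep the zero bands and shift the content vertically, which matches the misalignment described above.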

I’m still running the first experiment of this topic without AMP. The training definitely seems to be going in the right direction, so I would conclude that disabling AMP was the solution to this problem. I will verify this once the training is complete.
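
For readers wondering how reduced precision can produce NaN at all, here is a generic NumPy illustration of the float16 range limit (the largest finite float16 value is 65504). It is not a diagnosis of what happens inside TLT: AMP implementations typically include safeguards such as loss scaling, and naive_cross_entropy is a deliberately unsafe, hypothetical function.

```python
import numpy as np


def naive_cross_entropy(logits, true_class):
    """Softmax cross-entropy without the usual max-subtraction trick."""
    with np.errstate(over="ignore", invalid="ignore"):
        exp = np.exp(logits)       # can overflow to inf in float16
        probs = exp / exp.sum()    # inf / inf gives NaN
        return -np.log(probs[true_class])


logits = [0.0, 12.0]  # exp(12) is about 162755, above the float16 maximum
loss32 = naive_cross_entropy(np.array(logits, dtype=np.float32), true_class=1)
loss16 = naive_cross_entropy(np.array(logits, dtype=np.float16), true_class=1)
print(loss32, loss16)
```

The float32 loss is a small finite number, while the float16 loss is NaN, even though the inputs are identical.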

I used the --use_amp flag. The command can be found in the first post of this topic.

I haven’t tried the latest docker image yet, but will do so soon.

Thanks for the finding. I will double check the AMP.
Also, the latest version (21.08) changed the resizing and visualization in inference, so I suggest using it if you have time.

So I still think that the NaN problem was solved by disabling AMP. However, I’m still unable to get very good results. I’m now using the new TAO docker image and the following spec file.
model.txt (2.8 KB)

After 120 epochs, the inference result (colorized with my own colormap) looks like this:

As a comparison, the ground truth with the same colormap looks like this:

The loss looks like this:

While the loss is not diverging (like it did with AMP), it’s not really converging either. What we see here is the result after 210000 steps which took about 24 hours to train. I also tried some intermediate tlt files and they didn’t look any better.

Are there any publicly available multi-class datasets that are shown to work well with UNet? The other topic used Mapillary Vistas to get reasonable results, but the masks in Mapillary Vistas require some amount of processing before they can be given to UNet. So I’m wondering if there is a dataset that would work out-of-the-box.

Thanks for the result. Could you explain more about your finding for “the masks in Mapillary Vistas require some amount of processing before they can be given to UNet”?

I thought the masks were 3-channel images, but now when I took another look, they are actually single-channel 8-bit images which should work directly with UNet. For some reason, the mapping between class names and class numbers is done via the 3-channel colormap values that are also included in the png files. That’s a bit confusing, but should not affect training with UNet. So I’m going to see how the training works with Mapillary Vistas.

I tried Mapillary Vistas but it didn’t change anything. The spec file is the same as in the other topic, except for the file paths:
model.txt (19.4 KB)

I launched the docker container like this:

docker run -it --gpus all \
  -v /path/to/mapillary_vistas_2:/mapillary_vistas_2:ro \
  -v /path/to/my_folder:/workspace/my_folder \

I downloaded the pretrained model:

ngc registry model download-version nvidia/tao/pretrained_semantic_segmentation:resnet18

Then I trained the model like this:

unet train \
  -e /workspace/my_folder/model.txt \
  -m /workspace/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
  -r /workspace/my_folder/output \
  -k my_key

This is the training loss:

And then I ran inference like this:

unet inference \
  -e /workspace/my_folder/model.txt \
  -m /workspace/my_folder/output/weights/model.tlt \
  -o /workspace/my_folder/infer \
  -k my_key

This is the result (I resized the image before uploading it):

I have also tried another computer (GTX 1080 with 465.19.01 drivers). I have also tried to rename all files so that the filenames consist of sequential numbers with leading zeros, but that also didn’t change anything. I don’t know what to try next.

Not sure what is happening. I’m asking more info from other topics.

Please try again with the previous 3.0-py3 docker.
But please note that in that version, see Open Model Architectures — Transfer Learning Toolkit 3.0 documentation

The train tool does not support training on images of multiple resolutions. All of the images and masks must be of equal size. However, image and masks need not be necessarily equal to model input size. The images/ masks will be resized to the model input size during training.

You need to resize the images and labels so that they are all of equal size.
Also, please note that the model input height and width should each be a multiple of 32.
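
The multiple-of-32 requirement is easy to check before launching a training run. A small sketch follows; both helper names are hypothetical and not part of TLT.

```python
def is_valid_input_size(height, width, multiple=32):
    """True if both dimensions are exact multiples of `multiple`."""
    return height % multiple == 0 and width % multiple == 0


def round_up_to_multiple(value, multiple=32):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


# 288x512 (the model input size from the first post) passes the check
print(is_valid_input_size(288, 512))
# The native BDD100K height of 720 does not; the next valid height is 736
print(is_valid_input_size(720, 1280), round_up_to_multiple(720))
```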

All of the above requirements were already satisfied in my first experiments using the BDD100k dataset. However, I did the necessary resizing for Mapillary Vistas and tried it again. The results didn’t improve at all.

This is my Python script that resizes images and labels. Instead of padding, I simply stretch them to be 512x512. This shouldn’t be a major issue, because all images are already somewhat close to square.

#!/usr/bin/env python3

import os
import cv2 as cv
import numpy as np
from PIL import Image

TRAIN_IMAGE_DIR = '/path/to/mapillary_vistas_2/training/images'
TRAIN_LABEL_DIR = '/path/to/mapillary_vistas_2/training/v2.0/labels'
VAL_IMAGE_DIR = '/path/to/mapillary_vistas_2/validation/images'
VAL_LABEL_DIR = '/path/to/mapillary_vistas_2/validation/v2.0/labels'

target_size = (512, 512)
dir_suffix = '_{}x{}'.format(*target_size)

def resize_image(filename_old, filename_new):
    image = cv.imread(filename_old, cv.IMREAD_UNCHANGED)
    image = cv.resize(image, dsize=target_size, interpolation=cv.INTER_AREA)
    cv.imwrite(filename_new, image)

def resize_label(filename_old, filename_new):
    # Read the palette PNG as a 2-D array of class indices, not RGB colors
    image = np.array(Image.open(filename_old).convert('P'))
    # Nearest-neighbor interpolation keeps every pixel a valid class index
    image = cv.resize(image, dsize=target_size, interpolation=cv.INTER_NEAREST)
    cv.imwrite(filename_new, image)

def process_dir(dirname_old, resize_fun):
    dirname_new = os.path.normpath(dirname_old) + dir_suffix
    if os.path.exists(dirname_new):
        print('{} already exists, skipping'.format(dirname_new))
        return
    os.makedirs(dirname_new)
    files = os.listdir(dirname_old)
    for f in files:
        filename_old = os.path.join(dirname_old, f)
        filename_new = os.path.join(dirname_new, f)
        resize_fun(filename_old, filename_new)

image_dirs = [TRAIN_IMAGE_DIR, VAL_IMAGE_DIR]
label_dirs = [TRAIN_LABEL_DIR, VAL_LABEL_DIR]

for d in image_dirs:
    process_dir(d, resize_image)

for d in label_dirs:
    process_dir(d, resize_label)
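
The script resizes labels with cv.INTER_NEAREST rather than an averaging filter. The reason can be shown with a tiny NumPy sketch, in which the two downsample functions are simplified stand-ins for the OpenCV interpolation modes and not their actual implementations.

```python
import numpy as np


def downsample_nearest(label, factor):
    """Keep every `factor`-th pixel; output values always exist in the input."""
    return label[::factor, ::factor]


def downsample_mean(label, factor):
    """Average each factor x factor block (height and width must be divisible)."""
    h, w = label.shape
    blocks = label.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3)).round().astype(label.dtype)


# A 2x2 patch on the boundary between class 0 and class 18
patch = np.array([[0, 18],
                  [0, 18]], dtype=np.uint8)
print(downsample_nearest(patch, 2))
print(downsample_mean(patch, 2))
```

Averaging produces 9, a class ID that appears nowhere in the patch, so any smoothing interpolation would silently corrupt label masks along class boundaries.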

Here is an example label:

The spec file is the same as before, except that the file paths are adjusted for the resized images and labels:
model.txt (19.5 KB)

All commands are the same as in the previous post, except I used the TLT docker image instead of TAO. Also, “my_folder” is a new empty folder.

This is the training loss:

And this is the inference mask corresponding to the label above. In fact, in this case the inference contains only one value, namely 11:
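
A collapse to a single predicted class like this can also be quantified instead of judged by eye. Below is a minimal sketch of per-class intersection-over-union computed from a confusion matrix. per_class_iou is a hypothetical helper, the arrays are synthetic, and the function assumes all values are in the range 0..num_classes-1.

```python
import numpy as np


def per_class_iou(pred, target, num_classes):
    """IoU per class; NaN for classes absent from both pred and target."""
    pred = pred.ravel().astype(np.int64)
    target = target.ravel().astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predictions
    cm = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        return intersection / union


# Synthetic example: ground truth has classes 0 and 1, prediction is all 1
target = np.array([[0, 0], [1, 1]], dtype=np.uint8)
pred = np.ones((2, 2), dtype=np.uint8)
iou = per_class_iou(pred, target, num_classes=3)
print(iou)
```

With a constant prediction, every class except the predicted one gets an IoU of 0, which makes the failure visible in a single array.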

We can probably conclude that UNet does not work on the GPU architectures that I have access to. I have tested with a V100 (Volta) and a GTX 1080 (Pascal). I don’t have access to other GPUs right now, but the few others who have been able to use UNet successfully seem to be using other GPU architectures.