I am trying to train a multi-class semantic segmentation network with Transfer Learning Toolkit 3.0 (UNet) on the BDD100K dataset. The dataset contains 10,000 JPEG images (7k/1k/2k train/val/test split). Each image is 1280x720, and every image (except those in the test set) has an associated 8-bit single-channel PNG mask containing the ground-truth segmentation. There are 19 classes plus “unknown”, as described here: Label Format — BDD100K documentation
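For reference, the mask encoding can be confirmed with a quick check like this (a minimal sketch assuming Pillow and NumPy; the file name is just a placeholder):

import numpy as np
from PIL import Image

# Open one ground-truth mask; the file name below is a placeholder.
mask = Image.open("/bdd100k/labels/sem_seg/masks/train/example.png")
print(mask.mode)        # "L" = 8-bit single channel
arr = np.array(mask)
print(arr.shape)        # expect (720, 1280)
print(np.unique(arr))   # expect a subset of 0..18 plus 255 ("unknown")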
My problem is that the training always stops prematurely as the loss becomes NaN:
ERROR:tensorflow:Model diverged with loss = NaN.
Below is by far the most accurate inference result I have seen so far. Green is sky, blue is road, and the red pixels belong to the “unknown” class. (I’m not sure whether I’m allowed to post the original image here, but you can probably imagine that the upper part should be sky, the lower part should be road, and the sides should be vegetation.) This result corresponds to step 4375 with the spec file below. The loss decreases slightly at the very beginning of training, but after about step 5000 it starts to increase, eventually leading to the failure above.
random_seed: 42
dataset_config {
  augment: true
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "/bdd100k/images/10k/train"
  train_masks_path: "/bdd100k/labels/sem_seg/masks/train"
  val_images_path: "/bdd100k/images/10k/val"
  val_masks_path: "/bdd100k/labels/sem_seg/masks/val"
  test_images_path: "/bdd100k/images/10k/val"
  data_class_config {
    target_classes {
      name: "road"
      mapping_class: "road"
    }
    target_classes {
      name: "sidewalk"
      label_id: 1
      mapping_class: "sidewalk"
    }
    target_classes {
      name: "building"
      label_id: 2
      mapping_class: "building"
    }
    target_classes {
      name: "wall"
      label_id: 3
      mapping_class: "wall"
    }
    target_classes {
      name: "fence"
      label_id: 4
      mapping_class: "fence"
    }
    target_classes {
      name: "pole"
      label_id: 5
      mapping_class: "pole"
    }
    target_classes {
      name: "traffic light"
      label_id: 6
      mapping_class: "traffic light"
    }
    target_classes {
      name: "traffic sign"
      label_id: 7
      mapping_class: "traffic sign"
    }
    target_classes {
      name: "vegetation"
      label_id: 8
      mapping_class: "vegetation"
    }
    target_classes {
      name: "terrain"
      label_id: 9
      mapping_class: "terrain"
    }
    target_classes {
      name: "sky"
      label_id: 10
      mapping_class: "sky"
    }
    target_classes {
      name: "person"
      label_id: 11
      mapping_class: "person"
    }
    target_classes {
      name: "rider"
      label_id: 12
      mapping_class: "rider"
    }
    target_classes {
      name: "car"
      label_id: 13
      mapping_class: "car"
    }
    target_classes {
      name: "truck"
      label_id: 14
      mapping_class: "truck"
    }
    target_classes {
      name: "bus"
      label_id: 15
      mapping_class: "bus"
    }
    target_classes {
      name: "train"
      label_id: 16
      mapping_class: "train"
    }
    target_classes {
      name: "motorcycle"
      label_id: 17
      mapping_class: "motorcycle"
    }
    target_classes {
      name: "bicycle"
      label_id: 18
      mapping_class: "bicycle"
    }
    target_classes {
      name: "unknown"
      label_id: 255
      mapping_class: "unknown"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.25
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 50
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  model_input_height: 288
  model_input_width: 512
  model_input_channels: 3
}
training_config {
  batch_size: 8
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 1.0000000116860974e-07
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 5
  log_summary_steps: 175
  learning_rate: 9.999999974752427e-07
  loss: "cross_entropy"
  epochs: 240
}
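As a sanity check against the data_class_config above, the pixel values that actually occur in the masks can be compared with the declared label_id values. A minimal sketch, again assuming NumPy and Pillow:

import glob
import numpy as np
from PIL import Image

# label_ids declared above: 0..18 plus 255 for "unknown"
declared = set(range(19)) | {255}

found = set()
for path in glob.glob("/bdd100k/labels/sem_seg/masks/train/*.png"):
    found |= set(np.unique(np.array(Image.open(path))).tolist())

# Any values printed here would be labels the spec does not cover.
print("undeclared label values:", sorted(found - declared))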
The initial weights were downloaded from NGC: https://ngc.nvidia.com/catalog/models/nvidia:tlt_semantic_segmentation/files
The command line inside the container is:
unet train --use_amp \
  -e /workspace/spec_file.txt \
  -m /workspace/tlt_semantic_segmentation_vresnet50/resnet_50.hdf5 \
  -r /workspace/output \
  -k my_key
I have tried at least the following adjustments without achieving convergence:
- Different architectures (resnet_18, resnet_50, efficientnet_b0_swish)
- Different network input sizes between 224x224 and 1280x720
- Different batch sizes between 1 and 8
- Batch normalization on and off
- Different learning rates between 1e-08 and 1e-04
- Different regularization weights between 0 and 1e-03
- Converting input images from jpeg to png before training
- Manually resizing the input images and labels to match the network input
- Shifting the integer labels so that “unknown” becomes 0 and all other classes become positive integers starting from 1 (both done roughly as in the sketch after this list)
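The last two items were along these lines (a sketch assuming Pillow and NumPy; nearest-neighbor interpolation for the masks so that resizing cannot produce new label values):

import numpy as np
from PIL import Image

def resize_pair(image_path, mask_path, out_image, out_mask, size=(512, 288)):
    # Bilinear for the photo, nearest-neighbor for the mask, so that
    # interpolation cannot invent intermediate label values.
    Image.open(image_path).resize(size, Image.BILINEAR).save(out_image)
    Image.open(mask_path).resize(size, Image.NEAREST).save(out_mask)

def shift_labels(mask_path, out_mask):
    # Map "unknown" (255) to 0 and shift the 19 real classes to 1..19.
    arr = np.array(Image.open(mask_path)).astype(np.int32)
    arr = np.where(arr == 255, 0, arr + 1)
    Image.fromarray(arr.astype(np.uint8)).save(out_mask)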
Moreover, I cannot reproduce my results: each run produces different results even though random_seed in the spec file is fixed. Is this normal?
I also don’t understand the purpose of mapping_class in target_classes in the spec file. The PNG labels can only contain integers, not strings, yet mapping_class is set to strings in the documentation.
Additional information:
- Hardware: V100 16GB
- TLT docker image: nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3