Logging validation loss for TAO UNet model

Hello,

We are training our TAO UNet model, but it looks like the logging is limited to the training loss; we would also like to log the validation loss.

Can someone tell me whether there is any option to log the validation loss as well for our TAO UNet model?

Thanks in advance!

• Hardware (T4/V100)
• Network Type (Unet)
• TLT

Configuration of the TAO Toolkit Instance

dockers: 		
	nvidia/tao/tao-toolkit-tf: 			
		v3.21.11-tf1.15.5-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. augment
				2. bpnet
		v3.21.11-tf1.15.4-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. detectnet_v2
				2. faster_rcnn
	nvidia/tao/tao-toolkit-pyt: 			
		v3.21.11-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. speech_to_text
				2. speech_to_text_citrinet
				3. text_classification
				4. question_answering
				5. token_classification
				6. intent_slot_classification
				7. punctuation_and_capitalization
				8. spectro_gen
				9. vocoder
				10. action_recognition
	nvidia/tao/tao-toolkit-lm: 			
		v3.21.08-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. n_gram
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file

random_seed: 42
dataset_config {
  augment: true
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "train_aug"
  train_masks_path: "trainannot_aug"
  val_images_path: "val"
  val_masks_path: "valannot"
  test_images_path: "test"
  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
    }
    target_classes {
      name: "***"
      label_id: 1
      mapping_class: "***"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 18
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 512
  model_input_width: 512
  model_input_channels: 3
}
training_config {
  batch_size: 16
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 1
  learning_rate: 9.999999747378752e-05
  loss: "cross_entropy"
  epochs: 200
  weights_monitor: true
}

Sorry for the late reply. Could you share the training log?

Here you go. We have run it for 200 epochs, and our training log looks like this.

Loading experiment spec at /workspace/tao-experiments/*****/*****/******/unet_train_resnet_unet_isbi.txt.
Running for 200 Epochs
Epoch: 0/200:, Cur-Step: 0, loss(cross_entropy): 1.14998, Running average loss:1.14998, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 1, loss(cross_entropy): 1.13914, Running average loss:1.14456, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 2, loss(cross_entropy): 1.12712, Running average loss:1.13875, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 3, loss(cross_entropy): 1.11787, Running average loss:1.13353, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 4, loss(cross_entropy): 1.09061, Running average loss:1.12494, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 5, loss(cross_entropy): 1.08689, Running average loss:1.11860, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 6, loss(cross_entropy): 1.05918, Running average loss:1.11011, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 7, loss(cross_entropy): 1.01804, Running average loss:1.09860, Time taken: 0:00:00 ETA: 0:00:00
.
.
.
Epoch: 199/200:, Cur-Step: 18991, loss(cross_entropy): 0.08567, Running average loss:0.07906, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18992, loss(cross_entropy): 0.08031, Running average loss:0.07907, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18993, loss(cross_entropy): 0.08073, Running average loss:0.07909, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18994, loss(cross_entropy): 0.07960, Running average loss:0.07910, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18995, loss(cross_entropy): 0.07072, Running average loss:0.07900, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18996, loss(cross_entropy): 0.07482, Running average loss:0.07896, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18997, loss(cross_entropy): 0.07393, Running average loss:0.07890, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18998, loss(cross_entropy): 0.07139, Running average loss:0.07882, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18999, loss(cross_entropy): 0.07419, Running average loss:0.07878, Time taken: 0:00:47.431175 ETA: 0:00:47.431175
Saving the final step model to /workspace/tao-experiments/Delivery2/*****/*****/******/train_out/weights/model.tlt

I will follow up with the internal team about this feature.
In the meantime, you can run "tao unet evaluate" to run validation.
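For example, something along these lines (a sketch only; the paths and key below are placeholders rather than the actual ones from this thread, and the exact flags can be confirmed with "tao unet evaluate --help"):

tao unet evaluate -e <path_to_experiment_spec> \
                  -m <path_to_model.tlt> \
                  -o <output_dir> \
                  -k <encryption_key>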

It is a known constraint in UNet: the estimator only performs evaluation at the end of training.
The internal team will review this feature request.

Hello. Are there any updates on logging the validation loss?
I would like to plot it alongside the training loss in TensorBoard to interpret the results (overfitting/underfitting).
If not, does the TAO Toolkit offer an alternative way to do this analysis?

Thank you in advance.

Please refer to
https://docs.nvidia.com/tao/tao-toolkit/text/semantic_segmentation/unet.html

UNet supports TensorBoard visualization of the losses, the predicted masks on training images during training, and ground-truth mask overlays on the input images. The TensorBoard event logs are saved to the output/events directory so they can be visualized.
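For reference, the example on that page enables the visualizer inside the training_config section, roughly like this (a sketch based on the linked documentation; please verify the exact field names against the spec reference for your TAO version):

training_config {
  # ... existing training parameters ...
  visualizer {
    enabled: true
  }
}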

I have indeed checked that resource. The spec file it provides as an example only has a training_config section. So, to display the validation loss alongside the training loss, should I add something like this?

validation_config {
  visualizer {
    enabled: true
  }
}

So does TAO UNet not support validation loss logging?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Hi, UNet does not support validation during training.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.