Logging validation loss for TAO UNet model

Hello,

We are training a TAO UNet model, but the training log only reports the training loss. We would also like to track the validation loss.

Can someone tell me whether there is any way to log the validation loss as well for our TAO UNet model?

Thanks in advance!

• Hardware (T4/V100)
• Network Type (Unet)
• TLT

Configuration of the TAO Toolkit Instance

dockers: 		
	nvidia/tao/tao-toolkit-tf: 			
		v3.21.11-tf1.15.5-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. augment
				2. bpnet
		v3.21.11-tf1.15.4-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. detectnet_v2
				2. faster_rcnn
	nvidia/tao/tao-toolkit-pyt: 			
		v3.21.11-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. speech_to_text
				2. speech_to_text_citrinet
				3. text_classification
				4. question_answering
				5. token_classification
				6. intent_slot_classification
				7. punctuation_and_capitalization
				8. spectro_gen
				9. vocoder
				10. action_recognition
	nvidia/tao/tao-toolkit-lm: 			
		v3.21.08-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. n_gram
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file

random_seed: 42
dataset_config {
  augment: true
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "train_aug"
  train_masks_path: "trainannot_aug"
  val_images_path: "val"
  val_masks_path: "valannot"
  test_images_path: "test"
  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
    }
    target_classes {
      name: "***"
      label_id: 1
      mapping_class: "***"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 18
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 512
  model_input_width: 512
  model_input_channels: 3
}
training_config {
  batch_size: 16
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 1
  learning_rate: 9.999999747378752e-05
  loss: "cross_entropy"
  epochs: 200
  weights_monitor: true
}

Sorry for the late reply. Could you share the training log?

Here you go. We ran it for 200 epochs, and our training log looks like this.

Loading experiment spec at /workspace/tao-experiments/*****/*****/******/unet_train_resnet_unet_isbi.txt.
Running for 200 Epochs
Epoch: 0/200:, Cur-Step: 0, loss(cross_entropy): 1.14998, Running average loss:1.14998, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 1, loss(cross_entropy): 1.13914, Running average loss:1.14456, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 2, loss(cross_entropy): 1.12712, Running average loss:1.13875, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 3, loss(cross_entropy): 1.11787, Running average loss:1.13353, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 4, loss(cross_entropy): 1.09061, Running average loss:1.12494, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 5, loss(cross_entropy): 1.08689, Running average loss:1.11860, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 6, loss(cross_entropy): 1.05918, Running average loss:1.11011, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 7, loss(cross_entropy): 1.01804, Running average loss:1.09860, Time taken: 0:00:00 ETA: 0:00:00
.
.
.
Epoch: 199/200:, Cur-Step: 18991, loss(cross_entropy): 0.08567, Running average loss:0.07906, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18992, loss(cross_entropy): 0.08031, Running average loss:0.07907, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18993, loss(cross_entropy): 0.08073, Running average loss:0.07909, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18994, loss(cross_entropy): 0.07960, Running average loss:0.07910, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18995, loss(cross_entropy): 0.07072, Running average loss:0.07900, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18996, loss(cross_entropy): 0.07482, Running average loss:0.07896, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18997, loss(cross_entropy): 0.07393, Running average loss:0.07890, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18998, loss(cross_entropy): 0.07139, Running average loss:0.07882, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18999, loss(cross_entropy): 0.07419, Running average loss:0.07878, Time taken: 0:00:47.431175 ETA: 0:00:47.431175
Saving the final step model to /workspace/tao-experiments/Delivery2/*****/*****/******/train_out/weights/model.tlt
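
Since the log only contains training loss, one thing that can be done with it is to parse the step lines and summarize the training loss per epoch, for example to plot it later next to separately computed validation results. Below is a minimal Python sketch, assuming the log lines keep exactly the format shown above. The function names `parse_training_log` and `mean_loss_per_epoch` are made up for illustration and are not part of TAO.

```python
import re
from collections import defaultdict

# Matches step lines of the form shown in the log above, e.g.
# "Epoch: 0/200:, Cur-Step: 0, loss(cross_entropy): 1.14998, Running average loss:1.14998, ..."
STEP_LINE = re.compile(
    r"Epoch:\s*(\d+)/(\d+):,\s*Cur-Step:\s*(\d+),\s*"
    r"loss\(([^)]*)\):\s*([0-9.eE+-]+),\s*"
    r"Running average loss:\s*([0-9.eE+-]+)"
)


def parse_training_log(lines):
    """Return one dict per step line; lines that do not match are skipped."""
    records = []
    for line in lines:
        match = STEP_LINE.search(line)
        if match:
            records.append({
                "epoch": int(match.group(1)),
                "step": int(match.group(3)),
                "loss": float(match.group(5)),
                "running_avg": float(match.group(6)),
            })
    return records


def mean_loss_per_epoch(records):
    """Average the per-step training loss within each epoch."""
    totals = defaultdict(lambda: [0.0, 0])  # epoch -> [sum of losses, step count]
    for rec in records:
        totals[rec["epoch"]][0] += rec["loss"]
        totals[rec["epoch"]][1] += 1
    return {epoch: total / count for epoch, (total, count) in sorted(totals.items())}


if __name__ == "__main__":
    # Two lines copied from the log above, plus one non-step line that is ignored.
    sample = [
        "Running for 200 Epochs",
        "Epoch: 0/200:, Cur-Step: 0, loss(cross_entropy): 1.14998, Running average loss:1.14998, Time taken: 0:00:00 ETA: 0:00:00",
        "Epoch: 0/200:, Cur-Step: 1, loss(cross_entropy): 1.13914, Running average loss:1.14456, Time taken: 0:00:00 ETA: 0:00:00",
    ]
    print(mean_loss_per_epoch(parse_training_log(sample)))
```

This only summarizes the training loss that is already in the log. It does not produce a validation loss.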

I will follow up with the internal team about this feature.
In the meantime, you can run "tao unet evaluate" to run validation.

This is a known limitation in UNet: the estimator only performs evaluation at the end of training.
The internal team will review this feature request.
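
Until this is supported, a possible workaround is to evaluate each saved checkpoint after training, since the spec above uses checkpoint_interval: 1. The Python sketch below only builds the command lines and does not run anything. Several things in it are assumptions to verify against your installation: the `-e`, `-m`, `-o` and `-k` flags for `tao unet evaluate`, the `*.tlt` glob used to find checkpoints, and that the paths you pass are the ones the TAO launcher expects. `build_evaluate_commands` is a hypothetical helper, not a TAO API, and the metrics you get back are whatever `tao unet evaluate` reports, which may not include a loss value.

```python
from pathlib import Path


def build_evaluate_commands(weights_dir, spec_path, results_dir, key, pattern="*.tlt"):
    """Build one 'tao unet evaluate' command per checkpoint found in weights_dir.

    The flag names and the checkpoint glob are assumptions; check them
    against 'tao unet evaluate --help' and your own weights directory.
    """
    commands = []
    for checkpoint in sorted(Path(weights_dir).glob(pattern)):
        # One output folder per checkpoint so results do not overwrite each other.
        output_dir = Path(results_dir) / checkpoint.stem
        commands.append([
            "tao", "unet", "evaluate",
            "-e", str(spec_path),
            "-m", str(checkpoint),
            "-o", str(output_dir),
            "-k", key,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder paths and key, for illustration only.
    for cmd in build_evaluate_commands("train_out/weights", "spec.txt", "eval_out", "YOUR_KEY"):
        print(" ".join(cmd))
```

Each printed command can be reviewed and then run by hand or passed to `subprocess.run`.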