Logging Validation loss for Tao-Unet model

nitinp14920914 · February 17, 2022, 3:49pm

Hello,

We are training a TAO UNet model, but the training log only reports the training loss. We would also like to log the validation loss.

Can someone tell me whether there is any way to log the validation loss for our TAO UNet model as well?

Thanks in advance!

• Hardware (T4/V100)
• Network Type (Unet)
• TLT

Configuration of the TAO Toolkit Instance

dockers: 		
	nvidia/tao/tao-toolkit-tf: 			
		v3.21.11-tf1.15.5-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. augment
				2. bpnet
		v3.21.11-tf1.15.4-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. detectnet_v2
				2. faster_rcnn
	nvidia/tao/tao-toolkit-pyt: 			
		v3.21.11-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. speech_to_text
				2. speech_to_text_citrinet
				3. text_classification
				4. question_answering
				5. token_classification
				6. intent_slot_classification
				7. punctuation_and_capitalization
				8. spectro_gen
				9. vocoder
				10. action_recognition
	nvidia/tao/tao-toolkit-lm: 			
		v3.21.08-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. n_gram
format_version: 2.0
toolkit_version: 3.21.11
published_date: 11/08/2021

• Training spec file

random_seed: 42
dataset_config {
  augment: true
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "train_aug"
  train_masks_path: "trainannot_aug"
  val_images_path: "val"
  val_masks_path: "valannot"
  test_images_path: "test"
  data_class_config {
    target_classes {
      name: "background"
      mapping_class: "background"
    }
    target_classes {
      name: "***"
      label_id: 1
      mapping_class: "***"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 18
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 512
  model_input_width: 512
  model_input_channels: 3
}
training_config {
  batch_size: 16
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 1
  learning_rate: 9.999999747378752e-05
  loss: "cross_entropy"
  epochs: 200
  weights_monitor: true
}

Morganh · February 20, 2022, 3:04pm

Sorry for the late reply. Could you share the training log?

nitinp14920914 · February 21, 2022, 9:04am

Here you go. We ran it for 200 epochs, and our training log looks like this:

Loading experiment spec at /workspace/tao-experiments/*****/*****/******/unet_train_resnet_unet_isbi.txt.
Running for 200 Epochs
Epoch: 0/200:, Cur-Step: 0, loss(cross_entropy): 1.14998, Running average loss:1.14998, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 1, loss(cross_entropy): 1.13914, Running average loss:1.14456, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 2, loss(cross_entropy): 1.12712, Running average loss:1.13875, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 3, loss(cross_entropy): 1.11787, Running average loss:1.13353, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 4, loss(cross_entropy): 1.09061, Running average loss:1.12494, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 5, loss(cross_entropy): 1.08689, Running average loss:1.11860, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 6, loss(cross_entropy): 1.05918, Running average loss:1.11011, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 7, loss(cross_entropy): 1.01804, Running average loss:1.09860, Time taken: 0:00:00 ETA: 0:00:00
.
.
.
Epoch: 199/200:, Cur-Step: 18991, loss(cross_entropy): 0.08567, Running average loss:0.07906, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18992, loss(cross_entropy): 0.08031, Running average loss:0.07907, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18993, loss(cross_entropy): 0.08073, Running average loss:0.07909, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18994, loss(cross_entropy): 0.07960, Running average loss:0.07910, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18995, loss(cross_entropy): 0.07072, Running average loss:0.07900, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18996, loss(cross_entropy): 0.07482, Running average loss:0.07896, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18997, loss(cross_entropy): 0.07393, Running average loss:0.07890, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18998, loss(cross_entropy): 0.07139, Running average loss:0.07882, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18999, loss(cross_entropy): 0.07419, Running average loss:0.07878, Time taken: 0:00:47.431175 ETA: 0:00:47.431175
Saving the final step model to /workspace/tao-experiments/Delivery2/*****/*****/******/train_out/weights/model.tlt
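
The log above only carries the per-step training loss, but it can still be turned into a per-epoch curve for plotting. Below is a minimal sketch in Python, assuming the line format shown in the log above. `parse_training_log` and `mean_loss_per_epoch` are hypothetical helper names for illustration, not part of TAO.

```python
import re
from collections import defaultdict

# Matches lines such as:
# Epoch: 0/200:, Cur-Step: 1, loss(cross_entropy): 1.13914, Running average loss:1.14456, ...
LINE_RE = re.compile(
    r"Epoch:\s*(?P<epoch>\d+)/(?P<total>\d+):,\s*"
    r"Cur-Step:\s*(?P<step>\d+),\s*"
    r"loss\((?P<name>[^)]+)\):\s*(?P<loss>[^,\s]+)"
)


def parse_training_log(lines):
    """Return a list of (epoch, step, loss) tuples; non-matching lines are skipped."""
    records = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        try:
            loss = float(m["loss"])
        except ValueError:
            continue  # unexpected token where the loss value should be
        records.append((int(m["epoch"]), int(m["step"]), loss))
    return records


def mean_loss_per_epoch(records):
    """Average the per-step training loss within each epoch."""
    sums = defaultdict(lambda: [0.0, 0])
    for epoch, _step, loss in records:
        sums[epoch][0] += loss
        sums[epoch][1] += 1
    return {epoch: total / count for epoch, (total, count) in sorted(sums.items())}


# Four lines copied from the log in this thread.
SAMPLE_LOG = """\
Epoch: 0/200:, Cur-Step: 0, loss(cross_entropy): 1.14998, Running average loss:1.14998, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 0/200:, Cur-Step: 1, loss(cross_entropy): 1.13914, Running average loss:1.14456, Time taken: 0:00:00 ETA: 0:00:00
Epoch: 199/200:, Cur-Step: 18998, loss(cross_entropy): 0.07139, Running average loss:0.07882, Time taken: 0:00:47.385825 ETA: 0:00:47.385825
Epoch: 199/200:, Cur-Step: 18999, loss(cross_entropy): 0.07419, Running average loss:0.07878, Time taken: 0:00:47.431175 ETA: 0:00:47.431175
"""

if __name__ == "__main__":
    for epoch, loss in mean_loss_per_epoch(parse_training_log(SAMPLE_LOG.splitlines())).items():
        print(f"epoch {epoch}: mean training loss {loss:.5f}")
```

On the full log this would give one value per epoch (19,000 steps over 200 epochs is 95 steps per epoch, assuming the steps are spread evenly). It is still training loss only, so it does not answer the overfitting question by itself.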

Morganh · February 21, 2022, 1:24pm

I will follow up with the internal team about this feature.
You can run "tao unet evaluate" to run validation.

Morganh · February 23, 2022, 9:26am

This is a known constraint in UNet: the estimator only performs evaluation at the end of training.
The internal team will review this feature request.
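
Since evaluation only runs at the end, one possible workaround is to evaluate each saved checkpoint after training (the spec above uses checkpoint_interval: 1) and compare the results across checkpoints. The sketch below only builds the command lines and does not run them. It rests on several assumptions that should be checked against your installation: that intermediate checkpoints are `.tlt` files with a step or epoch number in the file name, and that `tao unet evaluate` accepts `-e` (spec), `-m` (model), `-o` (output directory) and `-k` (key), which `tao unet evaluate --help` should confirm. `build_evaluate_commands` is a hypothetical helper, and the evaluation output is whatever metrics the tool reports, which may not include a loss value.

```python
import glob
import os
import re


def checkpoint_number(path):
    """Return the first integer found in the file name, or None if there is none."""
    m = re.search(r"\d+", os.path.basename(path))
    return int(m.group(0)) if m else None


def build_evaluate_commands(weights_dir, spec_path, results_root, key, pattern="*.tlt"):
    """Build one evaluate command per numbered checkpoint, ordered by that number.

    The file pattern and the CLI flags are assumptions; adjust them to match
    what your TAO version actually writes and accepts.
    """
    checkpoints = [
        p for p in glob.glob(os.path.join(weights_dir, pattern))
        if checkpoint_number(p) is not None  # skips e.g. the final model.tlt
    ]
    commands = []
    for ckpt in sorted(checkpoints, key=checkpoint_number):
        out_dir = os.path.join(results_root, f"ckpt_{checkpoint_number(ckpt)}")
        commands.append([
            "tao", "unet", "evaluate",
            "-e", spec_path,
            "-m", ckpt,
            "-o", out_dir,
            "-k", key,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run` once the flags are confirmed. Plotting the per-checkpoint metrics next to the training loss gives a rough view of overfitting, at the cost of one evaluation run per checkpoint.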

dacunaq · March 23, 2023, 10:24pm

Hello. Are there any updates on logging validation loss?
I would like to plot it alongside the training loss in TensorBoard to interpret the results (overfitting/underfitting).
If not, does TAO Toolkit offer an alternative way to do this analysis?

Thank you in advance.

Morganh · March 24, 2023, 3:08am

Please refer to
https://docs.nvidia.com/tao/tao-toolkit/text/semantic_segmentation/unet.html

UNet supports TensorBoard visualization of the losses, of the predicted mask on training images during training, and of the ground-truth mask overlaid on input images. The TensorBoard logs are saved in the output/events directory.

dacunaq · March 24, 2023, 6:31pm

I have indeed checked that resource. The example spec file it provides only has a training_config section. So, to display the validation loss alongside the training loss, should I add something like this?

validation_config {
  visualizer {
    enabled: true
  }
}

dacunaq · March 30, 2023, 7:06pm

So does TAO Unet not support validation loss logging?

Morganh · March 31, 2023, 6:26am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Hi, Unet does not support validation during training.

system · May 2, 2023, 4:09am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
