UNet training progress counter frozen after ~18.000 steps

Morganh · October 20, 2023, 8:51am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Hi,
When you run training inside the 5.0 docker, please change lines 199-207 of /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/unet/scripts/train.py as shown below.
This will fix the issue.

    # Initialize env for AMP training
    if params.use_amp:
        os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
        # Enable automatic loss scaling
        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_LOSS_SCALING"] = '1'
    else:
        os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
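
For anyone who wants to check the logic outside of train.py, the snippet above can be sketched as a standalone function. This is only an illustration: the function name `configure_amp_env` and the `SimpleNamespace` stand-in for `params` are assumptions made for this example and are not part of the TAO source. Only the two environment variable names come from the fix itself.

```python
import os
from types import SimpleNamespace


def configure_amp_env(use_amp):
    """Set the AMP-related environment variables the same way the patch does.

    Hypothetical helper for illustration; train.py runs this logic inline.
    """
    if use_amp:
        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
        # Enable automatic loss scaling together with AMP
        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_LOSS_SCALING"] = "1"
    else:
        # Explicitly turn AMP off; the loss-scaling variable is left untouched,
        # mirroring the else branch of the patch
        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "0"


# Stand-in for the parsed training parameters (assumption for this sketch)
params = SimpleNamespace(use_amp=True)
configure_amp_env(params.use_amp)
print(os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"])
print(os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_LOSS_SCALING"])
```

Since these are ordinary environment variables, they presumably need to be set before TensorFlow builds the training graph, which is why the patch places them in the environment initialization part of train.py.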

Thanks.
