UNet training progress counter frozen after ~18.000 steps

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): 1x Nvidia A5000 GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): UNet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): tao 5.0
• Training spec file (If have, please share here): unet_train.txt (1.2 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Train the UNet model for more than ~18000 steps.

I am currently working on a semantic segmentation task with the TAO 5.0 toolkit. I have ~100.000 images with a resolution of 1216x1216, on which I want to train a UNet binary segmentation model.

I am using the smallest configuration for the UNet (ResNet10 backbone and an input resolution of 608x608x3).
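
For reference, the relevant part of my spec looks roughly like this (an abridged sketch using the standard TAO UNet spec field names; the full file is attached above as unet_train.txt):

  model_config {
    arch: "resnet"
    num_layers: 10
    model_input_width: 608
    model_input_height: 608
    model_input_channels: 3
  }
  training_config {
    epochs: 15
    # batch size, optimizer etc. as in the attached spec
  }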

After training for ~18.000 steps, the progress counter (which counts in fractions of an epoch) gets stuck. As a result, model training does not stop correctly after the specified number of epochs. This is the cmd output from when the training progress counter is frozen:
unet_train_output.txt (9.4 KB)

Could you please share the full log?

Also, did you ever run the official UNet notebook successfully?

Hi Morganh,

Thanks for your reply. Here is the status.json file from training the UNet (attached below).

The full cmd output is very long, since training runs for more than 18.000 steps. There is no error message; training is just stuck in an endless loop. This is because the termination condition of reaching epoch 15 is never met, as the epoch counter is frozen at epoch 1.889811/15.

I did not run the UNet jupyter notebook yet, since the ISBI dataset link in the notebook only points to data for one train and one test image. My issue only happens after training for 18.000 steps, so I don't expect it would show up there. I did, however, successfully train the UNet for one epoch, but when training for more than one epoch I run into the described problem.

Best
tilman

status.json (832 Bytes)

From your training spec file, the number of epochs is set to 15.
Can you try to run the experiments below to narrow this down?

  • Try to set a lower batch size, for example batch size 2.
  • Please use a lower model_input_width and model_input_height, for example 320x320 (see the spec excerpt after this list).
  • How about training for 3 epochs?
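
For example, in the spec these changes would look roughly like this (a sketch assuming the usual TAO UNet spec layout; leave the rest of the file unchanged):

  model_config {
    model_input_width: 320
    model_input_height: 320
    # other model_config fields unchanged
  }
  training_config {
    batch_size: 2
    epochs: 3
    # other training_config fields unchanged
  }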

During training, please check if there is an out-of-memory issue. You can monitor it with nvidia-smi.
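
For example, to print GPU memory usage every 5 seconds while training runs:

  nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5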

Hi Morganh,

I ran training with a batch size of 2 and 320x320x3 for three epochs. However, the epoch and step counters are frozen at ~0.4 epochs and 15.000 steps. I am certain this is not an out-of-memory issue, since GPU memory utilisation is at 6/24 GB and the model is still training. Still, as I said, the epoch and step progress counters are frozen.

I am a little confused about the step counts above. Some say 18000 steps but others say 18.000 steps. May I know which is correct?

Also, could you share the full training log? What you shared previously was only the status.json file. I would like to take a look at the training log to check further as well. Thanks.

Hi Morganh,
This is the entire bash log:
unet_output.txt (4.9 MB)
You can see that training proceeds normally until step 15269. At step 15269, however, the step and epoch counters freeze, while the loss is still updating and training appears to continue as normal. I had to stop the training run manually, since it seems to be stuck in an endless loop and no epoch-related actions such as checkpoint saving are performed.

To narrow this down, could you please train with 1000 images? You can generate a new folder and copy these 1000 images into it.

  train_images_path: "/data/_seg_data/images/train_1000"
  train_masks_path: "/data/_seg_data/masks/train_1000"

And also use these 1000 images for validation.

  val_images_path: "/data/_seg_data/images/train_1000"
  val_masks_path: "/data/_seg_data/masks/train_1000"
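
To build that subset, something like the following should work (a sketch; it assumes the original data lives in /data/_seg_data/images/train and /data/_seg_data/masks/train and that image and mask files share the same file names):

  mkdir -p /data/_seg_data/images/train_1000 /data/_seg_data/masks/train_1000
  # copy the first 1000 image/mask pairs into the new folders
  ls /data/_seg_data/images/train | head -n 1000 | while read f; do
    cp "/data/_seg_data/images/train/$f" /data/_seg_data/images/train_1000/
    cp "/data/_seg_data/masks/train/$f" /data/_seg_data/masks/train_1000/
  done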

Same problem after 15066 steps.
unet_output2.txt (5.6 MB)

You are using “docker run” and running the UNet training inside it, right?

I still suspect this is a RAM-related issue. You can temporarily increase the swap memory in the Linux system. Once the training is done, you can release and delete the swap memory.
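
A typical sequence for this (the size and file path are only examples):

  sudo fallocate -l 64G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  # after training has finished, release and delete it again:
  sudo swapoff /swapfile
  sudo rm /swapfile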

Or can you use fewer images (for example, 10 or 50 training images) and retry?
If it still happens, please share the minimum dataset with me so I can reproduce it.

Also, I still suggest you run the official notebook to check whether the issue happens there.

I am using “tao model unet train …” as in the notebook. Basically, I am following the code in the notebook, adapted to my dataset, so there are a lot of dataset-specific steps from the official notebook that I don't need.

In the tao_mounts.json I have shm_size set to 64GB, so it is hard to imagine memory issues. As you can see in the log, the training is still continuing and the loss is changing each step. I was also able to successfully train the SegFormer model on the same dataset, so I am pretty sure it is some issue with the UNet training code.
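
For reference, the DockerOptions section of my tao_mounts.json is along these lines (an excerpt; the Mounts section with my data paths is omitted):

  "DockerOptions": {
      "shm_size": "64G"
  }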

OK, I will try to reproduce.

If possible, could you share the minimum dataset which can reproduce this issue?
You can share the link with me via private message.

Ok, thanks. Unfortunately, I can't give you the data since it is customer data. But I also get the error when using only 5 images for training and increasing the number of epochs. This is the log for that run:
unet_output3.txt (7.1 MB)

I can reproduce the issue with a public dataset now. I will check further.