Training stops by itself during Mask R-CNN instance segmentation training

I am training Mask R-CNN on the Mapillary Vistas dataset for instance segmentation.
My setup is as follows:

• Hardware: NVIDIA TITAN RTX (24 GB) in a Precision 7920 Tower with 32 GB of memory
• Network type: Mask R-CNN, trained on the Mapillary Vistas dataset

Training stops by itself with the output below. May I know why?

[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Killed
2022-05-11 10:07:21,449 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The full log and the spec file are attached.
log.txt (19.6 KB)
maskrcnn_train_resnet50.txt (2.0 KB)

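It may be due to out-of-memory (OOM). A bare "Killed" with no Python traceback usually means the Linux kernel's OOM killer terminated the process. Could you train with fewer tfrecord files and retry?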
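One way to check whether host memory is the bottleneck is to watch how much memory is available while training starts up. Below is a minimal sketch, assuming a Linux host where /proc/meminfo is readable. The function names are illustrative only and are not part of TAO.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kB; the unit suffix is dropped here.
    """
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(text):
    """Return MemAvailable converted from kB to GiB, or None if missing."""
    kib = parse_meminfo(text).get("MemAvailable")
    return None if kib is None else kib / (1024 ** 2)
```

To use it, read the file with `open("/proc/meminfo").read()` and pass the text to `available_gib`. If the value drops towards zero just before the process is killed, host memory is the likely cause. On most Linux systems the kernel log (`dmesg`) also records an "Out of memory: Killed process" line in that case.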

Yes, that's right. I have now removed some of the tfrecord files and started training again.
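For anyone hitting the same problem, here is a rough sketch of how part of the tfrecord files could be set aside. The directory arguments, the `*.tfrecord` pattern and the `keep_every` parameter are all hypothetical, so adjust them to your own layout and make sure the spec file points at the directory that still holds the kept files.

```python
import shutil
from pathlib import Path


def set_aside_tfrecords(src_dir, holdout_dir, keep_every=2, pattern="*.tfrecord"):
    """Keep every `keep_every`-th file in src_dir; move the rest to holdout_dir.

    Returns (kept, moved): two lists of file names.
    """
    src = Path(src_dir)
    holdout = Path(holdout_dir)
    holdout.mkdir(parents=True, exist_ok=True)
    kept, moved = [], []
    # sorted() materialises the listing first, so moving files is safe.
    for i, path in enumerate(sorted(src.glob(pattern))):
        if i % keep_every == 0:
            kept.append(path.name)
        else:
            shutil.move(str(path), str(holdout / path.name))
            moved.append(path.name)
    return kept, moved
```

The files are moved rather than deleted, so they can be put back once the memory limit is resolved.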
