Training stopped by itself when training for instance segmentation using mask-rcnn

edit_or · May 11, 2022, 2:35am

I am training Mask R-CNN for instance segmentation on the mapillary-vistas-dataset.
My system details are below.

• Hardware: NVIDIA TITAN RTX (24 GB), Precision 7920 Tower with 32 GB memory
• Network Type: Mask_rcnn, trained on mapillary-vistas-dataset

Training stops by itself. May I know why?

[MaskRCNN] INFO    : Saving checkpoints for 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.step-0.tlt.
Killed
2022-05-11 10:07:21,449 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The whole log file and spec files are attached.
log.txt (19.6 KB)
maskrcnn_train_resnet50.txt (2.0 KB)

Morganh · May 11, 2022, 7:04am

It may be due to the process running out of memory (OOM). Could you use fewer tfrecord files and retry?
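
A bare "Killed" in the log usually means the process received SIGKILL, and the Linux OOM killer is a common sender. One way to check is to look at the host kernel log (for example the output of `dmesg`) for a "Killed process" line. The snippet below is only a minimal sketch: the function name `find_oom_kills` is made up for illustration, and it assumes the kernel log contains the usual "Killed process <pid> (<name>)" wording, which can vary between kernel versions.

```python
import re

# Assumption: the kernel reports an OOM kill with a line containing
# "Killed process <pid> (<name>)". The exact wording can differ by kernel version.
_KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for each kill report in the log text."""
    kills = []
    for line in kernel_log_text.splitlines():
        match = _KILLED_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Example usage on the host (reading dmesg may need elevated privileges):
#   import subprocess
#   text = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print(find_oom_kills(text))
```

If the list comes back non-empty and names the training process, host memory (32 GB here) rather than GPU memory is the likely limit.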

edit_or · May 12, 2022, 8:32am

Yes, true. I have now taken out some of the tfrecord files and started training.
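
For anyone wanting to repeat this step, here is a rough sketch of setting aside part of the tfrecord shards so that fewer are picked up by training. The helper name `set_aside_shards`, the `*.tfrecord` file pattern, and the directory layout are assumptions for illustration, not something taken from the attached spec file. How many shards to keep is a matter of trial and error.

```python
import shutil
from pathlib import Path


def set_aside_shards(src_dir, holding_dir, keep, pattern="*.tfrecord"):
    """Move all but the first `keep` matching shards from src_dir to holding_dir.

    Returns (kept_names, moved_names). Files are moved, not deleted,
    so they can be put back later.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    src = Path(src_dir)
    holding = Path(holding_dir)
    # Sort by name so the selection is the same on every run.
    shards = sorted(p for p in src.glob(pattern) if p.is_file())
    kept, to_move = shards[:keep], shards[keep:]
    holding.mkdir(parents=True, exist_ok=True)
    for shard in to_move:
        shutil.move(str(shard), str(holding / shard.name))
    return [p.name for p in kept], [p.name for p in to_move]
```

The holding directory should sit outside whatever path or pattern the training spec reads from, otherwise the moved shards would still be loaded.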

