So, can you try 3 GPUs?
3 GPUs also worked.
Please check nvidia-smi again.
Can you kill the job and run with 4 GPUs again?
$ sudo kill -9 173358 173519 173520 173521
Still the same error.
Can you try the experiments below?
case 1
mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_3gpu -k nvidia_tlt --gpus 3 --gpu_index 0 1 2
case 2
mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_3gpu -k nvidia_tlt --gpus 3 --gpu_index 0 1 3
case 3
mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_3gpu -k nvidia_tlt --gpus 3 --gpu_index 1 2 3
All worked.
How about
mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_4gpu_new -k nvidia_tlt --gpus 4 --gpu_index 1 2 3 4
Still the same errors.
Is it a memory issue?
You can set the batch size to 1 and retry:
train_batch_size: 1
eval_batch_size: 1
or use fewer tfrecords:
training_file_pattern: "/workspace/tao-experiments/data/mask_rcnn/data/train*.tfrecord"
validation_file_pattern: "/workspace/tao-experiments/data/mask_rcnn/data/val*.tfrecord"
Yes, it works with
train_batch_size: 1
eval_batch_size: 1
for 4 GPUs.
It is strange; I used to train with 4 GPUs with
train_batch_size: 4
eval_batch_size: 4
for the same dataset.
Now I can't.
Try a lower num_examples_per_epoch.
That could be the reason. Let me try.
I have 18,000 training images.
What would be an appropriate value to set for num_examples_per_epoch?
I'd like to train on 4 GPUs.
I suggest you open a new terminal to monitor the CPU memory.
First, reproduce the above error by running with 4 GPUs.
Then, decrease num_examples_per_epoch until the issue is no longer reproduced.
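To monitor CPU memory in the second terminal, a minimal sketch (assuming a Linux host, as used for TAO training) that polls MemAvailable from /proc/meminfo; the sample count and interval are arbitrary:

```python
# Minimal sketch: poll available CPU memory while the training job runs.
# Assumes Linux (/proc/meminfo); adjust the sample count/interval as needed.
import time

def available_mem_mb():
    """Return MemAvailable from /proc/meminfo, converted from kB to MB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # kB -> MB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

if __name__ == "__main__":
    for _ in range(3):  # take a few samples while training runs
        print(f"available: {available_mem_mb()} MB")
        time.sleep(1)
```

Alternatively, `watch -n 1 free -h` in the second terminal gives the same information. If available memory drops steadily toward zero before the error, that supports the out-of-memory theory.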
ok thanks
It doesn't affect the number of images used for training, right?
More epochs will be trained when num_examples_per_epoch is smaller.
The num_examples_per_epoch should be the total number of images in the training set divided by the number of GPUs.
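Applying that rule to the numbers mentioned in this thread (18,000 training images on 4 GPUs):

```python
# Worked example of the suggested rule:
# num_examples_per_epoch = total training images / number of GPUs.
total_images = 18000  # size of the training set mentioned above
num_gpus = 4

num_examples_per_epoch = total_images // num_gpus
print(num_examples_per_epoch)  # 4500
```

With this setting, each "epoch" covers one GPU's share of the data, so the full dataset is still consumed across the GPUs; it just takes more epochs to make the same number of passes over all 18,000 images.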