Cannot reshape a tensor with 25690112 elements to shape [256,256,14,14]
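As a quick sanity check on the error itself (my own arithmetic, not from the logs): the failing element count is exactly twice the size of the target shape, which is consistent with a batch- or shard-dimension mismatch rather than a corrupt tensor.

```python
# Sanity check on the reshape error.
# Hypothesis (not confirmed in the thread): the 2x factor points to a
# batch/shard mismatch across GPUs.
target_elems = 256 * 256 * 14 * 14   # elements in [256, 256, 14, 14]
failing_elems = 25690112             # count reported by the error

print(target_elems)                  # 12845056
print(failing_elems // target_elems) # 2 -- exactly double the target
```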

So, can you try 3 GPUs?

3 GPUs also worked.

Please check nvidia-smi again.

Can you kill the job and run with 4 GPUs again?
$ sudo kill -9 173358 173519 173520 173521

Still the same error.

Can you try the experiments below?
case1
# mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_3gpu -k nvidia_tlt --gpus 3 --gpu_index 0 1 2

case2
# mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_3gpu -k nvidia_tlt --gpus 3 --gpu_index 0 1 3

case3
# mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_3gpu -k nvidia_tlt --gpus 3 --gpu_index 1 2 3

All worked.

How about
mask_rcnn train -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned_4gpu_new -k nvidia_tlt --gpus 4 --gpu_index 1 2 3 4

Still the same errors.
Is it a memory issue?

You can set the batch size to 1 and retry.
train_batch_size: 1
eval_batch_size: 1

Or use fewer tfrecords.
training_file_pattern: "/workspace/tao-experiments/data/mask_rcnn/data/train*.tfrecord"
validation_file_pattern: "/workspace/tao-experiments/data/mask_rcnn/data/val*.tfrecord"

Yes, it works with

train_batch_size: 1
eval_batch_size: 1

for 4 GPUs.

It is strange; I used to train with 4 GPUs with

train_batch_size: 4
eval_batch_size: 4

for the same dataset.
Now it can't.

Try a lower num_examples_per_epoch.

That could be the reason. Let me try.

I have 18,000 training images.
What could be appropriate size to set at num_examples_per_epoch?
I'd like to train on 4 GPUs.

I suggest you open a new terminal to monitor the CPU memory.
First, reproduce the above error when running with 4 GPUs.
Then, decrease num_examples_per_epoch until the issue is no longer reproduced.
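To watch CPU memory alongside the training run, a minimal sketch (assuming a Linux host, where `/proc/meminfo` is available) is to poll `MemAvailable` in a separate terminal so any steady drop is visible before the job fails:

```python
# Minimal CPU-memory monitor sketch (assumption: Linux, /proc/meminfo exists).
# Run in a second terminal while the 4-GPU training job is active.
import re
import time

def available_mb(meminfo_text):
    """Parse the 'MemAvailable: NNN kB' line out of /proc/meminfo text."""
    m = re.search(r"MemAvailable:\s+(\d+)\s+kB", meminfo_text)
    return int(m.group(1)) // 1024 if m else None

if __name__ == "__main__":
    # Poll every 5 seconds; stop with Ctrl-C.
    while True:
        with open("/proc/meminfo") as f:
            print(f"MemAvailable: {available_mb(f.read())} MB")
        time.sleep(5)
```

`watch -n 2 free -m` in a shell gives the same view with no script; the snippet just makes the parsing explicit.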

OK, thanks.

It doesn't affect the number of images used for training, right?
With a smaller num_examples_per_epoch, more epochs will be trained.

The num_examples_per_epoch should be the total number of images in the training set divided by the number of GPUs.
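Applying that rule to the numbers in this thread gives a concrete starting value (a sketch of the arithmetic only, not a guaranteed fix for the reshape error):

```python
# Worked example with the numbers from this thread: 18,000 training images
# on 4 GPUs, per the rule "total images divided by number of GPUs".
total_train_images = 18000
num_gpus = 4

num_examples_per_epoch = total_train_images // num_gpus
print(num_examples_per_epoch)  # 4500
```

With this value, each GPU still sees its share of all 18,000 images; only the bookkeeping of what counts as one "epoch" changes.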