Cannot reshape a tensor with 25690112 elements to shape [256,256,14,14]
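
As a quick sanity check on the numbers in the error message above, here is a minimal sketch that compares the tensor's element count with the number of elements the target shape can hold. The variable names are only illustrative. The snippet verifies the arithmetic and nothing more; it does not identify which setting in the spec file produces the mismatch.

```python
from math import prod

# Values copied from the error message in the thread title.
actual_elements = 25690112
target_shape = [256, 256, 14, 14]

# Number of elements the target shape can hold.
expected_elements = prod(target_shape)

# How many times the target shape fits into the actual tensor.
ratio, remainder = divmod(actual_elements, expected_elements)

print(f"target shape holds {expected_elements} elements")
print(f"actual / expected = {ratio} (remainder {remainder})")
```

The tensor holds exactly twice as many elements as the target shape, so the mismatch is a clean factor of 2 rather than an arbitrary size difference.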

No, it is still the same.
I set num_examples_per_epoch:500, but the problem persists.
The correct value should be 4500, because 18000 / 4 GPUs = 4500.
Even with the much smaller value of 500, training still fails.
My memory is 256M, so that is not the issue.
Memory usage never went above 50%.
But CPU usage reached 1000% or more before the process was killed.
You can see this in the attached file.
I used to train with 4 GPUs and bs 4 before TAO was updated to the latest version.
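
The per-GPU value mentioned in this post (18000 / 4 GPUs = 4500) can be sketched as a small helper. The function name `examples_per_gpu` is hypothetical and not part of TAO; it only encodes the poster's rule of dividing the dataset size by the number of GPUs.

```python
def examples_per_gpu(total_examples: int, num_gpus: int) -> int:
    """Split the dataset size evenly across GPUs using integer division."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_examples // num_gpus


# Dataset size and GPU counts taken from the thread.
four_gpu_value = examples_per_gpu(18000, 4)
one_gpu_value = examples_per_gpu(18000, 1)

print(four_gpu_value)  # 4500
print(one_gpu_value)   # 18000
```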

Can you double check? Just use your current spec file and run training with the 21.11-tf1.15.5 docker.

If I run
tao mask_rcnn run /bin/bash
the latest installed version, 3.22.05, is activated.
How do I run v3.21.11-tf1.15.5-py3?

You can start the container directly with "docker run xxx".
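
As a rough sketch of what such a "docker run" invocation might look like, the snippet below only composes and prints the command; it does not execute it. Only the tag v3.21.11-tf1.15.5-py3 comes from this thread. The registry path nvcr.io/nvidia/tao/tao-toolkit-tf, the mounted directories, and the helper name `build_docker_command` are assumptions for illustration and should be checked against your own setup.

```python
import shlex

# Assumption: registry path of the TAO TensorFlow image.
# Only the tag (v3.21.11-tf1.15.5-py3) is taken from the thread.
IMAGE = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3"


def build_docker_command(host_dir: str, container_dir: str) -> str:
    """Compose a docker run command that opens a shell in the older image."""
    args = [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                      # expose all GPUs to the container
        "-v", f"{host_dir}:{container_dir}",  # mount specs and data (placeholder paths)
        IMAGE,
        "/bin/bash",
    ]
    return shlex.join(args)


# Placeholder paths; replace them with your own directories.
command = build_docker_command("/path/to/experiments", "/workspace/experiments")
print(command)
```

Pinning the image tag explicitly this way avoids the launcher picking up the latest installed version (3.22.05).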

Sorry, I can't.
It is still the same error.
I can run on 1 GPU with bs 2, evaluation_size 2 and num_examples_per_epoch:18000.

But it failed with the following error when I changed to 4 GPUs with bs 2, evaluation_size 2 and num_examples_per_epoch:4500:

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 2ba4e56ebcdf exited on signal 9 (Killed).
--------------------------------------------------------------------------
2022-06-25 10:20:05,764 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

So 21.11 and 22.05 give the same result, right?
Do you remember how you got 4 GPUs running well?
Was there any change to the dataset or spec file?

At that time I was training on the same Mapillary dataset. The mistake was that I was using config 1.2 for dataset version 2.0.
Now I have changed to the correct config 2.0 for dataset version 2.0.
Since then I can't use multiple GPUs in training.
With 1 GPU, training can finish with bs 2 and num_examples_per_epoch:18000.

With multiple GPUs, training always fails with the same bs 2.

Please try some more experiments.

  • Set a lower image_size.
    For example,
    image_size: "(640, 960)"

Yes, I am concerned about accuracy, but let me try.

Hello @edit_or, do you have any update, or should we close this topic? Thanks.

Yes, you can close it. Thanks for the support.