We are trying to train our networks, implemented in PyTorch, on AWS.
Training works fine on our local machine, which has two GTX 1080 Ti GPUs.
However, when we run on the NVIDIA AMI, with or without the PyTorch container, training consistently gets stuck in a local minimum.
We tried several options:
- Running on p3.2xlarge or g4dn.xlarge instances
- Using the NGC container or running directly on the AMI
- Copying the datasets from S3 or mounting them from a snapshot
- Running on the NVIDIA AMI or the AWS Deep Learning AMI
We also checked the Python package versions and configurations.
We are also certain the code is exactly the same locally and on the cloud.
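For anyone who wants to reproduce that check, here is a minimal sketch of one way to verify it (the directory layout and `.py`-only filter are assumptions; it just hashes every source file in a stable order on both machines and compares the digests):

```python
import hashlib
from pathlib import Path

def tree_hash(root: str) -> str:
    """Hash all .py files under root in a deterministic order."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*.py")):
        h.update(path.read_bytes())
    return h.hexdigest()

# Run locally and on the instance; the digests should match exactly.
print(tree_hash("."))
```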
The PyTorch and CUDA versions are the same locally and on the cloud.
We tested two different versions of the NVIDIA drivers, with the same results.
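For completeness, a minimal sketch of the kind of environment dump we compared on both machines (the `nvidia-smi` query assumes the CLI is on the PATH):

```python
import subprocess
import torch

# Print the versions that matter for reproducibility; run this on both
# the local machine and the cloud instance and diff the output.
print("torch       :", torch.__version__)
print("CUDA (torch):", torch.version.cuda)
print("cuDNN       :", torch.backends.cudnn.version())
print("GPU         :", torch.cuda.get_device_name(0)
      if torch.cuda.is_available() else "none")

# Driver version via nvidia-smi
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("driver      :", driver)
```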
Does anyone have an idea how to fix this, or where it could come from?
Thanks in advance!