Training with PyTorch works locally but gets stuck on AWS

Hello,

We are trying to train our networks, implemented in PyTorch, on AWS.
The same trainings work on our local machine, which has 2 GTX 1080 Ti GPUs.
However, when we use the NVIDIA AMI, with or without the PyTorch container, training constantly gets stuck in a local minimum.
We tried several options:

  • Running on a p3.2xlarge or a g4dn.xlarge instance
  • Using the NGC container or running directly on the AMI
  • Copying the datasets from S3 or mounting them from a snapshot
  • Running on the NVIDIA AMI or the Deep Learning AMI

We also checked the Python package versions and configurations, and we are certain that the code is exactly the same locally and on the cloud. The PyTorch and CUDA versions match on both sides, and we tested two different NVIDIA driver versions with the same results.

Does someone have an idea of how to fix this, or where it could come from?

Thanks in advance!

Are the gradients on your local computer and on AWS identical? I suspect they're not, since the two setups have different numbers of GPUs.

Things you can try to get a handle on this:

  • Use the SAME number of GPUs locally and on AWS, e.g., 0 (CPU only), 1, and 2.
  • Since the problem is with convergence, each step along the way may differ, so it's probably sufficient to look at a few epochs or even one, which shortens debugging time. Try smaller datasets until you root out the problem. You need to establish the cause of the differing loss and gradients.
  • Compare gradients: fix the random number generator seeds on all machines, save your gradients locally, then transfer them to AWS, load them, and compute the relative error against the AWS gradients (see the sketch after this list).
  • BatchNorm statistics are computed per GPU, so expect different behavior with more GPUs. SyncBatchNorm may improve things (see the second sketch below).
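
A minimal sketch of the gradient-comparison step, assuming a single forward/backward pass; the `nn.Linear` model, random inputs, and loss function are toy placeholders for your own network and batch:

```python
import torch
import torch.nn as nn

# Fix all RNG seeds so both machines start from the same weights and data.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

def grads_after_one_step(model, inputs, targets, loss_fn):
    """Run one forward/backward pass; return gradients keyed by parameter name."""
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    return {n: p.grad.detach().cpu().clone()
            for n, p in model.named_parameters() if p.grad is not None}

# Toy stand-ins for the real model and batch (replace with your own).
model = nn.Linear(10, 2)
inputs, targets = torch.randn(8, 10), torch.randint(0, 2, (8,))
grads = grads_after_one_step(model, inputs, targets, nn.CrossEntropyLoss())

# Local machine: save the gradients and copy the file to AWS.
torch.save(grads, "local_grads.pt")

# On AWS: recompute the same step, load the local file, and compare.
# local = torch.load("local_grads.pt")
# for name, g in local.items():
#     rel_err = (grads[name] - g).norm() / (g.norm() + 1e-12)
#     print(f"{name}: relative error = {rel_err:.3e}")
```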

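And a minimal sketch of the SyncBatchNorm conversion mentioned above; the small `Sequential` model is just a stand-in for your network, and note that SyncBatchNorm only synchronizes statistics under DistributedDataParallel, not under nn.DataParallel:

```python
import torch.nn as nn

# Toy model with BatchNorm, standing in for the real network.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so batch statistics are
# synchronized across GPUs instead of being computed per replica.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)
```
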
See:
https://discuss.pytorch.org/t/dataparallel-results-in-a-different-network-compared-to-a-single-gpu-run/28635/4
https://discuss.pytorch.org/t/debugging-dataparallel-no-speedup-and-uneven-memory-allocation/1100/29
https://pytorch.org/docs/master/nn.html#torch.nn.SyncBatchNorm

Hi @ncatta, have you been able to solve this? It sounds exactly like the kind of issue I'm having while training NVIDIA's Tacotron2 sample.