Training with PyTorch works locally but gets stuck on AWS

Hello,

We are trying to train our networks, implemented in PyTorch, on AWS.
The same trainings work on our local machine, which has 2 GTX 1080 Ti GPUs.
However, when we use the NVIDIA AMI, with or without the PyTorch container, training constantly gets stuck in a local minimum.
We tried several options:

  • Running on a p3.2xlarge or a g4dn.xlarge instance
  • Using the NGC container or running directly on the AMI
  • Copying the datasets from S3 or mounting them from a snapshot
  • Running on the NVIDIA AMI or the Deep Learning AMI

We also checked the Python package versions and configurations, and we are certain that the code is exactly the same locally and on the cloud. The PyTorch and CUDA versions match on both sides, and we tested two different NVIDIA driver versions with the same results.

Does someone have an idea of how to fix this, or where it could come from?

Thanks in advance!

Are the gradients on your local computer and on AWS identical? I suspect they're not, since the two setups have different numbers of GPUs.

Things you can try to get a handle on this:

  • Use the SAME number of GPUs locally and on AWS, e.g., 0 (CPU only), 1, and 2.
  • Since the problem is with convergence, each step along the way may differ, so it's probably sufficient to look at a few epochs or even one, which shortens debugging time. Try smaller datasets until you root out the problem. You need to establish the cause of the differing loss and gradients.
  • Compare gradients: fix the random number generator seeds on all machines, save your gradients locally, then transfer them to AWS, load them, and compute the relative error against the AWS gradients (see the sketch after this list).
  • BatchNorm statistics are computed per GPU, so expect different behavior with more GPUs. SyncBatchNorm may improve things (see the second sketch below).
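
A minimal sketch of the gradient-comparison step, assuming a single forward/backward pass; the `nn.Linear` model, random inputs, and loss function are toy placeholders for your own network and batch:

```python
import torch
import torch.nn as nn

# Fix all RNG seeds so both machines start from the same weights and data.
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)

def grads_after_one_step(model, inputs, targets, loss_fn):
    """Run one forward/backward pass; return gradients keyed by parameter name."""
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    return {n: p.grad.detach().cpu().clone()
            for n, p in model.named_parameters() if p.grad is not None}

# Toy stand-ins for the real model and batch (replace with your own).
model = nn.Linear(10, 2)
inputs, targets = torch.randn(8, 10), torch.randint(0, 2, (8,))
grads = grads_after_one_step(model, inputs, targets, nn.CrossEntropyLoss())

# Local machine: save the gradients and copy the file to AWS.
torch.save(grads, "local_grads.pt")

# On AWS: recompute the same step, load the local file, and compare.
# local = torch.load("local_grads.pt")
# for name, g in local.items():
#     rel_err = (grads[name] - g).norm() / (g.norm() + 1e-12)
#     print(f"{name}: relative error = {rel_err:.3e}")
```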

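And a minimal sketch of the SyncBatchNorm conversion mentioned above; the small `Sequential` model is just a stand-in for your network, and note that SyncBatchNorm only synchronizes statistics under DistributedDataParallel, not under nn.DataParallel:

```python
import torch.nn as nn

# Toy model with BatchNorm, standing in for the real network.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so batch statistics are
# synchronized across GPUs instead of being computed per replica.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)
```
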
See:
https://discuss.pytorch.org/t/dataparallel-results-in-a-different-network-compared-to-a-single-gpu-run/28635/4
https://discuss.pytorch.org/t/debugging-dataparallel-no-speedup-and-uneven-memory-allocation/1100/29
https://pytorch.org/docs/master/nn.html#torch.nn.SyncBatchNorm

Hi @ncatta, have you been able to solve this? It sounds exactly like the kind of issue I'm having while training NVIDIA's Tacotron2 sample.