Hello, I tried running ResNet-50 on ImageNet using the following command:
python /opt/pytorch/examples/imagenet/main.py \
--arch=resnet50 \
--epochs=1 \
--batch-size=64 \
--lr=0.01 \
--workers=4 \
--world-size=2 \
--dist-backend=gloo \
--dist-url=file:///workspace/sharedfile \
--rank=0 \
--print-freq 10 /workspace/imagenet
The program hangs without launching anything whenever the --world-size value is greater than 1. The problem is that init_process_group never returns:
dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                        world_size=args.world_size, rank=0)
It ends up calling into a C++ function. Any help would be appreciated, thanks.
I also tried a number of combinations, as follows, and they all end up hanging:
--dist-url=tcp://10.1.1.20:23456
--dist-url=tcp://127.0.0.1:29500
--dist-url=file:///workspace/sharedfile
--dist-backend=tcp
--dist-backend=gloo
--dist-backend=nccl
Update: even this simple Python test ends up hanging:
import torch.distributed as dist
dist.init_process_group(backend='gloo', init_method='file:///workspace/sharedfile', world_size=4, rank=0)
print('Hello from process {} (out of {})!'.format(dist.get_rank(), dist.get_world_size()))
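For what it's worth, init_process_group is a blocking rendezvous: it only returns once all world_size ranks have joined, so a single process started with world_size=4 will wait forever by design. A minimal sketch (not from the thread; the temp-file path is my own, used because a stale shared file from an earlier run can itself cause a hang) that spawns all four ranks locally:

```python
import os
import tempfile

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size, init_file):
    # Each rank joins the same file:// rendezvous; the call returns only
    # once all `world_size` ranks have checked in.
    dist.init_process_group(backend='gloo',
                            init_method='file://' + init_file,
                            world_size=world_size,
                            rank=rank)
    print('Hello from process {} (out of {})!'.format(
        dist.get_rank(), dist.get_world_size()))
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 4
    # Use a fresh rendezvous file for every run; reusing an old one can
    # leave init_process_group waiting indefinitely.
    init_file = os.path.join(tempfile.mkdtemp(), 'sharedfile')
    procs = [mp.Process(target=worker, args=(rank, world_size, init_file))
             for rank in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With all four processes participating, the rendezvous completes and each rank prints its greeting instead of hanging.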
There are some utilities included with the container to help launch multi-process/multi-GPU jobs. Could you tell me which container version you are working with? The utilities have changed slightly from release to release.
You could also launch the processes manually with a command along the lines of:
NGPUS=2
# Ranks must run from 0 to NGPUS-1, and each process must be backgrounded
# with & so they can all reach the rendezvous concurrently.
for GPU in $(seq 0 $((NGPUS - 1))); do
python /opt/pytorch/examples/imagenet/main.py \
--arch=resnet50 \
--epochs=1 \
--batch-size=64 \
--lr=0.01 \
--workers=4 \
--world-size=$NGPUS \
--dist-backend=gloo \
--dist-url=file:///workspace/sharedfile \
--rank=$GPU \
--print-freq 10 /workspace/imagenet &
done
wait
Otherwise, depending on which container you’re on,
CUDA_VISIBLE_DEVICES=0,1 python /opt/pytorch/examples/imagenet/multiproc.py \
/opt/pytorch/examples/imagenet/main.py \
--arch=resnet50 \
--epochs=1 \
--batch-size=64 \
--lr=0.01 \
--workers=4 \
--dist-backend=gloo \
--dist-url=file:///workspace/sharedfile \
--print-freq 10 /workspace/imagenet
or
CUDA_VISIBLE_DEVICES=0,1 python -m apex.parallel.multiproc
/opt/pytorch/examples/imagenet/main.py \
--arch=resnet50 \
--epochs=1 \
--batch-size=64 \
--lr=0.01 \
--workers=4 \
--dist-backend=gloo \
--dist-url=file:///workspace/sharedfile \
--print-freq 10 /workspace/imagenet
should also work.
@csarofeen , the container tag is 18.05-py2.
Installing apex and using
CUDA_VISIBLE_DEVICES=0,1 python -m apex.parallel.multiproc
solved the problem.
Thanks for the help.