ImageNet hang on DGX-1 when using multiple GPUs.

Hello, I tried running ResNet-50 on ImageNet using the following command:

python /opt/pytorch/examples/imagenet/main.py \
    --arch=resnet50 \
    --epochs=1 \
    --batch-size=64 \
    --lr=0.01 \
    --workers=4 \
    --world-size=2 \
    --dist-backend=gloo \
    --dist-url=file:///workspace/sharedfile \
    --rank=0 \
    --print-freq 10 /workspace/imagenet

The program hangs without launching anything whenever --world-size is greater than 1. The problem is that init_process_group never returns:

dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                        world_size=args.world_size, rank=0)

It is actually calling a C++ function. Any help would be appreciated, thanks.

I also tried a bunch of combinations, as follows, and they all end up hanging.

--dist-url=tcp://10.1.1.20:23456
--dist-url=tcp://127.0.0.1:29500 
--dist-url=file:///workspace/sharedfile 

--dist-backend=tcp
--dist-backend=gloo
--dist-backend=nccl
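
In case it matters: my understanding is that the file:// init method expects /workspace/sharedfile not to exist beforehand, so a leftover file from a crashed run could itself cause a hang. Something like this before each launch would rule that out (just a sketch):

import os

# Rendezvous file used by --dist-url=file:///workspace/sharedfile.
# The file:// init method expects this file to be absent (or empty)
# before the first rank calls init_process_group.
SHARED_FILE = '/workspace/sharedfile'

if os.path.exists(SHARED_FILE):
    os.remove(SHARED_FILE)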

Update: even this simple Python test ends up hanging:

import torch.distributed as dist
dist.init_process_group(backend='gloo', init_method='file:///workspace/sharedfile', world_size=4, rank=0)
print('Hello from process {} (out of {})!'.format(dist.get_rank(), dist.get_world_size()))

There are some utilities included with the container to help launch multi-process/multi-GPU jobs. Could you tell me which container version you are working with? The tooling has changed slightly from release to release.
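
Just to explain the hang itself: init_process_group is a collective call, so with world_size=N it blocks until N processes (ranks 0 through N-1) have all called it with the same init_method. Launching a single process with --world-size=2, or the rank-0-only snippet above, is therefore expected to sit there forever. A quick sanity check along these lines (a sketch using plain multiprocessing, independent of the container utilities; the /tmp path is arbitrary) should print from every rank:

import os
import multiprocessing as mp
import torch.distributed as dist

SHARED_FILE = '/tmp/sharedfile'   # rendezvous file; must not exist beforehand
WORLD_SIZE = 2

def worker(rank):
    # Every rank calls init_process_group; it only returns once all
    # WORLD_SIZE ranks have joined the rendezvous.
    dist.init_process_group(backend='gloo',
                            init_method='file://' + SHARED_FILE,
                            world_size=WORLD_SIZE, rank=rank)
    print('Hello from process {} (out of {})!'.format(
        dist.get_rank(), dist.get_world_size()))

if __name__ == '__main__':
    if os.path.exists(SHARED_FILE):
        os.remove(SHARED_FILE)
    procs = [mp.Process(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()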

You could also manually launch the processes by using a command along the lines of

NGPUS=2; for GPU in `seq 0 $((NGPUS-1))`; do
        python /opt/pytorch/examples/imagenet/main.py \
        --arch=resnet50 \
        --epochs=1 \
        --batch-size=64 \
        --lr=0.01 \
        --workers=4 \
        --world-size=$NGPUS \
        --dist-backend=gloo \
        --dist-url=file:///workspace/sharedfile \
        --rank=$GPU \
        --print-freq 10 /workspace/imagenet &
    done; wait
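
If you prefer to drive it from Python, the same idea can be expressed with subprocess. The paths and flags below are the ones from your command; pinning each rank to its own GPU via CUDA_VISIBLE_DEVICES is my assumption about how you want ranks mapped to devices (again, just a sketch):

import os
import subprocess

NGPUS = 2
procs = []
for gpu in range(NGPUS):
    # One GPU per rank via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ['python', '/opt/pytorch/examples/imagenet/main.py',
         '--arch=resnet50', '--epochs=1', '--batch-size=64', '--lr=0.01',
         '--workers=4', '--world-size={}'.format(NGPUS),
         '--dist-backend=gloo', '--dist-url=file:///workspace/sharedfile',
         '--rank={}'.format(gpu), '--print-freq', '10', '/workspace/imagenet'],
        env=env))
# All ranks must run concurrently, otherwise init_process_group never returns.
for p in procs:
    p.wait()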

Otherwise, depending on which container you’re on, either

CUDA_VISIBLE_DEVICES=0,1 python -m /opt/pytorch/examples/imagenet/multiproc \
        /opt/pytorch/examples/imagenet/main.py \
        --arch=resnet50 \
        --epochs=1 \
        --batch-size=64 \
        --lr=0.01 \
        --workers=4 \
        --dist-backend=gloo \
        --dist-url=file:///workspace/sharedfile \
        --print-freq 10 /workspace/imagenet

or

CUDA_VISIBLE_DEVICES=0,1 python -m apex.parallel.multiproc \
        /opt/pytorch/examples/imagenet/main.py \
        --arch=resnet50 \
        --epochs=1 \
        --batch-size=64 \
        --lr=0.01 \
        --workers=4 \
        --dist-backend=gloo \
        --dist-url=file:///workspace/sharedfile \
        --print-freq 10 /workspace/imagenet

should also work.

@csarofeen, the container tag is 18.05-py2.

Installing apex and using

CUDA_VISIBLE_DEVICES=0,1 python -m apex.parallel.multiproc

solved the problem.

Thanks for helping.