ImageNet hang on DGX-1 when using multiple GPUs.

Hello, I tried running ResNet-50 on ImageNet using the following command:

python /opt/pytorch/examples/imagenet/main.py \
    --arch=resnet50 \
    --epochs=1 \
    --batch-size=64 \
    --lr=0.01 \
    --workers=4 \
    --world-size=2 \
    --dist-backend=gloo \
    --dist-url=file:///workspace/sharedfile \
    --rank=0 \
    --print-freq 10 /workspace/imagenet

The program hangs without launching anything whenever --world-size is greater than 1. The problem is that init_process_group never returns:

dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                        world_size=args.world_size, rank=0)

It is actually calling a C++ function. Any help would be appreciated, thanks.

I also tried a bunch of combinations, as follows, and they all end up hanging.

--dist-url=tcp://10.1.1.20:23456
--dist-url=tcp://127.0.0.1:29500 
--dist-url=file:///workspace/sharedfile 

--dist-backend=tcp
--dist-backend=gloo
--dist-backend=nccl
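
In case it matters: my understanding is that the file:// init method expects /workspace/sharedfile not to exist beforehand, so a leftover file from a crashed run could itself cause a hang. Something like this before each launch would rule that out (just a sketch):

import os

# Rendezvous file used by --dist-url=file:///workspace/sharedfile.
# The file:// init method expects this file to be absent (or empty)
# before the first rank calls init_process_group.
SHARED_FILE = '/workspace/sharedfile'

if os.path.exists(SHARED_FILE):
    os.remove(SHARED_FILE)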

Update: even this simple Python test ends up hanging:

import torch.distributed as dist
dist.init_process_group(backend='gloo', init_method='file:///workspace/sharedfile', world_size=4, rank=0)
print('Hello from process {} (out of {})!'.format(dist.get_rank(), dist.get_world_size()))

There are some utilities included with the container to help launch multi-process/multi-GPU jobs. Could you tell me which container version you are working with? The tooling has changed slightly from release to release.
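
Just to explain the hang itself: init_process_group is a collective call, so with world_size=N it blocks until N processes (ranks 0 through N-1) have all called it with the same init_method. Launching a single process with --world-size=2, or the rank-0-only snippet above, is therefore expected to sit there forever. A quick sanity check along these lines (a sketch using plain multiprocessing, independent of the container utilities; the /tmp path is arbitrary) should print from every rank:

import os
import multiprocessing as mp
import torch.distributed as dist

SHARED_FILE = '/tmp/sharedfile'   # rendezvous file; must not exist beforehand
WORLD_SIZE = 2

def worker(rank):
    # Every rank calls init_process_group; it only returns once all
    # WORLD_SIZE ranks have joined the rendezvous.
    dist.init_process_group(backend='gloo',
                            init_method='file://' + SHARED_FILE,
                            world_size=WORLD_SIZE, rank=rank)
    print('Hello from process {} (out of {})!'.format(
        dist.get_rank(), dist.get_world_size()))

if __name__ == '__main__':
    if os.path.exists(SHARED_FILE):
        os.remove(SHARED_FILE)
    procs = [mp.Process(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()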

You could also manually launch the processes by using a command along the lines of

NGPUS=2; for GPU in `seq 0 $((NGPUS-1))`; do
        python /opt/pytorch/examples/imagenet/main.py \
        --arch=resnet50 \
        --epochs=1 \
        --batch-size=64 \
        --lr=0.01 \
        --workers=4 \
        --world-size=$NGPUS \
        --dist-backend=gloo \
        --dist-url=file:///workspace/sharedfile \
        --rank=$GPU \
        --print-freq 10 /workspace/imagenet &
    done; wait
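
If you prefer to drive it from Python, the same idea can be expressed with subprocess. The paths and flags below are the ones from your command; pinning each rank to its own GPU via CUDA_VISIBLE_DEVICES is my assumption about how you want ranks mapped to devices (again, just a sketch):

import os
import subprocess

NGPUS = 2
procs = []
for gpu in range(NGPUS):
    # One GPU per rank via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ['python', '/opt/pytorch/examples/imagenet/main.py',
         '--arch=resnet50', '--epochs=1', '--batch-size=64', '--lr=0.01',
         '--workers=4', '--world-size={}'.format(NGPUS),
         '--dist-backend=gloo', '--dist-url=file:///workspace/sharedfile',
         '--rank={}'.format(gpu), '--print-freq', '10', '/workspace/imagenet'],
        env=env))
# All ranks must run concurrently, otherwise init_process_group never returns.
for p in procs:
    p.wait()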

Otherwise, depending on which container you’re on, either

CUDA_VISIBLE_DEVICES=0,1 python -m /opt/pytorch/examples/imagenet/multiproc \
        /opt/pytorch/examples/imagenet/main.py \
        --arch=resnet50 \
        --epochs=1 \
        --batch-size=64 \
        --lr=0.01 \
        --workers=4 \
        --dist-backend=gloo \
        --dist-url=file:///workspace/sharedfile \
        --print-freq 10 /workspace/imagenet

or

CUDA_VISIBLE_DEVICES=0,1 python -m apex.parallel.multiproc \
        /opt/pytorch/examples/imagenet/main.py \
        --arch=resnet50 \
        --epochs=1 \
        --batch-size=64 \
        --lr=0.01 \
        --workers=4 \
        --dist-backend=gloo \
        --dist-url=file:///workspace/sharedfile \
        --print-freq 10 /workspace/imagenet

should also work.

@csarofeen, the container tag is 18.05-py2.

Installing apex and using

CUDA_VISIBLE_DEVICES=0,1 python -m apex.parallel.multiproc

solved the problem.

Thanks for helping.