Boosting NVIDIA MLPerf Training v1.1 Performance with Full Stack Optimization

Originally published at: https://developer.nvidia.com/blog/boosting-mlperf-training-v1-1-performance-with-full-stack-optimization/

In this round of MLPerf Training v1.1, optimizations across the entire stack, including hardware, system software, libraries, and algorithms, continue to boost NVIDIA MLPerf training performance.

Hi, I recently came across this work and think it is very cool! One thing I was wondering: are profile traces for the different training pipelines available anywhere, captured with something like nvprof or TensorBoard? I'm interested in seeing how much time is spent passing data from node to node when training with a large number of nodes (like the 540-node case for ResNet-50).
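
For context, this is roughly the kind of trace I mean. Below is a minimal sketch assuming TensorFlow 2.x and a dummy workload; the logdir path and the loop are just placeholders, not anything taken from the MLPerf code itself:

import tensorflow as tf

logdir = "/tmp/tb_profile"  # hypothetical output directory for the trace
tf.profiler.experimental.start(logdir)

# Stand-in for a few training steps; in a real run this would be the
# training loop whose inter-node communication time I want to inspect.
for _ in range(10):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    tf.matmul(a, b)

tf.profiler.experimental.stop()
# The resulting trace can then be viewed with:
#   tensorboard --logdir /tmp/tb_profile

Viewing that logdir in TensorBoard gives the per-step timeline I am after, ideally including the communication time between nodes.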

Hi,
I am trying to run Minigo on a single RTX 3080 device following the Docker and README files. However, I am hitting some errors that I cannot fix without your help.

I have followed the steps in [1] and everything went fine while setting up the Docker image. When I run the command

CONT="mlperf-nvidia:minigo" DATADIR=`pwd`/../ LOGDIR=`pwd`/../ bash run_with_docker.sh

I get the following error:

+ sync
+ sudo /sbin/sysctl vm.drop_caches=3
vm.drop_caches = 3
+ docker exec -it minigo python -c '
from mlperf_logging.mllog import constants
from mlperf_log_utils import log_event
log_event(key=constants.CACHE_CLEAR, value=True)'
:::MLLOG {"namespace": "", "time_ms": 1649017488579, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/opt/reinforcement/minigo/ml_perf/logger.py", "lineno": 27}}
+ export SEED=618
+ SEED=618
+ docker exec -it --env=CONCURRENT_GAMES --env=DATADIR --env=DGXHT --env=DGXNGPU --env=DGXNNODES --env=DGXNSOCKET --env=DGXSOCKETCORES --env=DGXSYSTEM --env=HOROVOD_CYCLE_TIME --env=HOROVOD_FUSION_THRESHOLD --env=HOROVOD_NUM_STREAMS --env=NUM_GPUS_TRAIN --env=NUM_ITERATIONS --env=PA_INFERENCE --env=PA_SEARCH --env=PROCS_PER_GPU --env=SP_THREADS --env=TF_CPP_MIN_LOG_LEVEL --env=WALLTIME --env=SLURM_NTASKS_PER_NODE minigo bash -c 'mpirun --allow-run-as-root -np 2 ./run_and_time.sh'
./run_and_time.sh: line 52: [: : integer expression expected
./run_and_time.sh: line 52: [: : integer expression expected
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[rtx3080:00159] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[rtx3080:00160] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
Traceback (most recent call last):
  File "./ml_perf/train_loop.py", line 995, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./ml_perf/train_loop.py", line 864, in main
    selfplay_ranks, selfplay_data_transfer_ranks = distribute_mpiranks(rank, size)
  File "./ml_perf/train_loop.py", line 726, in distribute_mpiranks
    node_of_first_selfplay_rank = selfplay_ranks[0] // FLAGS.ranks_per_node
IndexError: list index out of range
Traceback (most recent call last):
  File "./ml_perf/train_loop.py", line 995, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./ml_perf/train_loop.py", line 864, in main
    selfplay_ranks, selfplay_data_transfer_ranks = distribute_mpiranks(rank, size)
  File "./ml_perf/train_loop.py", line 726, in distribute_mpiranks
    node_of_first_selfplay_rank = selfplay_ranks[0] // FLAGS.ranks_per_node
IndexError: list index out of range
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[11367,1],1]
  Exit code:    1
--------------------------------------------------------------------------
+ set -eux
+ cleanup_docker
+ docker container rm -f minigo
minigo

Looking at the code of train_loop.py, I added the following prints:

    selfplay_ranks = [r for r in list(range(0, size)) if r not in (train_ranks + n_train_ranks + data_transfer_ranks)]
    print("selfplay_ranks = ", selfplay_ranks)
    print("size = ", size)
    print("train_ranks = ", train_ranks)
    print("n_train_ranks = ", n_train_ranks)
    print("data_transfer_ranks = ", data_transfer_ranks)

The output shows:

selfplay_ranks =  []
size =  2
train_ranks =  [0]
n_train_ranks =  [0]
data_transfer_ranks =  [1]
selfplay_ranks =  []
size =  2
train_ranks =  [0]
n_train_ranks =  [0]
data_transfer_ranks =  [1]
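
Just to spell out why that fails, here is a minimal sketch based only on the values printed above (the real distribute_mpiranks in train_loop.py has more roles and derives the split from FLAGS such as ranks_per_node rather than hard-coding it):

def split_ranks(size):
    # Values observed in the printed output above.
    train_ranks = [0]
    data_transfer_ranks = [1]
    reserved = set(train_ranks + data_transfer_ranks)
    selfplay_ranks = [r for r in range(size) if r not in reserved]
    return train_ranks, data_transfer_ranks, selfplay_ranks

train, transfer, selfplay = split_ranks(size=2)  # mpirun -np 2, as in the log
print(train, transfer, selfplay)                 # -> [0] [1] []
# selfplay is empty, so indexing selfplay_ranks[0] raises
# IndexError: list index out of range

So with mpirun -np 2 the train and data-transfer roles consume both ranks, nothing is left for selfplay, and selfplay_ranks[0] raises the IndexError shown in the traceback.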

I have also set the following parameters:

## System run parms
export DGXNNODES=1
export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )
export WALLTIME=07:30:00

## System config params
export DGXNGPU=1
export DGXSOCKETCORES=1
export DGXNSOCKET=1
export DGXHT=2 	# HT is on is 2, HT off is 1

Do you have any idea what is going wrong here? Any comments are appreciated.

[1] training_results_v1.1/NVIDIA/benchmarks/minigo/implementations/tensorflow at main · mlcommons/training_results_v1.1 · GitHub