Boosting NVIDIA MLPerf Training v1.1 Performance with Full Stack Optimization

Originally published at: https://developer.nvidia.com/blog/boosting-mlperf-training-v1-1-performance-with-full-stack-optimization/

In this round of MLPerf Training v1.1, optimizations across the entire stack, including hardware, system software, libraries, and algorithms, continue to boost NVIDIA MLPerf training performance.

Hi, I recently came across this work and think it is very cool! One thing I was wondering: are profile traces for the different training pipelines available anywhere, captured with something like nvprof or TensorBoard? I'm interested in seeing how much time is spent passing data from node to node when training with a large number of nodes (like the 540-node case for ResNet-50).
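
For context, this is roughly the kind of trace I mean. Below is a minimal sketch assuming TensorFlow 2.x and a dummy workload; the logdir path and the loop are just placeholders, not anything taken from the MLPerf code itself:

import tensorflow as tf

logdir = "/tmp/tb_profile"  # hypothetical output directory for the trace
tf.profiler.experimental.start(logdir)

# Stand-in for a few training steps; in a real run this would be the
# training loop whose inter-node communication time I want to inspect.
for _ in range(10):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    tf.matmul(a, b)

tf.profiler.experimental.stop()
# The resulting trace can then be viewed with:
#   tensorboard --logdir /tmp/tb_profile

Viewing that logdir in TensorBoard gives the per-step timeline I am after, ideally including the communication time between nodes.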

Hi,
I am trying to run Minigo on a single RTX 3080 device following the Docker and README files. However, I am hitting some errors that I cannot fix without your help.

I have followed the steps in [1] and everything went fine while setting up the Docker image. When I run the command

CONT="mlperf-nvidia:minigo" DATADIR=`pwd`/../ LOGDIR=`pwd`/../ bash run_with_docker.sh

I get the following error:

+ sync
+ sudo /sbin/sysctl vm.drop_caches=3
vm.drop_caches = 3
+ docker exec -it minigo python -c '
from mlperf_logging.mllog import constants
from mlperf_log_utils import log_event
log_event(key=constants.CACHE_CLEAR, value=True)'
:::MLLOG {"namespace": "", "time_ms": 1649017488579, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "/opt/reinforcement/minigo/ml_perf/logger.py", "lineno": 27}}
+ export SEED=618
+ SEED=618
+ docker exec -it --env=CONCURRENT_GAMES --env=DATADIR --env=DGXHT --env=DGXNGPU --env=DGXNNODES --env=DGXNSOCKET --env=DGXSOCKETCORES --env=DGXSYSTEM --env=HOROVOD_CYCLE_TIME --env=HOROVOD_FUSION_THRESHOLD --env=HOROVOD_NUM_STREAMS --env=NUM_GPUS_TRAIN --env=NUM_ITERATIONS --env=PA_INFERENCE --env=PA_SEARCH --env=PROCS_PER_GPU --env=SP_THREADS --env=TF_CPP_MIN_LOG_LEVEL --env=WALLTIME --env=SLURM_NTASKS_PER_NODE minigo bash -c 'mpirun --allow-run-as-root -np 2 ./run_and_time.sh'
./run_and_time.sh: line 52: [: : integer expression expected
./run_and_time.sh: line 52: [: : integer expression expected
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[LOG_CAT_ML] You must specify a valid HCA device by setting:
-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.
If no device was specified for HCOLL (or the calling library), automatic device detection will be run.
In case of unfounded HCA device please contact your system administrator.
[rtx3080:00159] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[rtx3080:00160] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
Traceback (most recent call last):
  File "./ml_perf/train_loop.py", line 995, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./ml_perf/train_loop.py", line 864, in main
    selfplay_ranks, selfplay_data_transfer_ranks = distribute_mpiranks(rank, size)
  File "./ml_perf/train_loop.py", line 726, in distribute_mpiranks
    node_of_first_selfplay_rank = selfplay_ranks[0] // FLAGS.ranks_per_node
IndexError: list index out of range
Traceback (most recent call last):
  File "./ml_perf/train_loop.py", line 995, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./ml_perf/train_loop.py", line 864, in main
    selfplay_ranks, selfplay_data_transfer_ranks = distribute_mpiranks(rank, size)
  File "./ml_perf/train_loop.py", line 726, in distribute_mpiranks
    node_of_first_selfplay_rank = selfplay_ranks[0] // FLAGS.ranks_per_node
IndexError: list index out of range
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[11367,1],1]
  Exit code:    1
--------------------------------------------------------------------------
+ set -eux
+ cleanup_docker
+ docker container rm -f minigo
minigo

Looking at the code of train_loop.py, I added the following prints:

    selfplay_ranks = [r for r in list(range(0, size)) if r not in (train_ranks + n_train_ranks + data_transfer_ranks)]
    print("selfplay_ranks = ", selfplay_ranks)
    print("size = ", size)
    print("train_ranks = ", train_ranks)
    print("n_train_ranks = ", n_train_ranks)
    print("data_transfer_ranks = ", data_transfer_ranks)

The output shows:

selfplay_ranks =  []
size =  2
train_ranks =  [0]
n_train_ranks =  [0]
data_transfer_ranks =  [1]
selfplay_ranks =  []
size =  2
train_ranks =  [0]
n_train_ranks =  [0]
data_transfer_ranks =  [1]
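
Just to spell out why that fails, here is a minimal sketch based only on the values printed above (the real distribute_mpiranks in train_loop.py has more roles and derives the split from FLAGS such as ranks_per_node rather than hard-coding it):

def split_ranks(size):
    # Values observed in the printed output above.
    train_ranks = [0]
    data_transfer_ranks = [1]
    reserved = set(train_ranks + data_transfer_ranks)
    selfplay_ranks = [r for r in range(size) if r not in reserved]
    return train_ranks, data_transfer_ranks, selfplay_ranks

train, transfer, selfplay = split_ranks(size=2)  # mpirun -np 2, as in the log
print(train, transfer, selfplay)                 # -> [0] [1] []
# selfplay is empty, so indexing selfplay_ranks[0] raises
# IndexError: list index out of range

So with mpirun -np 2 the train and data-transfer roles consume both ranks, nothing is left for selfplay, and selfplay_ranks[0] raises the IndexError shown in the traceback.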

I have also set the following parameters:

## System run parms
export DGXNNODES=1
export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )
export WALLTIME=07:30:00

## System config params
export DGXNGPU=1
export DGXSOCKETCORES=1
export DGXNSOCKET=1
export DGXHT=2 	# HT is on is 2, HT off is 1

Do you have any idea what is going wrong here? Any comments are appreciated.

[1] training_results_v1.1/NVIDIA/benchmarks/minigo/implementations/tensorflow at main · mlcommons/training_results_v1.1 · GitHub