GTC 2020: Optimization Strategies for Large-Scale DL Training Workloads: Case Study with RN50 on DGX Clusters

GTC 2020 S21733
Presenters: Mohammad Zulfiqar, NVIDIA; Joshua Mora Acosta, NVIDIA
Abstract
Our tutorial presents a set of optimizations for large-scale DL training workloads. We'll cover performance metrics and performance modeling of the deep-learning neural network as the run scales, details of executions at large scale, and the performance of hardware subsystems and software layers. These are paired with profiling tools (NVPROF, NSYS), NVTX tagging, profile-logging considerations, and parsing, visualizing, and analyzing the profiled information (for example, trade-offs) to identify opportunities to improve performance at large scale and to guide and prioritize the optimization efforts. We'll showcase those optimization strategies on training RN50 on large clusters of DGX-1 and DGX-2 machines with up to 1,500 GPUs, where they delivered a 2x performance improvement on the same amount of hardware. You should be familiar with hardware, software, clusters, MPI, NCCL, profiling, deep-learning training, HPC, and performance metrics.
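
For anyone new to NVTX tagging, here is a minimal sketch of the idea: annotating the phases of a training loop so they show up as named ranges in an Nsight Systems timeline. This is illustrative only and assumes a generic PyTorch ResNet-50 loop with placeholder data; it is not the instrumentation used in the session.

```python
# Minimal sketch: NVTX range annotation of a PyTorch training loop, so that
# forward/backward/optimizer phases appear as named ranges in nsys timelines.
# Illustrative only; the model, batch, and range names are placeholders.
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for step in range(10):
    # Placeholder batch; a real run would pull from a DataLoader.
    images = torch.randn(32, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (32,), device="cuda")

    torch.cuda.nvtx.range_push(f"step_{step}")

    torch.cuda.nvtx.range_push("forward")
    loss = criterion(model(images), labels)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close the per-step range
```

A profile with these ranges can then be collected with something like `nsys profile --trace=cuda,nvtx -o report python train.py` and inspected in the Nsight Systems GUI.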


Feel free to post your feedback and questions here so everyone can learn. I'll do my best to answer them promptly.