Performance degradation across multiple nodes

System Configuration
2 servers, each with 4x V100-16GB-SXM2 (8 GPUs total)
Framework: TensorFlow
Framework version: TF 1.4
Horovod version: 0.18.2 via Horovod in docker
MPI version: 4.0.0
CUDA version: 10.0
NCCL version: 2.5.6
Python version: 2.7
OS and version: Ubuntu 18.04
GCC version: 4.8
CUDNN version: 7.6.5
Mellanox OFED version: 4.7-3.2.9.0
GPUDirect RDMA: nvidia-peer-memory 1.0-8

Hi all, I am running tf_benchmarks, and the scaling efficiency drops from ~96-98% on a single node to ~86-87% once the job spans both nodes. In addition, enabling GPUDirect RDMA gives no performance boost.
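
For context, below is a minimal sketch of the standard Horovod + TF 1.x integration that this kind of benchmark script relies on (placeholder model and optimizer, not the actual tf_benchmarks code). The gradient allreduce inside hvd.DistributedOptimizer is the step that has to cross the inter-node link once the job spans both servers, which is where I'd expect the efficiency drop to come from.

```python
# Minimal sketch of a Horovod + TF 1.x data-parallel training step
# (placeholder model; not the actual tf_benchmarks code).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to one local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for ResNet-50 at batch size 64.
images = tf.random_normal([64, 224, 224, 3])
logits = tf.layers.dense(tf.layers.flatten(images), 1000)
labels = tf.one_hot(tf.zeros([64], dtype=tf.int32), 1000)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

# DistributedOptimizer allreduces gradients (via NCCL) every step;
# on 8 GPUs across 2 nodes this allreduce crosses the inter-node fabric.
opt = hvd.DistributedOptimizer(tf.train.MomentumOptimizer(0.01, momentum=0.9))
train_op = opt.minimize(loss)

# Broadcast rank 0's initial weights so all workers start identical.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(10):
        sess.run(train_op)
```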

ResNet-50 | Batch size = 64 | GPUDirect RDMA enabled:
Single node - 1x GPU: ~731 img/sec
Single node - 4x GPUs: ~2884 img/sec (~96% scaling efficiency)
Multi-node - 8x GPUs: ~5048 img/sec (~86% scaling efficiency)

ResNet-50 | Batch size = 64 | GPUDirect RDMA disabled:
Single node - 1x GPU: ~733 img/sec
Single node - 4x GPUs: ~2865 img/sec (~98% scaling efficiency)
Multi-node - 8x GPUs: ~5081 img/sec (~87% scaling efficiency)
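
The scaling-efficiency figures above are computed the usual way (multi-GPU throughput divided by N times the single-GPU throughput); a quick sanity check for the multi-node rows, assuming that definition:

```python
def scaling_efficiency(multi_gpu_ips, single_gpu_ips, num_gpus):
    """Fraction of ideal linear scaling achieved (img/sec based)."""
    return multi_gpu_ips / (num_gpus * single_gpu_ips)

# 8 GPUs, GPUDirect enabled:  5048 / (8 * 731) ~= 0.863 -> ~86%
print(round(scaling_efficiency(5048.0, 731.0, 8), 3))
# 8 GPUs, GPUDirect disabled: 5081 / (8 * 733) ~= 0.866 -> ~87%
print(round(scaling_efficiency(5081.0, 733.0, 8), 3))
```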

This is outside my area of expertise, but I think the following additional data would make it easier for a third party to give advice:

(1) Hardware specifications of the nodes: Server vendor and model, CPU(s), system memory size & speed, mass storage
(2) Hardware specifications of the inter-node interconnect. Mellanox offers a multitude of different products in different performance classes.