Performance degradation across multiple nodes

System Configuration
2 servers, each with 4x V100-16GB-SXM2 (8 GPUs total)
Framework: TensorFlow
Framework version: TF 1.4
Horovod version: 0.18.2 via Horovod in docker
MPI version: 4.0.0
CUDA version: 10.0
NCCL version: 2.5.6
Python version: 2.7
OS and version: Ubuntu 18.04
GCC version: 4.8
CUDNN version: 7.6.5
Mellanox OFED version: 4.7-3.2.9.0
GPUDirect RDMA: nvidia-peer-memory 1.0-8

Hi all, I am running tf_benchmarks, and the scaling efficiency drops from ~96-98% on a single node to ~86-87% once the job spans both nodes. In addition, enabling GPUDirect RDMA gives no performance boost.
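
For context, below is a minimal sketch of the standard Horovod + TF 1.x integration that this kind of benchmark script relies on (placeholder model and optimizer, not the actual tf_benchmarks code). The gradient allreduce inside hvd.DistributedOptimizer is the step that has to cross the inter-node link once the job spans both servers, which is where I'd expect the efficiency drop to come from.

```python
# Minimal sketch of a Horovod + TF 1.x data-parallel training step
# (placeholder model; not the actual tf_benchmarks code).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to one local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for ResNet-50 at batch size 64.
images = tf.random_normal([64, 224, 224, 3])
logits = tf.layers.dense(tf.layers.flatten(images), 1000)
labels = tf.one_hot(tf.zeros([64], dtype=tf.int32), 1000)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

# DistributedOptimizer allreduces gradients (via NCCL) every step;
# on 8 GPUs across 2 nodes this allreduce crosses the inter-node fabric.
opt = hvd.DistributedOptimizer(tf.train.MomentumOptimizer(0.01, momentum=0.9))
train_op = opt.minimize(loss)

# Broadcast rank 0's initial weights so all workers start identical.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(10):
        sess.run(train_op)
```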

ResNet-50 | Batch size = 64 | GPUDirect RDMA enabled:
Single node - 1x GPU: ~731 img/sec
Single node - 4x GPUs: ~2884 img/sec (~96% scaling efficiency)
Multi-node - 8x GPUs: ~5048 img/sec (~86% scaling efficiency)

ResNet-50 | Batch size = 64 | GPUDirect RDMA disabled:
Single node - 1x GPU: ~733 img/sec
Single node - 4x GPUs: ~2865 img/sec (~98% scaling efficiency)
Multi-node - 8x GPUs: ~5081 img/sec (~87% scaling efficiency)
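
The scaling-efficiency figures above are computed the usual way (multi-GPU throughput divided by N times the single-GPU throughput); a quick sanity check for the multi-node rows, assuming that definition:

```python
def scaling_efficiency(multi_gpu_ips, single_gpu_ips, num_gpus):
    """Fraction of ideal linear scaling achieved (img/sec based)."""
    return multi_gpu_ips / (num_gpus * single_gpu_ips)

# 8 GPUs, GPUDirect enabled:  5048 / (8 * 731) ~= 0.863 -> ~86%
print(round(scaling_efficiency(5048.0, 731.0, 8), 3))
# 8 GPUs, GPUDirect disabled: 5081 / (8 * 733) ~= 0.866 -> ~87%
print(round(scaling_efficiency(5081.0, 733.0, 8), 3))
```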

This is outside my area of expertise, but I think the following additional data would make it easier for a third party to give advice:

(1) Hardware specifications of the nodes: Server vendor and model, CPU(s), system memory size & speed, mass storage
(2) Hardware specifications of the inter-node interconnect. Mellanox offers a multitude of different products in different performance classes.