Hi,
So it turns out that openmpi was in fact detecting the InfiniBand (I think), as the code does run on multiple nodes (just slowly).
I have further tested the code on both PLEIADES at NASA HECC and COMET at SDSC.
COMET is set up to support GPUdirect RDMA, but when running under openmpi and PGI,
it seems that this feature is not being activated (at least according to the output of openmpi).
On PLEIADES, there is no support at all for GPUdirect RDMA.
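(In case it is useful, this is roughly how I have been checking for CUDA-aware and GPUDirect RDMA support;
the second check assumes the openib BTL is in use, so it may not apply on every system:)
  # check whether this openmpi build is CUDA-aware at all
  ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
  # with the openib BTL, check whether GPUDirect RDMA support was compiled in
  ompi_info --all | grep btl_openib_have_cuda_gdr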
To see if the slow run speeds on PLEIADES are due to the lack of GPUdirect, I ran the
same simulation with the same code on COMET using multiple nodes.
To re-cap, the simulation times on PLEIADES are as follows:
PLEIADES:
1 NODE, 1xV100: 2421.9
1 NODE, 4xV100: 832.1
1 NODE, 8xV100: 698.1
2 NODES, 4xV100 each: 1798.5
Ignoring the poor scaling of the run to 8 GPUs (the size of the problem is small),
we see that using 8 GPUs across 2 nodes (4 each) is more than twice as slow as using 8 GPUs on one node.
Switching over to COMET, we find:
COMET:
1 NODE, 1xP100: 3227.3
1 NODE, 4xP100: 1170.6
2 NODES, 2xP100 each: 1170.7
2 NODES, 4xP100 each: 967.3
4 NODES, 2xP100 each: 923.4
Here we see that using the same number of GPUs on 1 node versus multiple nodes
yields almost the same run-times. In fact, running on 4 nodes with 4 GPUs each
yields a communication time of 464.3, while running on 2 nodes yields a communication time of 428.1.
Therefore, the overhead of communication between nodes is not that bad.
To try to understand what is going on, I checked the output of nvidia-smi topo -m.
On COMET, I get:
GPU0 GPU1 GPU2 GPU3 mlx4_0 CPU Affinity
GPU0 X PIX SYS SYS PHB 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26
GPU1 PIX X SYS SYS PHB 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26
GPU2 SYS SYS X PIX SYS 1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27
GPU3 SYS SYS PIX X SYS 1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27
mlx4_0 PHB PHB SYS SYS X
while on PLEIADES, I get:
GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity
GPU0 X NV2 NV2 SYS NODE NODE SYS SYS 0-17
GPU1 NV2 X SYS NV1 PIX PIX SYS SYS 0-17
GPU2 NV2 SYS X NV2 SYS SYS NODE NODE 18-35
GPU3 SYS NV1 NV2 X SYS SYS PIX PIX 18-35
mlx5_0 NODE PIX SYS SYS X PIX SYS SYS
mlx5_1 NODE PIX SYS SYS PIX X SYS SYS
mlx5_2 SYS SYS NODE PIX SYS SYS X PIX
mlx5_3 SYS SYS NODE PIX SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
On COMET, everything seems reasonable in that each GPU sees its partner on the socket as PIX, and the
other 2 GPUs on the other socket as SYS.
However, on PLEIADES, the results seem strange. It looks like each GPU on the node has
NVLink access to 2 of the other GPUs and SYS access to the remaining GPU. Does this imply a 3-way NVLink connection is being used?
Does this information help in diagnosing the slow speeds when using multiple GPU nodes on PLEIADES?
From the topology, I would not expect using multiple nodes to be much slower than a single node
with the same number of GPUs, especially since COMET was also not using GPUdirect and yet shows much better
performance over the network.
I have tried numerous openmpi flags and binding options, but none of them seem to help the run-times.
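For reference, a typical launch looks something like the sketch below; the executable name is just a stand-in,
and the mapping/binding values and MCA parameters are only examples of the kind of settings I have been varying:
  # illustrative: map 4 ranks per node, bind each rank to a socket, and report the bindings
  mpirun -np 8 --map-by ppr:4:node --bind-to socket --report-bindings ./my_sim
  # with the openib BTL, explicitly request GPUDirect RDMA at run time (ignored if support is not built in)
  mpirun -np 8 --mca btl_openib_want_cuda_gdr 1 ./my_sim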
Thanks,
Ron