I’m testing Amber14 (a molecular dynamics application) on a workstation with four Tesla K80s (logically 8 GPUs).
I have two issues:
GPUDirect 2.0 P2P transfers are not available between any two physical K80 cards; P2P is only available between the two GK210 GPUs inside each K80 card.
(I suspect the PCIe implementation of the E5-2600 cannot handle P2P transfers across a multi-layer PCIe switch topology.)
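To check this independently of Amber, a minimal standalone query of what the runtime reports for each device pair looks like the following (a sketch using plain CUDA runtime calls, not taken from Amber):

```cpp
// Sketch: ask the CUDA runtime whether P2P is possible for every device pair,
// to confirm which pairs (e.g. only the two GK210s on one K80 board) report
// peer capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("P2P %d -> %d : %s\n", i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```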
When I run Amber14 on the two GK210 GPUs of a single K80 card with P2P enabled, it runs, but performance is extremely slow.
For example, the “DHFR NPT PME 4fs” benchmark:
without P2P (MPI transfers): 259.12 ns/day
with P2P: 6.82 ns/day
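To see whether the raw P2P path is slow by itself (rather than Amber’s use of it), a rough peer-copy timing sketch along these lines can help; the p2pBandwidthLatencyTest sample that ships with the CUDA toolkit does the same more thoroughly, if you have it:

```cpp
// Rough sketch: time one large peer-to-peer copy from device 0 to device 1
// with peer access enabled, to estimate raw P2P bandwidth independently of
// Amber. (No warm-up or averaging; numbers are only indicative.)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;      // 256 MB transfer
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);       // allow direct 0 -> 1 copies
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);  // device 0 -> device 1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("P2P copy: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```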
I don’t think Amber14 itself is the root cause of these issues, because I’ve seen the same behavior with another P2P-enabled CUDA application.
Has anyone run into similar issues?
Here’s the specification of my workstation:
Machine: Supermicro 7048GR-TR
CPU: Intel Xeon E5-2698v3 *2
Memory: 128GB DDR4 2133MHz
GPU: Tesla K80 *4
Quoting txbob from an older forum thread: “A requirement for P2P (GPU to GPU transfers in the same server node) is that both GPUs in question must be on the same PCIE root complex. This effectively means they must both be plugged into slots that are serviced by the same CPU socket.”
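One way to check which devices actually share a root complex is to print each CUDA device’s PCI bus ID and match it against the PCIe tree from lspci -tv (recent drivers can also print a topology matrix with nvidia-smi topo -m). A rough sketch:

```cpp
// Sketch: print the PCI bus ID of each CUDA device so it can be matched
// against the PCIe tree from lspci to see which GPUs sit under the same
// root complex / CPU socket.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        char busId[64] = {0};
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        printf("device %d : %s\n", dev, busId);
    }
    return 0;
}
```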
Firstly, please forgive the slightly commercial tone of this reply, but since our hardware is used by Amber folks I thought it relevant.
The type of architecture you’re looking for is one our company has been working on for several years. We are now on the recommended hardware list for 8 GPU cards (with K80s that is 16 GPU devices), all on a single root complex as txbob mentioned.
Here is a cut/paste of lspci -tvv output from our 8-GPU-card box (we have a 4-GPU-card server as well); forgive the spacing, I’m not sure how to paste text nicely here.
Each set of 4 GPU cards is connected to our SR3514 (a switch riser with five x16 links: one to the host, four to the cards).
We put either one or two groups of 4 cards in a single server, and as you can see they are all connected to the same local root (02.0).
Now all 16 devices (384 GB of graphics RAM and 49k CUDA cores) can be GPUDirect peers with each other (up to the limit of 8 per peer group).
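As an aside, the per-GPU peer limit mentioned above is easy to see by trying to enable peer access from one device to everything else; something along these lines (a hypothetical sketch, not code from our products):

```cpp
// Hypothetical sketch: enable peer access from device 0 to every other
// visible GPU. Once more than 8 peers are requested for one device, the
// runtime is expected to return an error (e.g. cudaErrorTooManyPeers)
// instead of cudaSuccess.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    cudaSetDevice(0);
    for (int peer = 1; peer < n; ++peer) {
        cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
        printf("enable peer access 0 -> %d : %s\n", peer, cudaGetErrorString(err));
    }
    return 0;
}
```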
We’re at SC15 in Austin this week at booth 1627 (Cirrascale).