simpleP2P fails on Tesla K80, Windows Server 2012 R2, HP DL388

Sorry to bother you. I am using a Tesla K80 and the simpleP2P sample fails. The problem is the same as:

https://devtalk.nvidia.com/default/topic/883054/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/

I searched the internet and found that this problem may be solved by disabling ACSCtl. However, I am using Windows Server 2012 R2. Is there any solution for Windows? Many thanks!

[C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\0_Simple\simpleP2P\../../bin/win64/Debug/simpleP2P.exe] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 4

GPU0 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU2 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU3 = " Tesla K80" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access…

Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : Yes
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1…
Checking GPU0 and GPU1 for UVA capabilities…
Tesla K80 (GPU0) supports UVA: Yes
Tesla K80 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling…
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)…
Creating event handles…
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 1.05GB/s
Preparing host buffer and memcpy to GPU0…
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1…
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0…
Copy data back to host from GPU0 and verify results…
Verification error @ element 0: val = 1.#QNAN0, ref = 0.000000
Verification error @ element 1: val = 1.#QNAN0, ref = 4.000000
Verification error @ element 2: val = 1.#QNAN0, ref = 8.000000
Verification error @ element 3: val = 1.#QNAN0, ref = 12.000000
Verification error @ element 4: val = 1.#QNAN0, ref = 16.000000
Verification error @ element 5: val = 1.#QNAN0, ref = 20.000000
Verification error @ element 6: val = 1.#QNAN0, ref = 24.000000
Verification error @ element 7: val = 1.#QNAN0, ref = 28.000000
Verification error @ element 8: val = 1.#QNAN0, ref = 32.000000
Verification error @ element 9: val = 1.#QNAN0, ref = 36.000000
Verification error @ element 10: val = 1.#QNAN0, ref = 40.000000
Verification error @ element 11: val = 1.#QNAN0, ref = 44.000000
Disabling peer access…
Shutting down…
Test failed!

You’re claiming that this is the same issue. I doubt that it is.

Your output indicates you actually do have peer access between the two GPU devices that make up each K80 product. That means that P2P is working within a K80, but not between K80s. This is exactly what would happen if one K80 is connected to one CPU socket and the other K80 is connected to another CPU socket. In such a scenario, P2P is not possible between devices on separate CPU sockets. My guess is that this is what is happening here.
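To make that reading of the log concrete, here is a minimal Python sketch that parses the "Peer access from ..." lines and groups the GPUs that report mutual peer access. The helper name `peer_groups` and the regular expression are illustrative assumptions, not part of the CUDA samples; the sample text is the peer-access table posted above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
PEER_LINE = re.compile(r"\(GPU(\d+)\)\s*->.*\(GPU(\d+)\)\s*:\s*(Yes|No)")

SAMPLE_LOG = """\
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : Yes
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : Yes
"""


def peer_groups(log_text):
    """Group GPU indices that report peer access in both directions.

    Hypothetical helper: it only reads the text that simpleP2P printed,
    it does not query the driver.
    """
    access = defaultdict(set)  # src GPU -> set of dst GPUs reported as "Yes"
    gpus = set()
    for line in log_text.splitlines():
        m = PEER_LINE.search(line)
        if not m:
            continue
        src, dst = int(m.group(1)), int(m.group(2))
        gpus.update((src, dst))
        if m.group(3) == "Yes":
            access[src].add(dst)

    groups, seen = [], set()
    for gpu in sorted(gpus):
        if gpu in seen:
            continue
        # Depth-first walk over links that are "Yes" in both directions.
        stack, group = [gpu], set()
        while stack:
            cur = stack.pop()
            if cur in group:
                continue
            group.add(cur)
            for nxt in access[cur]:
                if cur in access[nxt] and nxt not in group:
                    stack.append(nxt)
        seen |= group
        groups.append(sorted(group))
    return groups


print(peer_groups(SAMPLE_LOG))  # prints [[0, 1], [2, 3]]
```

The two groups correspond to the two GPUs on each K80 board. The grouping only shows what the driver reports as possible; it says nothing about whether transfers over those links return correct data.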

Thank you for your response!
Actually, when I build a CNN using 2 GPUs, either gpu0 and gpu1 or gpu2 and gpu3, I get a NaN error. However, if I use gpu0 and gpu2, gpu0 and gpu3, gpu1 and gpu2, or gpu1 and gpu3, the program runs fine. So I guess I still cannot get working peer access between the 2 GPUs within a single K80.

Sorry, I agree with your assessment. It does look similar.

In that case, you probably have a few options:

  1. Make sure you have the latest system BIOS loaded for that system.
  2. If that doesn’t fix it, bring it to the attention of your system vendor (HPE).
  3. Alternatively, if you want to pursue a hacked fix, you could try locating Windows utilities that allow direct setting of data in PCI config space. Here is one example that popped up for me with Google:

https://eternallybored.org/misc/pciutils/

(I haven’t tried it. There may be other better ones out there)

In that case you could try following a similar method to the forum thread you linked. However, even if it works, it will not survive a reboot, so it’s really not a very practical fix.

Much appreciated for your help!

I have been struggling with this for a few days. I will try your suggestions and hope they fix it.

This is the output of lspci

D:\IMSN-H\pciutils-3.5.5-win64>lspci |findstr PLX
0b:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0c:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0c:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)

This is the output of lspci -s 0c:08.0 -vvvv

D:\IMSN-H\pciutils-3.5.5-win64>lspci -s 0c:08.0 -vvvv
0c:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca) (prog-if 00 [Normal decode])
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 0
        Bus: primary=0c, secondary=0d, subordinate=0d, sec-latency=0
        I/O behind bridge: 0000f000-00000fff [empty]
        Memory behind bridge: 94000000-94ffffff 
        Prefetchable memory behind bridge: 0000039800000000-0000039c01ffffff 
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] MSI: Enable+ Count=8/8 Maskable+ 64bit+
                Address: 00000000fee00518  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00
                DevCap: MaxPayload 2048 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 25.000W; Interlock- NoCompl-
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet+ LinkState+
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Via message ARIFwd+
                         AtomicOpsCap: Routing+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled ARIFwd-
                         AtomicOpsCtl: EgressBlck+
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a4] Subsystem: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch

I can’t find ACS…
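For anyone checking a dump like this, here is a small Python sketch that scans `lspci -vvv` text and reports, per device, whether an `ACSCtl:` line is present. Two caveats, both assumptions on my part: ACS is a PCIe extended capability, so it lives at config-space offset 0x100 or above, and the dump above lists only capabilities at [40] through [a4], which may mean this lspci build could not read extended config space rather than that the switch lacks ACS. The helper name `acs_report` is hypothetical, and the second device in the sample (with its `ACSCtl:` line) is made up purely for contrast and is not from this system.

```python
import re

# A device block in lspci output starts at column 0 with a bus address such
# as "0c:08.0" (optionally prefixed by a PCI domain such as "0000:").
DEVICE_LINE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")

# First block: abbreviated from the dump posted above.
# Second block: illustrative only, NOT taken from this thread.
SAMPLE_LSPCI = """\
0c:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
        Capabilities: [40] Power Management version 3
        Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00
0c:10.0 PCI bridge: example device for illustration
        Capabilities: [f24 v1] Access Control Services
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
"""


def acs_report(lspci_text):
    """Map each device address to its ACSCtl line, or None if none is listed."""
    report = {}
    current = None
    for line in lspci_text.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            # New device block: assume no ACSCtl until one is seen.
            current = m.group(1)
            report[current] = None
        elif current is not None and line.strip().startswith("ACSCtl:"):
            report[current] = line.strip()
    return report


for address, acs in acs_report(SAMPLE_LSPCI).items():
    print(address, "->", acs if acs else "no ACSCtl line in this dump")
```

A `None` result only means the dump did not contain the line. It does not prove the device has no ACS capability.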

Any update on this?

I’m facing exactly the same issue, and I can’t find ACSCtl either.

(Is it BridgeCtl instead?)

The proper way to handle this is to get a system BIOS update from the system vendor that fixes the issue.