One GPU NOT capable of Peer-to-Peer (P2P)

Hello,

I have two identical graphics cards in my computer. I found that using two GPUs makes my program slower than using only one, so I suspect there may be a configuration problem. Here is some information about my computer:

Model: Dell Precision Tower 5810
OS: Windows 10
Graphics Cards: 2 × NVIDIA Quadro M4000
Cuda: 8.0
cuDNN: 5.1

I ran the simpleP2P test and the results are:

[simpleP2P.exe] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = " Quadro M4000" IS capable of Peer-to-Peer (P2P)
GPU1 = " Quadro M4000" NOT capable of Peer-to-Peer (P2P)

Two or more GPUs with SM 2.0 or higher capability are required for simpleP2P.exe.

Also, a TCC driver must be installed and enabled to run simpleP2P.exe.

What should I do to solve the above problem and make the two GPUs run faster than a single GPU? Thank you.

There is no topo option in my version of nvidia-smi, so I used -a to list as much information as possible, as shown below.

==============NVSMI LOG==============

Timestamp : Wed Apr 12 13:12:54 2017
Driver Version : 376.51

Attached GPUs : 2
GPU 0000:03:00.0
Product Name : Quadro M4000
Product Brand : Quadro
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : WDDM
Pending : WDDM
Serial Number : 0325016106824
GPU UUID : GPU-035131cf-9127-b14a-eccb-bdc741376f02
Minor Number : N/A
VBIOS Version : 84.04.88.00.06
MultiGPU Board : No
Board ID : 0x300
GPU Part Number : 900-5G400-0100-000
Inforom Version
Image Version : G400.0501.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x13F110DE
Bus Id : 0000:03:00.0
Sub System Id : 0x115310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 8000 KB/s
Rx Throughput : 3000 KB/s
Fan Speed : 50 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 8192 MiB
Used : 7019 MiB
Free : 1173 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 229 MiB
Free : 27 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 6 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 50 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 99 C
Power Readings
Power Management : Supported
Power Draw : 22.25 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 10.00 W
Max Power Limit : 120.00 W
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 324 MHz
Video : 405 MHz
Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Default Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Max Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3005 MHz
Video : 710 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
Process ID : 292
Type : C+G
Name : C:\Program Files (x86)\Internet Explorer\iexplore.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 1352
Type : Insufficient Permissions
Name : Insufficient Permissions
Used GPU Memory : Not available in WDDM driver model
Process ID : 3880
Type : C+G
Name : C:\Windows\explorer.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 7560
Type : C+G
Name : C:\Windows\SystemApps\ShellExperienceHost_cw5n1h2txyewy\ShellExperienceHost.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 8252
Type : C+G
Name : C:\Program Files (x86)\Microsoft Office\Office16\OUTLOOK.EXE
Used GPU Memory : Not available in WDDM driver model
Process ID : 8936
Type : C+G
Name : C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 9212
Type : C+G
Name : C:\Windows\SystemApps\Microsoft.Windows.Cortana_cw5n1h2txyewy\SearchUI.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 10644
Type : C+G
Name : C:\Program Files\WindowsApps\Microsoft.WindowsCalculator_10.1604.21020.0_x64__8wekyb3d8bbwe\Calculator.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 10700
Type : C+G
Name : C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\devenv.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 10880
Type : C+G
Name : C:\Windows\System32\ApplicationFrameHost.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 11024
Type : C
Name : C:\Users\brandon5\AppData\Local\Programs\Python\Python35\python.exe
Used GPU Memory : Not available in WDDM driver model

GPU 0000:04:00.0
Product Name : Quadro M4000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : TCC
Pending : TCC
Serial Number : 0323216038356
GPU UUID : GPU-c609aa7a-58ff-2ac3-06ba-7e563761c5f9
Minor Number : N/A
VBIOS Version : 84.04.88.00.06
MultiGPU Board : No
Board ID : 0x400
GPU Part Number : N/A
Inforom Version
Image Version : G400.0501.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x13F110DE
Bus Id : 0000:04:00.0
Sub System Id : 0x115310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 25000 KB/s
Rx Throughput : 68000 KB/s
Fan Speed : 65 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 8121 MiB
Used : 7826 MiB
Free : 295 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 74 %
Memory : 42 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 79 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 99 C
Power Readings
Power Management : Supported
Power Draw : 98.35 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 10.00 W
Max Power Limit : 120.00 W
Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3004 MHz
Video : 712 MHz
Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Default Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Max Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3005 MHz
Video : 710 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
Process ID : 11024
Type : C
Name : C:\Users\brandon5\AppData\Local\Programs\Python\Python35\python.exe
Used GPU Memory : 7826 MiB

From the output you posted:

The GPU that is not supporting P2P is the GPU that is in WDDM mode. You can switch it to TCC mode using nvidia-smi (use nvidia-smi --help to get command-line help) but you won’t be able to do that if the GPU is supporting the system primary display, and a GPU in TCC mode cannot support a display.
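
For reference, a possible command sequence looks like this (the exact syntax may differ by driver version, so check nvidia-smi --help; run from an elevated command prompt, and expect a reboot to be required after switching the driver model):

nvidia-smi                (list GPUs and their indices)
nvidia-smi -i 1 -dm 1     (set the GPU at index 1 to TCC; 0 = WDDM, 1 = TCC)
nvidia-smi -i 1 -dm 0     (switch the same GPU back to WDDM)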

Thank you for your response, txbob. I see. I still need one of them to drive the display anyway. In that case, is there still a benefit to setting one of them to TCC mode, or will performance drop, stay the same, or improve if both are left in WDDM mode? Thank you.

If both GPUs are in WDDM mode you won’t be able to use P2P communication. As txbob says, P2P requires TCC, while the Windows GUI requires WDDM. That is because a device with a TCC driver is seen by the OS as a 3D controller, and as such is not able to drive the GUI. There are of course good performance reasons to use TCC for GPUs used for compute, independent of P2P. It seems that is the setup you have already.

For Windows systems where the predominant use of GPUs is for compute, it is best to use TCC mode for all (high-end) GPUs that support it, and leave one (low-end) GPU in WDDM mode for the GUI. I have in the past used the cheapest available Quadros for GUI support together with high-end Teslas or Quadros for compute. This worked for me because I had minimal visualization needs; your use case may differ. The cheapest Quadros are typically around $130; the current SKU in that role is the Quadro P400.

Thank you for your information, njuffa.

@Moderator: I know this thread is very old, but why does P2P not work between TCC and WDDM GPUs? There must be an industry need to process data on a TCC GPU and then move that data directly to a WDDM GPU for display purposes. Why force a double copy of the data when PCIe has the ability to perform the DMA directly?

According to posts on various forums, TCC <-> WDDM data transfer is not supported with P2P, GPUDirect, or NVLink. What gives?

The WDDM driver model is provided by Microsoft and defined by them, not NVIDIA. Limitations on feature support in a WDDM system arise from this characteristic.

Interestingly, the CUDA 10 drivers seem to announce support for this:
Added support for peer-to-peer (P2P) with CUDA on Windows (WDDM 2.0+ only).

Yes, that is a new feature in CUDA 10 drivers.

But does it support P2P between a TCC and a WDDM GPU?

  • TCC <--> TCC: Yes (CUDA8, maybe earlier?)
  • WDDM <--> WDDM: Yes (CUDA10)
  • TCC <--> WDDM: ???

So I gave this a shot, and peer-to-peer did not work for me on a Win10 machine with the latest NVIDIA driver between two GPUs (an M4000 and an M5000), both in WDDM mode (2.1).

Needless to say, it didn’t work with one GPU in TCC either.

I installed the CUDA 10 toolkit, downloaded and installed the latest 64-bit Win10 NVIDIA driver on the machine, built a CUDA application (confirmed it used nvcc from the CUDA 10 toolkit), and ran it on the Win10 machine. cudaDeviceCanAccessPeer returned false in both directions.
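
In case it helps, this is roughly the check the application performed (a minimal sketch, not the exact code; device indices 0 and 1 are assumed and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("CUDA device count: %d\n", deviceCount);

    // Ask the runtime whether device 0 can access device 1's memory, and vice versa
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("GPU0 -> GPU1 : %s\n", can01 ? "P2P supported" : "P2P NOT supported");
    printf("GPU1 -> GPU0 : %s\n", can10 ? "P2P supported" : "P2P NOT supported");
    return 0;
}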

Topologically, you need a machine where both GPUs are on the same fabric, as well. This has always been the case for P2P support.

Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is the result of querying the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
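
If cudaDeviceCanAccessPeer does report support, the usual pattern is to explicitly enable peer access and then copy directly between devices. A minimal sketch follows (buffer size and device indices are arbitrary; error checking omitted for brevity; this illustrates the API pattern only and is not a guarantee that P2P will be enabled in any particular configuration):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;              // 1 MiB test buffer
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);     // can device 0 access device 1's memory?
    if (!can01) {
        printf("P2P not supported between devices 0 and 1\n");
        return 1;
    }

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // flags argument must be 0
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy from device 1 to device 0; the runtime uses a direct P2P transfer
    // when peer access is enabled, and stages through the host otherwise.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);

    cudaSetDevice(0);
    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}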

You just spoke over my head :-)

I am using an HP Z640 Workstation with the HP 710325-002 Motherboard.

  • Quadro M5000 Slot 2 PCIe3 x16
  • Quadro M4000 Slot 5 PCIe3 x16

Hypothetically, what if there was a PCIe bridge between the GPUs?

  • I imagine this is OK since the peer-to-peer data transfer is still DMA.

Not my area of expertise, but P2P support has traditionally required the GPUs to be endpoints of the same PCIe root complex. The PCIe root complex is typically part of the CPU these days. So the easiest explanation for your observation is that your system is a dual CPU socket platform, where the two GPUs currently are in slots served by different root complexes. Does that apply to your workstation, by any chance? I am not familiar with the HP SKU you mentioned.

What is the relationship between a PCIe-to-PCIe bridge and a PCIe root complex, if any? Is the bridge completely transparent (suitable for extending a PCIe root complex)? Does your workstation feature a PCIe bridge?

It also did not work with two Quadro P2000s; cudaDeviceCanAccessPeer returned false in both directions.

Setup:

  • HP Z640 Workstation with the HP 710325-002 Motherboard.
  • A single Intel Xeon E5-1650 V4
  • Windows 10 (10.0 Build 14393) Enterprise 2016, 64-bit
  • Both in WDDM (2.1).
  • Driver: 411.81-quadro-grid-desktop-notebook-win10-64bit-international-whql.
  • CUDA10: cuda_10.0.130_411.31_windows.
  • CUDA10 DLL: cudart64_100.dll.

The HP Z640 Workstation does not have a PCIe bridge.

  • It has an Intel® Xeon® Processor E5-1650 v4 installed.
  • 6 cores, 12 Threads.
  • I don’t have an answer to your root-complex question. Will try to find out.

@Robert_Crovella: Would you happen to know whether this new CUDA10/Win10/WDDM2.x feature is limited to NVLink for communication between GPUs, or whether it will work over PCIe as well?

Thanks

Looking at Intel ARK, the E5-1650 v4 is a processor designed for single-socket (1S) systems and sports 40 PCIe lanes. I have no further hypotheses as to why P2P doesn't work (as I said, not my area of expertise).

In practical terms, if you are desperate enough, you could try the GPUs in different PCIe slots to check whether that makes a difference.

I know P2P works while the two GPUs are in TCC mode: I added a third GPU and changed the two P2000s to TCC. This tells me that the physical layer isn't the issue.

I am guessing that the capability truly hasn't been enabled in the current CUDA 10 release. Maybe NVIDIA will enable it in a future 10.x version.