One GPU NOT capable of Peer-to-Peer (P2P)

Hello,

I have two identical graphics cards in my computer. I found that using two GPUs makes my program slower than using only one, so I suspect there may be a configuration problem. Here is some information about my computer:

Model: Dell Precision Tower 5810
OS: Windows 10
Graphics Cards: 2 × NVIDIA Quadro M4000
Cuda: 8.0
cuDNN: 5.1

I ran the simpleP2P test and the results are:

[simpleP2P.exe] - Starting…
Checking for multiple GPUs…
CUDA-capable device count: 2

GPU0 = " Quadro M4000" IS capable of Peer-to-Peer (P2P)
GPU1 = " Quadro M4000" NOT capable of Peer-to-Peer (P2P)

Two or more GPUs with SM 2.0 or higher capability are required for simpleP2P.exe.

Also, a TCC driver must be installed and enabled to run simpleP2P.exe.

What should I do to solve the above problem and make the two GPUs run faster than a single GPU? Thank you.

There is no topo option in my version of nvidia-smi, so I used -a to list as much information as possible, as shown below.

==============NVSMI LOG==============

Timestamp : Wed Apr 12 13:12:54 2017
Driver Version : 376.51

Attached GPUs : 2
GPU 0000:03:00.0
Product Name : Quadro M4000
Product Brand : Quadro
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : WDDM
Pending : WDDM
Serial Number : 0325016106824
GPU UUID : GPU-035131cf-9127-b14a-eccb-bdc741376f02
Minor Number : N/A
VBIOS Version : 84.04.88.00.06
MultiGPU Board : No
Board ID : 0x300
GPU Part Number : 900-5G400-0100-000
Inforom Version
Image Version : G400.0501.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x13F110DE
Bus Id : 0000:03:00.0
Sub System Id : 0x115310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 8000 KB/s
Rx Throughput : 3000 KB/s
Fan Speed : 50 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 8192 MiB
Used : 7019 MiB
Free : 1173 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 229 MiB
Free : 27 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 6 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 50 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 99 C
Power Readings
Power Management : Supported
Power Draw : 22.25 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 10.00 W
Max Power Limit : 120.00 W
Clocks
Graphics : 135 MHz
SM : 135 MHz
Memory : 324 MHz
Video : 405 MHz
Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Default Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Max Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3005 MHz
Video : 710 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
Process ID : 292
Type : C+G
Name : C:\Program Files (x86)\Internet Explorer\iexplore.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 1352
Type : Insufficient Permissions
Name : Insufficient Permissions
Used GPU Memory : Not available in WDDM driver model
Process ID : 3880
Type : C+G
Name : C:\Windows\explorer.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 7560
Type : C+G
Name : C:\Windows\SystemApps\ShellExperienceHost_cw5n1h2txyewy\ShellExperienceHost.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 8252
Type : C+G
Name : C:\Program Files (x86)\Microsoft Office\Office16\OUTLOOK.EXE
Used GPU Memory : Not available in WDDM driver model
Process ID : 8936
Type : C+G
Name : C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 9212
Type : C+G
Name : C:\Windows\SystemApps\Microsoft.Windows.Cortana_cw5n1h2txyewy\SearchUI.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 10644
Type : C+G
Name : C:\Program Files\WindowsApps\Microsoft.WindowsCalculator_10.1604.21020.0_x64__8wekyb3d8bbwe\Calculator.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 10700
Type : C+G
Name : C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\devenv.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 10880
Type : C+G
Name : C:\Windows\System32\ApplicationFrameHost.exe
Used GPU Memory : Not available in WDDM driver model
Process ID : 11024
Type : C
Name : C:\Users\brandon5\AppData\Local\Programs\Python\Python35\python.exe
Used GPU Memory : Not available in WDDM driver model

GPU 0000:04:00.0
Product Name : Quadro M4000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : TCC
Pending : TCC
Serial Number : 0323216038356
GPU UUID : GPU-c609aa7a-58ff-2ac3-06ba-7e563761c5f9
Minor Number : N/A
VBIOS Version : 84.04.88.00.06
MultiGPU Board : No
Board ID : 0x400
GPU Part Number : N/A
Inforom Version
Image Version : G400.0501.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x13F110DE
Bus Id : 0000:04:00.0
Sub System Id : 0x115310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 25000 KB/s
Rx Throughput : 68000 KB/s
Fan Speed : 65 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 8121 MiB
Used : 7826 MiB
Free : 295 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 74 %
Memory : 42 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 79 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 99 C
Power Readings
Power Management : Supported
Power Draw : 98.35 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 10.00 W
Max Power Limit : 120.00 W
Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3004 MHz
Video : 712 MHz
Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Default Applications Clocks
Graphics : 772 MHz
Memory : 3005 MHz
Max Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 3005 MHz
Video : 710 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes
Process ID : 11024
Type : C
Name : C:\Users\brandon5\AppData\Local\Programs\Python\Python35\python.exe
Used GPU Memory : 7826 MiB

From the output you posted:

The GPU that is not supporting P2P is the GPU that is in WDDM mode. You can switch it to TCC mode using nvidia-smi (use nvidia-smi --help to get command-line help) but you won’t be able to do that if the GPU is supporting the system primary display, and a GPU in TCC mode cannot support a display.
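
For reference, a possible command sequence looks like this (the exact syntax may differ by driver version, so check nvidia-smi --help; run from an elevated command prompt, and expect a reboot to be required after switching the driver model):

nvidia-smi                (list GPUs and their indices)
nvidia-smi -i 1 -dm 1     (set the GPU at index 1 to TCC; 0 = WDDM, 1 = TCC)
nvidia-smi -i 1 -dm 0     (switch the same GPU back to WDDM)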

Thank you for your response, txbob. I see. I still need one of them to drive the display anyway. In that case, is there still a benefit to setting one of them to TCC mode, or will performance drop, stay the same, or improve if both are left in WDDM mode? Thank you.

If both GPUs are in WDDM mode you won’t be able to use P2P communication. As txbob says, P2P requires TCC, while the Windows GUI requires WDDM. That is because a device with a TCC driver is seen by the OS as a 3D controller, and as such is not able to drive the GUI. There are of course good performance reasons to use TCC for GPUs used for compute, independent of P2P. It seems that is the setup you have already.

For Windows systems where the predominant use of GPUs is for compute, it is best to use TCC mode for all (high-end) GPUs that support it, and leave one (low-end) GPU in WDDM mode for the GUI. I have in the past used the cheapest available Quadros for GUI support together with high-end Teslas or Quadros for compute. This worked for me because I had minimal visualization needs; your use case may differ. The cheapest Quadros are typically around $130; the current SKU in that role is the Quadro P400.

Thank you for your information, njuffa.

@Moderator: I know this thread is very old, but why does P2P not work between TCC and WDDM GPUs? There must be an industry need to process data on a TCC GPU and then move that data directly to a WDDM GPU for display purposes. Why force a double copy of the data when PCIe has the ability to perform the DMA directly?

According to posts on various forums, TCC <-> WDDM data transfer is not supported with P2P, GPUDirect, or NVLink. What gives?

The WDDM driver model is provided by Microsoft and defined by them, not NVIDIA. Limitations on feature support in a WDDM system arise from this characteristic.

Interestingly, the CUDA 10 drivers seem to announce support for this:
Added support for peer-to-peer (P2P) with CUDA on Windows (WDDM 2.0+ only).

Yes, that is a new feature in CUDA 10 drivers.

But does it support P2P between a TCC and a WDDM GPU?

  • TCC <--> TCC: Yes (CUDA8, maybe earlier?)
  • WDDM <--> WDDM: Yes (CUDA10)
  • TCC <--> WDDM: ???

So I gave this a shot, and peer-to-peer did not work for me on a Win10 machine with the latest NVIDIA driver between two GPUs (an M4000 and an M5000), both in WDDM mode (2.1).

Needless to say, it didn’t work with one GPU in TCC either.

I installed the CUDA 10 toolkit, downloaded and installed the latest 64-bit Win10 NVIDIA driver on the machine, built a CUDA application (confirmed it used nvcc from the CUDA 10 toolkit), and ran it on the Win10 machine. cudaDeviceCanAccessPeer returned false in both directions.
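
In case it helps, this is roughly the check the application performed (a minimal sketch, not the exact code; device indices 0 and 1 are assumed and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("CUDA device count: %d\n", deviceCount);

    // Ask the runtime whether device 0 can access device 1's memory, and vice versa
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("GPU0 -> GPU1 : %s\n", can01 ? "P2P supported" : "P2P NOT supported");
    printf("GPU1 -> GPU0 : %s\n", can10 ? "P2P supported" : "P2P NOT supported");
    return 0;
}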

Topologically, you need a machine where both GPUs are on the same fabric, as well. This has always been the case for P2P support.

Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is the result of querying the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
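
If cudaDeviceCanAccessPeer does report support, the usual pattern is to explicitly enable peer access and then copy directly between devices. A minimal sketch follows (buffer size and device indices are arbitrary; error checking omitted for brevity; this illustrates the API pattern only and is not a guarantee that P2P will be enabled in any particular configuration):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;              // 1 MiB test buffer
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);     // can device 0 access device 1's memory?
    if (!can01) {
        printf("P2P not supported between devices 0 and 1\n");
        return 1;
    }

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // flags argument must be 0
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy from device 1 to device 0; the runtime uses a direct P2P transfer
    // when peer access is enabled, and stages through the host otherwise.
    cudaMemcpyPeer(buf0, 0, buf1, 1, bytes);

    cudaSetDevice(0);
    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}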

You just spoke over my head :-)

I am using an HP Z640 Workstation with the HP 710325-002 Motherboard.

  • Quadro M5000 Slot 2 PCIe3 x16
  • Quadro M4000 Slot 5 PCIe3 x16

Hypothetically, what if there was a PCIe bridge between the GPUs?

  • I imagine this is OK since the peer-to-peer data transfer is still DMA.

Not my area of expertise, but P2P support has traditionally required the GPUs to be endpoints of the same PCIe root complex. The PCIe root complex is typically part of the CPU these days. So the easiest explanation for your observation is that your system is a dual CPU socket platform, where the two GPUs currently are in slots served by different root complexes. Does that apply to your workstation, by any chance? I am not familiar with the HP SKU you mentioned.

What is the relationship between a PCIe-to-PCIe bridge and a PCIe root complex, if any? Is the bridge completely transparent (suitable for extending a PCIe root complex)? Does your workstation feature a PCIe bridge?

It also did not work with two Quadro P2000s; cudaDeviceCanAccessPeer returned false in both directions.

Setup:

  • HP Z640 Workstation with the HP 710325-002 Motherboard.
  • A single Intel Xeon E5-1650 V4
  • Windows 10 (10.0 Build 14393) Enterprise 2016, 64-bit
  • Both in WDDM (2.1).
  • Driver: 411.81-quadro-grid-desktop-notebook-win10-64bit-international-whql.
  • CUDA10: cuda_10.0.130_411.31_windows.
  • CUDA10 DLL: cudart64_100.dll.

The HP Z640 Workstation does not have a PCIe bridge.

  • It has an Intel® Xeon® Processor E5-1650 v4 installed.
  • 6 cores, 12 Threads.
  • I don’t have an answer to your root-complex question. Will try to find out.

@Robert_Crovella: Would you happen to know whether this new CUDA10/Win10/WDDM2.x feature is limited to NVLink for communication between GPUs, or whether it will work over PCIe as well?

Thanks

Looking at Intel ARK, the E5-1650 v4 is a processor designed for single-socket (1S) systems and sports 40 PCIe lanes. I have no further hypotheses as to why P2P doesn't work (as I said, not my area of expertise).

In practical terms, if you are desperate enough, you could try the GPUs in different PCIe slots to check whether that makes a difference.

I know P2P works while the two GPUs are in TCC mode: I added a third GPU and changed the two P2000s to TCC. This tells me that the physical layer isn't the issue.

I am guessing that the capability truly hasn't been enabled in the current CUDA 10 release. Maybe NVIDIA will enable it in a future 10.x version.