Two Quadro M4000 capable of P2P but no access

I want to run a multi-GPU program on my system with two Quadro M4000 cards (Ubuntu 16.04 64-bit, CUDA 8.0).

I ran the simpleP2P sample and found that the two GPUs cannot access each other.

./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "   Quadro M4000" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "   Quadro M4000" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Quadro M4000 (GPU0) -> Quadro M4000 (GPU1) : No
> Peer access from Quadro M4000 (GPU1) -> Quadro M4000 (GPU0) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

Here is the nvidia-smi -a output:

==============NVSMI LOG==============

Timestamp                           : Thu Nov  3 14:48:34 2016
Driver Version                      : 367.48

Attached GPUs                       : 2
GPU 0000:01:00.0
    Product Name                    : Quadro M4000
    Product Brand                   : Quadro
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0323416045901
    GPU UUID                        : GPU-0ef125ab-4e4c-cb60-20f7-d94d8be6375d
    Minor Number                    : 0
    VBIOS Version                   : 84.04.88.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x100
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G400.0501.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x01
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x13F110DE
        Bus Id                      : 0000:01:00.0
        Sub System Id               : 0x115310DE
        GPU Link Info
            PCIe Generation
                Max                 : 2
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 5000 KB/s
    Fan Speed                       : 46 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 8120 MiB
        Used                        : 148 MiB
        Free                        : 7972 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 2 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 36 C
        GPU Shutdown Temp           : 104 C
        GPU Slowdown Temp           : 99 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 16.51 W
        Power Limit                 : 120.00 W
        Default Power Limit         : 120.00 W
        Enforced Power Limit        : 120.00 W
        Min Power Limit             : 10.00 W
        Max Power Limit             : 120.00 W
    Clocks
        Graphics                    : 135 MHz
        SM                          : 135 MHz
        Memory                      : 324 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Default Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Max Clocks
        Graphics                    : 772 MHz
        SM                          : 772 MHz
        Memory                      : 3005 MHz
        Video                       : 710 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes
        Process ID                  : 932
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 101 MiB
        Process ID                  : 1612
            Type                    : G
            Name                    : compiz
            Used GPU Memory         : 45 MiB

GPU 0000:02:00.0
    Product Name                    : Quadro M4000
    Product Brand                   : Quadro
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0320916049028
    GPU UUID                        : GPU-e8d10210-2bff-57fa-0ae7-555d49adb1fb
    Minor Number                    : 1
    VBIOS Version                   : 84.04.88.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x200
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G400.0501.01.03
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x13F110DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x115310DE
        GPU Link Info
            PCIe Generation
                Max                 : 1
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 4x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 46 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 8120 MiB
        Used                        : 1 MiB
        Free                        : 8119 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 4 MiB
        Free                        : 252 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 31 C
        GPU Shutdown Temp           : 104 C
        GPU Slowdown Temp           : 99 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 10.60 W
        Power Limit                 : 120.00 W
        Default Power Limit         : 120.00 W
        Enforced Power Limit        : 120.00 W
        Min Power Limit             : 10.00 W
        Max Power Limit             : 120.00 W
    Clocks
        Graphics                    : 135 MHz
        SM                          : 135 MHz
        Memory                      : 324 MHz
        Video                       : 405 MHz
    Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Default Applications Clocks
        Graphics                    : 772 MHz
        Memory                      : 3005 MHz
    Max Clocks
        Graphics                    : 772 MHz
        SM                          : 772 MHz
        Memory                      : 3005 MHz
        Video                       : 710 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

Also, the results of nvidia-smi topo -m:

        GPU0    GPU1    CPU Affinity
GPU0     X      PHB     0-3
GPU1    PHB      X      0-3

Am I missing something?

I think that with this configuration each GPU should be able to access the other. So what is the problem?

Diagnosing such issues is txbob’s speciality, but I’ll give it a try. Is this a dual CPU socket system, by any chance? It looks to me like the two GPUs are on different PCIe root complexes (each CPU provides its own PCIe root complex). GPUs on different PCIe root complexes cannot talk directly to each other due to limitations in the QPI link that connects the CPUs.

You would definitely want to wait for txbob’s insights.

I don’t think so, because nvidia-smi topo -m reports the same CPU affinity for both GPUs. Thanks anyway; I will wait for txbob’s insights.

Interesting point about the “CPU Affinity”, I had not paid attention to that (still learning).

CPU affinity has little to do with P2P access. The fact that you found the affinity masks set to the same value may suggest both GPUs are attached to the same CPU. However, CPU affinity is freely configurable, and changing it naturally can’t physically reattach GPUs to different CPUs.

So rather than looking at CPU affinity masks I’d look at the PCI bus numbers the GPUs are connected to. These are different in your case, so as njuffa said, your GPUs can’t directly talk to each other.
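To keep the two columns of the topology matrix apart, here is a minimal Python sketch that splits the nvidia-smi topo -m output into GPU-to-GPU link types and CPU affinity. The function name `parse_topo` and the tab-separated sample text are assumptions made for illustration; the sample mirrors the matrix posted above, and the real output may also carry a legend that this sketch does not handle.

```python
# Sample modeled on the matrix posted in this thread (tab-separated; assumed layout).
TOPO = "GPU0\tGPU1\tCPU Affinity\nGPU0\t X \tPHB\t0-3\nGPU1\tPHB\t X \t0-3\n"


def parse_topo(text):
    """Return (links, affinity) parsed from a topology matrix.

    links maps (row_gpu, col_gpu) to the link type string, e.g. "PHB".
    affinity maps each GPU to its CPU affinity string, e.g. "0-3".
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].split("\t") if c.strip()]
    gpus = [c for c in header if c.startswith("GPU")]
    links, affinity = {}, {}
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("\t")]
        row = cells[0]
        for col, cell in zip(gpus, cells[1:1 + len(gpus)]):
            if cell != "X":  # "X" marks the GPU itself
                links[(row, col)] = cell
        affinity[row] = cells[1 + len(gpus)]
    return links, affinity


links, affinity = parse_topo(TOPO)
print(links)     # link type between each GPU pair
print(affinity)  # CPU affinity per GPU, a separate piece of information
```

The link type (PHB here, meaning the path crosses a PCIe host bridge) describes the route between the GPUs; the affinity column only says which CPU cores are considered local to each GPU.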

If you are using a mainboard with four PCIe x16 slots rather than two, there is a high chance you can enable P2P access by moving one GPU to a different slot.
However, this will involve performance tradeoffs depending on your use case, as it will make your configuration asymmetric and both GPUs will be handled by the same CPU. This is particularly so if the two slots on each CPU are connected via a PCIe switch rather than directly to the CPU, so that they share the same bandwidth to the host.

This is the motherboard.

https://www.msi.com/Motherboard/870G45.html#hero-overview

There are two PCIe x16 slots and only one CPU. I think both are attached to the same CPU.

I had based my diagnosis of multiple PCIe root complexes on the PCI bus numbers mentioned by tera, as each GPU shows up as device 0 on a different PCI bus. That is what led me to suspect that this is a dual CPU socket machine, so I am surprised to now see that it is a motherboard with a single CPU socket. Not sure what to make of that.
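For reference, the bus IDs that nvidia-smi prints have the form domain:bus:device.function, with the first three fields in hexadecimal. A minimal Python sketch that splits them follows; the name `parse_bus_id` is made up for illustration. Note that a different bus number does not by itself prove a separate root complex, because every PCIe root port or bridge gets its own secondary bus number, so the bus tree is the more reliable view.

```python
import re

# domain:bus:device.function, e.g. "0000:01:00.0" (domain may be 4 or 8 hex digits)
BUS_ID = re.compile(r"^([0-9a-fA-F]{4,8}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")


def parse_bus_id(bus_id):
    """Split a PCI bus ID string into its numeric fields."""
    m = BUS_ID.match(bus_id.strip())
    if m is None:
        raise ValueError(f"not a PCI bus ID: {bus_id!r}")
    domain, bus, device = (int(x, 16) for x in m.group(1, 2, 3))
    return {"domain": domain, "bus": bus, "device": device, "function": int(m.group(4))}


# The two bus IDs from the nvidia-smi -a output in this thread.
for bus_id in ("0000:01:00.0", "0000:02:00.0"):
    print(bus_id, parse_bus_id(bus_id))
```

Both GPUs come out as device 0, function 0 in domain 0, on bus 1 and bus 2 respectively, which is the observation described above.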

It’s an AMD motherboard. The system will not enable P2P support for unrecognized core logic. It’s possible that the specific core logic/CPU used here is not recognized by the CUDA runtime, and therefore it may be unsupported.

If you use a motherboard with well-known Intel server or workstation processors/core logic, it should probably work.

This is really just a guess.

Furthermore, let’s not assume this is some sinister plot on the part of NVIDIA. The supported processors/core logic for P2P are developed into a whitelist based on the systems that our development teams know about and can test.

It’s possible that:

  1. This particular AMD core logic actually has a problem with socket-level P2P (i.e. PHB) (i.e. we tested it, and it failed)
  2. We’ve simply never seen this setup before, so it didn’t make it into the whitelist.

P2P is not guaranteed to work in any system that you throw together. Support is enabled according to a whitelist. No, the whitelist is not published. Again, popular Intel server and workstation processors/core logic should be supported. AMD systems may work as well. It’s possible that this one is not supported.

Don’t invest in hardware to support specific GPU system architectural capabilities unless you have confirmation that it is supported.

Thank you txbob for your (very useful as usual) insight.

As the AMD 770 chipset only supports one PCIe x16 and one PCIe x4 port (plus two x1 ports that usually use slots mechanically incompatible with x16 cards), I suspect your option 2. Nvidia probably never looked into enabling P2P for a system that would only allow PCIe 2.0 x4 bandwidth (2 GB/s).
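The 2 GB/s figure follows from the per-lane rate: PCIe 2.0 signals at 5 GT/s with 8b/10b encoding, which leaves 500 MB/s of payload bandwidth per lane in each direction. A small Python sketch of that arithmetic is below; the function name `pcie_bandwidth_gb_s` is made up for illustration, and the result is a theoretical peak that ignores protocol overhead.

```python
# Raw signalling rate per lane in GT/s, and line-encoding efficiency, per PCIe generation.
LINE_RATE_GT_S = {1: 2.5, 2: 5.0, 3: 8.0}
ENCODING = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130}


def pcie_bandwidth_gb_s(generation, lanes):
    """Theoretical peak bandwidth per direction in GB/s (1 GB = 1e9 bytes)."""
    bits_per_s = LINE_RATE_GT_S[generation] * 1e9 * ENCODING[generation] * lanes
    return bits_per_s / 8 / 1e9


print(f"PCIe 2.0 x4 : {pcie_bandwidth_gb_s(2, 4):.1f} GB/s")   # 2.0 GB/s
print(f"PCIe 2.0 x16: {pcie_bandwidth_gb_s(2, 16):.1f} GB/s")  # 8.0 GB/s
print(f"PCIe 1.x x4 : {pcie_bandwidth_gb_s(1, 4):.1f} GB/s")   # 1.0 GB/s
```

The last line corresponds to what nvidia-smi -a reports for GPU1 in this thread (generation 1, current width 4x), although the link generation reported at idle is not necessarily what is used under load.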

Sorry, I had gone off in the wrong direction because of the mention of CPU affinity, which had made me expect this system to have more than one CPU socket. I should not have commented before asking for “lspci -v” and/or “lspci -t” output.

Also, peer access between two devices is disabled if either is in SLI mode. That might be something to check here, since the nvidia-smi output for one of the two devices lists the display as active. I’m not sure how a slave SLI device appears in nvidia-smi, that is, whether it shows the display as active or disabled.

nvidia-smi results

nvidia-smi
Mon Nov  7 12:52:36 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 0000:01:00.0      On |                  N/A |
| 48%   48C    P0    44W / 120W |    194MiB /  8120MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M4000        Off  | 0000:02:00.0     Off |                  N/A |
| 46%   33C    P8    10W / 120W |      1MiB /  8120MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       955    G   /usr/lib/xorg/Xorg                             109MiB |
|    0      1642    G   compiz                                          83MiB |
+-----------------------------------------------------------------------------+