Peer access not supported between devices

Hello,

Recently I have been trying to run TensorFlow. I have 4 Tesla K80s installed on my machine, but when the TensorFlow application launches, the GPUs do not seem to be able to talk to each other:

2017-10-22 19:15:29.254697: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x10029b69bc0
2017-10-22 19:15:29.303694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0004:03:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2017-10-22 19:15:29.304429: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x10029b5dbc0
2017-10-22 19:15:29.354736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0004:04:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2017-10-22 19:15:29.354794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 1
2017-10-22 19:15:29.354813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 2
2017-10-22 19:15:29.354830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 3
2017-10-22 19:15:29.354847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 0
2017-10-22 19:15:29.354864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 2
2017-10-22 19:15:29.354879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 3
2017-10-22 19:15:29.354895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 0
2017-10-22 19:15:29.354911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 1
2017-10-22 19:15:29.354928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 3
2017-10-22 19:15:29.354943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 0
2017-10-22 19:15:29.354959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 1
2017-10-22 19:15:29.354974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 2
2017-10-22 19:15:29.355070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3
2017-10-22 19:15:29.355079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y N N N
2017-10-22 19:15:29.355088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   N Y N N
2017-10-22 19:15:29.355096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2:   N N Y N
2017-10-22 19:15:29.355103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3:   N N N Y
2017-10-22 19:15:29.355130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0002:03:00.0)
2017-10-22 19:15:29.355141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0002:04:00.0)
2017-10-22 19:15:29.355152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0004:03:00.0)
2017-10-22 19:15:29.355161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0004:04:00.0)
2017-10-22 19:15:29.625719: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 4 visible devices
2017-10-22 19:15:29.625756: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 80 visible devices
2017-10-22 19:15:29.629803: I tensorflow/compiler/xla/service/service.cc:183] XLA service 0x10028f1e140 executing computations on platform Host. Devices:
2017-10-22 19:15:29.629844: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (0): <undefined>, <undefined>
2017-10-22 19:15:29.630685: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 4 visible devices
2017-10-22 19:15:29.630698: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 80 visible devices
2017-10-22 19:15:29.634661: I tensorflow/compiler/xla/service/service.cc:183] XLA service 0x10028f865d0 executing computations on platform CUDA. Devices:
2017-10-22 19:15:29.634702: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.634733: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.634758: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (2): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.634783: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (3): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.647023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0002:03:00.0)
2017-10-22 19:15:29.647043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0002:04:00.0)
2017-10-22 19:15:29.647053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0004:03:00.0)
2017-10-22 19:15:29.647063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0004:04:00.0)

Here is the output from nvidia-smi -a:

==============NVSMI LOG==============

Timestamp                           : Sun Oct 22 18:53:32 2017
Driver Version                      : 384.66

Attached GPUs                       : 4
GPU 00000002:03:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322015005586
    GPU UUID                        : GPU-79fbbca5-be02-5eb7-a9c0-1111f4153358
    Minor Number                    : 2
    VBIOS Version                   : 80.21.1B.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x20300
    GPU Part Number                 : 900-22080-0404-030
    Inforom Version
        Image Version               : 2080.0200.00.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    PCI
        Bus                         : 0x03
        Device                      : 0x00
        Domain                      : 0x0002
        Device Id                   : 0x102D10DE
        Bus Id                      : 00000002:03:00.0
        Sub System Id               : 0x106C10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
    FB Memory Usage
        Total                       : 11439 MiB
        Used                        : 0 MiB
        Free                        : 11439 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
        Aggregate
            Single Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
            Double Bit
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 36 C
        GPU Shutdown Temp           : 93 C
        GPU Slowdown Temp           : 88 C
        Memory Temp                 : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 61.20 W
        Power Limit                 : 149.00 W
        Default Power Limit         : 149.00 W
        Enforced Power Limit        : 149.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 175.00 W
    Clocks
        Graphics                    : 562 MHz
        SM                          : 562 MHz
        Memory                      : 2505 MHz
        Video                       : 540 MHz
    Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Default Applications Clocks
        Graphics                    : 562 MHz
        Memory                      : 2505 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 2505 MHz
        Video                       : 540 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

GPU 00000002:04:00.0
...
GPU 00000004:03:00.0
...
GPU 00000004:04:00.0
...

Can I have some advice on how to get P2P access enabled?
Thanks.

What is the system platform used here (vendor name, model number)? Did you acquire the system including the GPUs from a system integrator that is an NVIDIA partner, or did you put it together yourself?

What CPU is being used, how many CPU sockets are there? What’s the output of nvidia-smi topo -m? What’s the output of lspci -t?

I am guessing that this is a dual-CPU system where each CPU is connected to two K80s and each CPU provides its own PCIe root complex. P2P requires that the GPUs are on the same root complex.
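
To rule out anything TensorFlow-specific, you could also check what the CUDA runtime itself reports for each pair of devices. A minimal sketch along the lines of the simpleP2P sample (the file name and build line are just illustrative, e.g. nvcc -o p2p_check p2p_check.cu):

// p2p_check.cu -- minimal sketch: ask the CUDA runtime which device pairs support P2P.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }
    printf("Found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            // Asks the driver whether device i can map and access memory on device j.
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n", i, j, can ? "supported" : "NOT supported");
        }
    }
    return 0;
}

If this reports the same "NOT supported" pattern as the TensorFlow log, the limitation is coming from the driver/platform rather than from TensorFlow.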

Thank you, it’s an IBM POWER8 server, Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-96-generic ppc64le).

There are 2 CPU sockets on it.

nvidia-smi topo -m:

GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	PIX	SOC	SOC	80-83,88-91,96-99,104-107,112-115,120-123,128-131,136-139,144-147,152-155
GPU1	PIX	 X 	SOC	SOC	80-83,88-91,96-99,104-107,112-115,120-123,128-131,136-139,144-147,152-155
GPU2	SOC	SOC	 X 	PIX	0-3,8-11,16-19,24-27,32-35,40-43,48-51,56-59,64-67,72-75
GPU3	SOC	SOC	PIX	 X 	0-3,8-11,16-19,24-27,32-35,40-43,48-51,56-59,64-67,72-75

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets (e.g., QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

lspci -t:

-+-[0004:00]---00.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0
 |                                           \-10.0-[04]----00.0
 +-[0003:00]---00.0-[01]----00.0
 +-[0002:00]---00.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0
 |                                           \-10.0-[04]----00.0
 +-[0001:00]---00.0-[01-11]--+-00.0-[02-11]--+-01.0-[03]--+-00.0
 |                           |               |            +-00.1
 |                           |               |            +-00.2
 |                           |               |            \-00.3
 |                           |               +-08.0-[04-08]--
 |                           |               +-09.0-[09]----00.0
 |                           |               +-0a.0-[0a]----00.0
 |                           |               +-0b.0-[0b-0c]----00.0-[0c]----00.0
 |                           |               \-0c.0-[0d-11]--
 |                           +-00.1
 |                           +-00.2
 |                           +-00.3
 |                           \-00.4
 \-[0000:00]---00.0-[01]----00.0

I am not familiar with Power8 systems. The lack of P2P support could potentially trace back to the “SOC” connections, which would seem to indicate a connection between different PCIe root complexes via a different interconnect. But that’s just a guess. txbob may have a better idea of how this is supposed to work.
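
For what it is worth, CUDA 8.0 also exposes per-pair P2P attributes through cudaDeviceGetP2PAttribute, which may help the diagnosis. A small sketch that dumps what the driver reports for each pair (output fields are just my naming):

// p2p_attr.cu -- sketch (CUDA 8.0+): dump per-pair P2P attributes reported by the driver.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int access = 0, rank = 0;
            // Whether a direct P2P link is available at all for this pair ...
            cudaDeviceGetP2PAttribute(&access, cudaDevP2PAttrAccessSupported, src, dst);
            // ... and the driver's relative performance rank for the link.
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, src, dst);
            printf("GPU %d -> GPU %d : access=%d perf_rank=%d\n", src, dst, access, rank);
        }
    }
    return 0;
}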

Have you checked with the system vendor who integrated the GPUs (i.e., IBM) on this issue?

I’m seeing the same problem, but interestingly it is only on one of my two Power8 systems. The other one shows P2P support between GPUs 0 and 1 and also between 2 and 3. Both of my nodes are running Ubuntu 16.04, but with slightly different kernel revisions. They both have CUDA 8.0.61 installed, but the one that works has driver version 375.51 and the one that doesn’t work has 384.81. I’m wondering if there could be a problem with the later driver, or an incompatibility between that driver and the runtime.
The other thing I see that is different between the 2 systems is that in the output of nvidia-smi -a the broken system has
MultiGPU Board : No
for all the GPUs, but that is Yes on the other system. Is it possible that there is a BIOS or firmware setting that is different?

Easy enough to find out: Install the older driver on the problematic machine and see whether that fixes the issue. Obviously, driver bugs are a fact of life, but I think it is unlikely that they would affect basic functionality such as this.

That seems interesting and quite possibly significant. From the name (“board”), it looks like a hardware configuration option, though, rather than a firmware setting. You might want to inquire about this with the hardware vendor (that is, the system integrator who delivered the machine).

On second thought, that probably just means you have different types of GPUs installed in these two machines: one uses dual-GPU boards (like the Tesla K80), while the other uses single-GPU boards (like the Tesla P100). What are the respective GPU configurations of these two machines?

It was the driver version. I loaded 375.88 and MultiGPU Board now shows Yes for all GPUs, and the simpleP2P sample shows peer access between the first 2 and the second 2 boards. It looks like there is one more driver version between this one and the one that was broken. I’ll test that to see where it broke and then open a ticket with NVIDIA.

That’s good sleuthing. Yeah, definitely file a high-priority bug with NVIDIA. Frankly, I am surprised that the driver change even changes the MultiGPU status displayed by nvidia-smi, but there’s always a first time for everything …

If it is already fixed in a newer driver, there is little point in filing a bug.

Standard debugging (and QA) practice is to retest any anomalies on the latest driver.

Power8 P2P support depends on a number of factors.

Systems with K80 may be (older) S822LC or S824L systems, and they have no defined P2P support except (perhaps) between the two GPUs that together comprise a single K80 unit (ignoring the particular issue here, whatever it is). I actually don’t know for certain that P2P is officially supported even between the 2 GPUs on a single K80, because it requires special programming of the PLX bridge in the K80, and I don’t know for sure that these older IBM systems do that.

P2P support on Minsky (S822LC for HPC) should exist between the 2 GPUs on each of the two CPU islands. There is no P2P support across the X-bus (so a GPU connected to one processor socket does not have P2P access to a GPU on the other processor socket, similar to what is common in x86-land).
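
In application code this kind of topology is usually handled by guarding peer access per pair rather than assuming it. A rough sketch of the usual pattern (not specific to any framework; helper name is my own):

// Sketch: enable direct peer access only where the platform reports support,
// and rely on cudaMemcpyPeer (which stages through the host otherwise).
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: copy 'bytes' from a buffer on srcDev to a buffer on dstDev.
void copyBetweenGpus(void* dst, int dstDev, const void* src, int srcDev, size_t bytes) {
    int can = 0;
    cudaDeviceCanAccessPeer(&can, dstDev, srcDev);
    if (can) {
        cudaSetDevice(dstDev);
        // Re-enabling returns cudaErrorPeerAccessAlreadyEnabled, which is harmless here.
        cudaDeviceEnablePeerAccess(srcDev, 0);
    }
    // cudaMemcpyPeer works either way: direct if peer access is enabled,
    // staged through host memory if it is not.
    cudaMemcpyPeer(dst, dstDev, src, srcDev, bytes);
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("Need at least 2 GPUs\n"); return 0; }

    const size_t bytes = 1 << 20;
    void *a = nullptr, *b = nullptr;
    cudaSetDevice(0); cudaMalloc(&a, bytes);
    cudaSetDevice(1); cudaMalloc(&b, bytes);

    copyBetweenGpus(b, 1, a, 0, bytes);   // GPU 0 -> GPU 1
    cudaDeviceSynchronize();
    printf("Copy issued\n");

    cudaSetDevice(0); cudaFree(a);
    cudaSetDevice(1); cudaFree(b);
    return 0;
}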

P2P support on future P9 ppc64le systems may advance beyond this, but I’m not able to discuss details at this time, and anyway support for these systems should be available through IBM.

Yes, if a newer driver fixed it then I would move to the newer driver and not file a bug. In this case the problem exists in the latest driver that I see for CUDA 8.0, which is 384.81. I’m having some trouble getting apt to install 384.66 to see if that one is also affected, but I guess I can work on that if the developers really need to know.
My systems are S822LC, but I didn’t think that was Minsky. With the old driver, the system that doesn’t exhibit the problem only shows P2P between the first 2 and the second 2 GPUs that show up to the OS. I believe those are the 2 halves of the same unit. I’m not expecting P2P between the 2 units because I know that they are connected to different CPU sockets.

In general, newer drivers support older versions of CUDA (the reverse is not true: each CUDA version has a minimum driver version required to run it). I would assume this applies to the PowerPC platform as well, although I couldn’t say for sure, never having used it.
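
If you want to double-check what a given box is actually running, both the driver-supported CUDA version and the runtime version can be queried from the runtime API; a small sketch:

// version_check.cu -- sketch: print the CUDA version supported by the driver and the runtime version.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime version the application was linked against
    // Both are encoded as 1000*major + 10*minor, e.g. 8000 for CUDA 8.0.
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 100) / 10,
           runtimeVer / 1000, (runtimeVer % 100) / 10);
    return 0;
}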

There should be no risk in trying the latest available driver for your hardware platform. If, for any reason, it doesn’t work, you can simply revert to the driver you are running now.

There are multiple S822LC systems from IBM. The specific moniker “S822LC for HPC” is Minsky. You would not find K80s plugged into a Minsky box. There are older S822LC systems that did ship with K80s (and S824L, also).

I don’t know what P2P behavior to expect for K80s in an S822LC. It’s entirely possible that an issue was found in supporting P2P and support was dropped from newer drivers. Anyway, filing a bug and/or escalating through IBM is the proper path if you want to investigate.

I’ve also edited my previous post.