Hello,
I am trying to run TensorFlow on a machine with 4 Tesla K80s installed, but when the TensorFlow application launches, the GPUs apparently cannot talk to each other. Peer access is reported as not supported between every pair of devices:
2017-10-22 19:15:29.254697: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x10029b69bc0
2017-10-22 19:15:29.303694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 2 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0004:03:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2017-10-22 19:15:29.304429: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x10029b5dbc0
2017-10-22 19:15:29.354736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 3 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0004:04:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2017-10-22 19:15:29.354794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 1
2017-10-22 19:15:29.354813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 2
2017-10-22 19:15:29.354830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 0 and 3
2017-10-22 19:15:29.354847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 0
2017-10-22 19:15:29.354864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 2
2017-10-22 19:15:29.354879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 1 and 3
2017-10-22 19:15:29.354895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 0
2017-10-22 19:15:29.354911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 1
2017-10-22 19:15:29.354928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 2 and 3
2017-10-22 19:15:29.354943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 0
2017-10-22 19:15:29.354959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 1
2017-10-22 19:15:29.354974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:779] Peer access not supported between device ordinals 3 and 2
2017-10-22 19:15:29.355070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1 2 3
2017-10-22 19:15:29.355079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y N N N
2017-10-22 19:15:29.355088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1: N Y N N
2017-10-22 19:15:29.355096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 2: N N Y N
2017-10-22 19:15:29.355103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 3: N N N Y
2017-10-22 19:15:29.355130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0002:03:00.0)
2017-10-22 19:15:29.355141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0002:04:00.0)
2017-10-22 19:15:29.355152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0004:03:00.0)
2017-10-22 19:15:29.355161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0004:04:00.0)
2017-10-22 19:15:29.625719: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 4 visible devices
2017-10-22 19:15:29.625756: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 80 visible devices
2017-10-22 19:15:29.629803: I tensorflow/compiler/xla/service/service.cc:183] XLA service 0x10028f1e140 executing computations on platform Host. Devices:
2017-10-22 19:15:29.629844: I tensorflow/compiler/xla/service/service.cc:191] StreamExecutor device (0): <undefined>, <undefined>
2017-10-22 19:15:29.630685: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 4 visible devices
2017-10-22 19:15:29.630698: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 80 visible devices
2017-10-22 19:15:29.634661: I tensorflow/compiler/xla/service/service.cc:183] XLA service 0x10028f865d0 executing computations on platform CUDA. Devices:
2017-10-22 19:15:29.634702: I tensorflow/compiler/xla/service/service.cc:191] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.634733: I tensorflow/compiler/xla/service/service.cc:191] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.634758: I tensorflow/compiler/xla/service/service.cc:191] StreamExecutor device (2): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.634783: I tensorflow/compiler/xla/service/service.cc:191] StreamExecutor device (3): Tesla K80, Compute Capability 3.7
2017-10-22 19:15:29.647023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0002:03:00.0)
2017-10-22 19:15:29.647043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0002:04:00.0)
2017-10-22 19:15:29.647053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0004:03:00.0)
2017-10-22 19:15:29.647063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0004:04:00.0)
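To make the log easier to read, here is a minimal Python sketch that rebuilds the DMA matrix from the "Peer access not supported" lines. It is not part of TensorFlow; the names peer_matrix and format_matrix are hypothetical helpers of my own. It only parses log text and does not query the driver, so it assumes the log lists every device pair that lacks peer access.

```python
import re

# Pattern for the TensorFlow log lines quoted above, e.g.
# "... Peer access not supported between device ordinals 0 and 1"
PEER_DENIED = re.compile(
    r"Peer access not supported between device ordinals (\d+) and (\d+)"
)


def peer_matrix(log_text, num_devices):
    """Build matrix[i][j]; True when no 'not supported' line names the pair (i, j)."""
    matrix = [[True] * num_devices for _ in range(num_devices)]
    for src, dst in PEER_DENIED.findall(log_text):
        matrix[int(src)][int(dst)] = False
    return matrix


def format_matrix(matrix):
    """Render the matrix in the same Y/N layout as the 'DMA:' lines in the log."""
    header = "DMA: " + " ".join(str(i) for i in range(len(matrix)))
    rows = [
        f"{i}: " + " ".join("Y" if ok else "N" for ok in row)
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    # Two shortened lines in the style of my log; paste the full log to cover all pairs.
    sample = (
        "gpu_device.cc:779] Peer access not supported between device ordinals 0 and 1\n"
        "gpu_device.cc:779] Peer access not supported between device ordinals 1 and 0\n"
    )
    print(format_matrix(peer_matrix(sample, num_devices=2)))
```

Run against the full log above with num_devices=4, this should give the same picture as the DMA lines: each GPU can reach only itself.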
Here is the output from nvidia-smi -a:
==============NVSMI LOG==============
Timestamp : Sun Oct 22 18:53:32 2017
Driver Version : 384.66
Attached GPUs : 4
GPU 00000002:03:00.0
Product Name : Tesla K80
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322015005586
GPU UUID : GPU-79fbbca5-be02-5eb7-a9c0-1111f4153358
Minor Number : 2
VBIOS Version : 80.21.1B.00.01
MultiGPU Board : No
Board ID : 0x20300
GPU Part Number : 900-22080-0404-030
Inforom Version
Image Version : 2080.0200.00.04
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0002
Device Id : 0x102D10DE
Bus Id : 00000002:03:00.0
Sub System Id : 0x106C10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
FB Memory Usage
Total : 11439 MiB
Used : 0 MiB
Free : 11439 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Texture Shared : N/A
CBU : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 36 C
GPU Shutdown Temp : 93 C
GPU Slowdown Temp : 88 C
Memory Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 61.20 W
Power Limit : 149.00 W
Default Power Limit : 149.00 W
Enforced Power Limit : 149.00 W
Min Power Limit : 100.00 W
Max Power Limit : 175.00 W
Clocks
Graphics : 562 MHz
SM : 562 MHz
Memory : 2505 MHz
Video : 540 MHz
Applications Clocks
Graphics : 562 MHz
Memory : 2505 MHz
Default Applications Clocks
Graphics : 562 MHz
Memory : 2505 MHz
Max Clocks
Graphics : 875 MHz
SM : 875 MHz
Memory : 2505 MHz
Video : 540 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes : None
GPU 00000002:04:00.0
...
GPU 00000004:03:00.0
...
GPU 00000004:04:00.0
...
Could anyone advise on how to enable P2P access between these GPUs?
Thanks.