I am having trouble getting P2P communication to work between two GTX 1080 Ti cards. Calling the CUDA function cudaDeviceEnablePeerAccess causes the GUI to freeze and the program to hang indefinitely. I can, however, run all of the NCCL examples across both GPUs successfully. I first discovered this problem while attempting to distribute training across both GPUs via TensorFlow; after researching it further, I confirmed that the hang can be reproduced with the simpleP2P and p2pBandwidthLatencyTest CUDA samples.
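For anyone who wants to reproduce this without building the samples, below is a minimal sketch of the failing sequence (essentially what simpleP2P does before its copy test; error handling trimmed). On my machine the first cudaDeviceEnablePeerAccess call never returns:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("canAccessPeer 0->1: %d, 1->0: %d\n", can01, can10);  // both report 1

    cudaSetDevice(0);
    // This is the call that hangs indefinitely and freezes the GUI.
    cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);
    printf("enable 0->1: %s\n", cudaGetErrorString(err));  // never reached

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    return 0;
}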
I have thoroughly reviewed the existing threads on this issue and have taken the following actions:
- disabled IOMMU in both GRUB and the BIOS
- disabled ACS in the BIOS
- upgraded the BIOS
- upgraded NCCL via apt
System Specs:
- Ubuntu 19.04
- AMD Ryzen 7 2700X
- ASUS ROG B450-F Gaming Board
- 2x NVIDIA GTX 1080 Ti (EVGA) in PCIEX16_1 & PCIEX16_2
- 32GB 2440MHz RAM
- CUDA 10.1 (main install; 9.0 and 10.0 also on the system)
- NVIDIA Driver: 418.56
- NCCL: 2.5.6
Output from deviceQuery (v10.1):
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11170 MBytes (11713052672 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1671 MHz (1.67 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 9 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11178 MBytes (11721506816 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1671 MHz (1.67 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 10 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS
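Note that the "Peer access ... Yes" lines above come from cudaDeviceCanAccessPeer, which only reports capability and maps nothing, so it completes without issue. If it helps, I can also dump the driver's P2P attributes directly with a small sketch like the following (cudaDeviceGetP2PAttribute has been available since CUDA 8):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    for (int dev = 0; dev < 2; ++dev) {
        int peer = 1 - dev, can = 0, sup = 0, atom = 0;
        cudaDeviceCanAccessPeer(&can, dev, peer);  // same check deviceQuery performs
        cudaDeviceGetP2PAttribute(&sup, cudaDevP2PAttrAccessSupported, dev, peer);
        cudaDeviceGetP2PAttribute(&atom, cudaDevP2PAttrNativeAtomicSupported, dev, peer);
        printf("GPU%d -> GPU%d: canAccessPeer=%d accessSupported=%d nativeAtomics=%d\n",
               dev, peer, can, sup, atom);
    }
    return 0;
}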
Output from nvidia-smi topo -m:
GPU0 GPU1 CPU Affinity
GPU0 X PHB 0-15
GPU1 PHB X 0-15
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
Output from nvidia-smi at rest:
Sat Feb 15 10:04:34 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:09:00.0 On | N/A |
| 11% 52C P5 18W / 250W | 664MiB / 11170MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:0A:00.0 Off | N/A |
| 0% 36C P8 8W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2099 G /usr/lib/xorg/Xorg 271MiB |
| 0 2856 G /usr/bin/kwin_x11 96MiB |
| 0 2861 G /usr/bin/krunner 2MiB |
| 0 2863 G /usr/bin/plasmashell 51MiB |
| 0 2898 G /usr/bin/latte-dock 35MiB |
| 0 3298 G /usr/bin/akonadi_archivemail_agent 3MiB |
| 0 3307 G /usr/bin/akonadi_mailfilter_agent 3MiB |
| 0 3311 G /usr/bin/akonadi_sendlater_agent 3MiB |
| 0 3312 G /usr/bin/akonadi_unifiedmailbox_agent 3MiB |
| 0 3792 G ...AAAAAAAAAAAAAAgAAAAAAAAA --shared-files 152MiB |
| 0 21472 G ...equest-channel-token=935638389533155474 38MiB |
+-----------------------------------------------------------------------------+
Output from the NCCL test all_reduce_perf -g 2 (v2.5.6):
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 2343 on ubuntu device 0 [0x09] GeForce GTX 1080 Ti
# Rank 1 Pid 2343 on ubuntu device 1 [0x0a] GeForce GTX 1080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum 16523 2.03 2.03 0e+00 16525 2.03 2.03 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 2.03066
#
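Incidentally, ~2 GB/s of bus bandwidth seems low for direct P2P over PCIe 3.0 x16, so I suspect NCCL may be succeeding by falling back to its shared-memory transport rather than using true P2P. For reference, the NCCL usage that does complete on this machine is a plain single-process, two-GPU allreduce along these lines (a sketch of what the perf test exercises, not its exact harness; error checks omitted):

#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int nDev = 2;
    const size_t count = 8 * 1024 * 1024;  // 8M floats = 32 MiB, matching the test above
    int devs[nDev] = {0, 1};
    ncclComm_t comms[nDev];
    cudaStream_t streams[nDev];
    float* buf[nDev];

    ncclCommInitAll(comms, nDev, devs);
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-device calls so NCCL can launch them as one collective.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
    printf("allreduce completed\n");

    for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
    return 0;
}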
Attached is the nvidia-bug-report.sh output file. I had to terminate the script after it hung for 15+ minutes, so the report may be incomplete. I could not capture a report after running simpleP2P, because that test freezes the machine entirely and locks out SSH from another machine. I was, however, able to run p2pBandwidthLatencyTest and then successfully execute nvidia-bug-report.sh from a remote shell. To be clear, p2pBandwidthLatencyTest exhibits nearly the same behavior as simpleP2P; the only difference is that the machine remains reachable over SSH after p2pBandwidthLatencyTest but not after simpleP2P.
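If it is useful as another data point: cudaMemcpyPeer is documented to work even when peer access has not been enabled (the runtime falls back to staging the copy through host memory), so a sketch like the one below should show whether the hang is specific to establishing the direct peer mapping, rather than to device-to-device copies in general:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;  // 32 MiB, arbitrary test size
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // No cudaDeviceEnablePeerAccess call here: the runtime should stage
    // this copy through host memory instead of a direct peer mapping.
    cudaError_t err = cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(err));

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}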
I am happy to share any information that would help find a solution. Thanks for any and all help!