P2P Communication Fails 1080ti->1080ti. IOMMU & ACS disabled

I am having trouble getting P2P communication to work between two 1080 Ti cards. Calling the CUDA function

cudaDeviceEnablePeerAccess

causes the GUI to freeze and the program to hang indefinitely. I can, however, run all of the NCCL examples across both GPUs successfully. I first discovered this problem when attempting to distribute training across both GPUs via TensorFlow. After researching the problem further, I was able to confirm that it can be reproduced with the simpleP2P and p2pBandwidthLatencyTest CUDA samples.
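Stripped down, the failing call sequence looks roughly like this (my own minimal repro sketch rather than the simpleP2P source; error checking omitted for brevity). The capability query reports that peer access is supported in both directions, but the enable call never returns and the GUI freezes at that point:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Capability query succeeds and reports 1 in both directions on this system.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access possible 0->1: %d, 1->0: %d\n", can01, can10);

    // Enabling peer access is where everything hangs.
    cudaSetDevice(0);
    cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);  // never returns on this system
    printf("cudaDeviceEnablePeerAccess: %s\n", cudaGetErrorString(err));
    return 0;
}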

I have thoroughly reviewed the existing information on this and have taken the following actions:
- disabled IOMMU in both grub and BIOS
- disabled ACS in BIOS
- upgraded BIOS
- upgraded NCCL via apt

System Specs:
- Ubuntu 19.04
- AMD Ryzen 7 2700X
- ASUS ROG B450-F Gaming Board
- 2 X NVIDIA GTX 1080ti (EVGA) - PCIex16_1 & PCIex16_2
- 32GB 2440MHz RAM
- CUDA 10.1 (Main Install, Other versions on system: 9.0 & 10.0)
- NVIDIA Driver: 418.56
- NCCL: 2.5.6

Output from deviceQuery (v10.1):

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11170 MBytes (11713052672 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 10 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS

Output from

nvidia-smi topo -m

        GPU0    GPU1    CPU Affinity
GPU0     X      PHB     0-15
GPU1    PHB      X      0-15

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Output from

nvidia-smi

while idle

Sat Feb 15 10:04:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0  On |                  N/A |
| 11%   52C    P5    18W / 250W |    664MiB / 11170MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   36C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2099      G   /usr/lib/xorg/Xorg                           271MiB |
|    0      2856      G   /usr/bin/kwin_x11                             96MiB |
|    0      2861      G   /usr/bin/krunner                               2MiB |
|    0      2863      G   /usr/bin/plasmashell                          51MiB |
|    0      2898      G   /usr/bin/latte-dock                           35MiB |
|    0      3298      G   /usr/bin/akonadi_archivemail_agent             3MiB |
|    0      3307      G   /usr/bin/akonadi_mailfilter_agent              3MiB |
|    0      3311      G   /usr/bin/akonadi_sendlater_agent               3MiB |
|    0      3312      G   /usr/bin/akonadi_unifiedmailbox_agent          3MiB |
|    0      3792      G   ...AAAAAAAAAAAAAAgAAAAAAAAA --shared-files   152MiB |
|    0     21472      G   ...equest-channel-token=935638389533155474    38MiB |
+-----------------------------------------------------------------------------+

Output from the NCCL test (v2.5.6)

all_reduce_perf -g 2

# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   2343 on     ubuntu device  0 [0x09] GeForce GTX 1080 Ti
#   Rank  1 Pid   2343 on     ubuntu device  1 [0x0a] GeForce GTX 1080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608   float     sum    16523    2.03    2.03  0e+00    16525    2.03    2.03  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.03066 
#

Attached is the nvidia-bug-report.sh output file. I had to terminate the script after it hung for 15+ minutes, so the report may not be complete. I could not run simpleP2P before generating the bug report because it froze the machine and prevented me from logging in over SSH from another machine. I was, however, able to run p2pBandwidthLatencyTest and then successfully execute the nvidia-bug-report.sh script from a remote shell. To be clear, p2pBandwidthLatencyTest exhibits nearly the same behavior as simpleP2P; the only difference is that I could still access the machine remotely after running p2pBandwidthLatencyTest, but not after running simpleP2P.

I am happy to share any information that would help find a solution. Thanks for any and all help!

Update: I corrected this issue with a complete wipe of all CUDA and NVIDIA driver packages from the system. Once the system was fully purged, I rebooted and installed NVIDIA driver 418.87 and CUDA 10.1 (Update 2). After doing so, with IOMMU and ACS still disabled, I was able to successfully execute all P2P operations.
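For anyone hitting the same problem, a quick way to sanity-check that peer access works after the reinstall is a small round-trip copy between the two GPUs (a rough sketch along the lines of what the simpleP2P sample does, not the sample itself; error checking omitted):

#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;                   // 1 MiB test buffer
    std::vector<char> src(N, 0x5A), dst(N, 0);

    // Enable peer access in both directions (this is the call that used to hang).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate a buffer on each GPU and stage the test pattern on GPU 0.
    void *d0 = nullptr, *d1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&d0, N);
    cudaMemcpy(d0, src.data(), N, cudaMemcpyHostToDevice);
    cudaSetDevice(1);
    cudaMalloc(&d1, N);

    // Direct GPU0 -> GPU1 copy over P2P, then read back and compare.
    cudaMemcpyPeer(d1, 1, d0, 0, N);
    cudaMemcpy(dst.data(), d1, N, cudaMemcpyDeviceToHost);
    printf("P2P copy %s\n", memcmp(src.data(), dst.data(), N) == 0 ? "OK" : "FAILED");

    cudaSetDevice(0);
    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    return 0;
}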

I have removed the sensitive system information (i.e., the nvidia-bug-report and dmesg attachments) but will leave the rest of the details up for anyone facing a similar issue.