P2P Communication Fails 1080ti->1080ti. IOMMU & ACS disabled

I am having trouble getting P2P communication to work between two 1080 Ti cards. Calling the CUDA function

cudaDeviceEnablePeerAccess

causes the GUI to freeze and the program to hang indefinitely. I can, however, run all of the NCCL examples across both GPUs successfully. I first discovered this problem when attempting to distribute training across both GPUs via TensorFlow. After researching the problem further, I was able to confirm that it can be reproduced with the simpleP2P and p2pBandwidthLatencyTest CUDA samples.
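Stripped down, the failing call sequence looks roughly like this (my own minimal repro sketch rather than the simpleP2P source; error checking omitted for brevity). The capability query reports that peer access is supported in both directions, but the enable call never returns and the GUI freezes at that point:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Capability query succeeds and reports 1 in both directions on this system.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access possible 0->1: %d, 1->0: %d\n", can01, can10);

    // Enabling peer access is where everything hangs.
    cudaSetDevice(0);
    cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);  // never returns on this system
    printf("cudaDeviceEnablePeerAccess: %s\n", cudaGetErrorString(err));
    return 0;
}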

I have thoroughly reviewed the existing information on this and have taken the following actions:
- disabled IOMMU in both grub and BIOS
- disabled ACS in BIOS
- upgraded BIOS
- upgraded NCCL via apt

System Specs:
- Ubuntu 19.04
- AMD Ryzen 7 2700X
- ASUS ROG B450-F Gaming Board
- 2 X NVIDIA GTX 1080ti (EVGA) - PCIex16_1 & PCIex16_2
- 32GB 2440MHz RAM
- CUDA 10.1 (Main Install, Other versions on system: 9.0 & 10.0)
- NVIDIA Driver: 418.56
- NCCL: 2.5.6

Output from deviceQuery (v10.1):

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11170 MBytes (11713052672 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 10 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1080 Ti (GPU0) -> GeForce GTX 1080 Ti (GPU1) : Yes
> Peer access from GeForce GTX 1080 Ti (GPU1) -> GeForce GTX 1080 Ti (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS

Output from

nvidia-smi topo -m

        GPU0    GPU1    CPU Affinity
GPU0     X      PHB     0-15
GPU1    PHB      X      0-15

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Output from

nvidia-smi

while idle

Sat Feb 15 10:04:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0  On |                  N/A |
| 11%   52C    P5    18W / 250W |    664MiB / 11170MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   36C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2099      G   /usr/lib/xorg/Xorg                           271MiB |
|    0      2856      G   /usr/bin/kwin_x11                             96MiB |
|    0      2861      G   /usr/bin/krunner                               2MiB |
|    0      2863      G   /usr/bin/plasmashell                          51MiB |
|    0      2898      G   /usr/bin/latte-dock                           35MiB |
|    0      3298      G   /usr/bin/akonadi_archivemail_agent             3MiB |
|    0      3307      G   /usr/bin/akonadi_mailfilter_agent              3MiB |
|    0      3311      G   /usr/bin/akonadi_sendlater_agent               3MiB |
|    0      3312      G   /usr/bin/akonadi_unifiedmailbox_agent          3MiB |
|    0      3792      G   ...AAAAAAAAAAAAAAgAAAAAAAAA --shared-files   152MiB |
|    0     21472      G   ...equest-channel-token=935638389533155474    38MiB |
+-----------------------------------------------------------------------------+

Output from the NCCL test (v2.5.6)

all_reduce_perf -g 2

# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   2343 on     ubuntu device  0 [0x09] GeForce GTX 1080 Ti
#   Rank  1 Pid   2343 on     ubuntu device  1 [0x0a] GeForce GTX 1080 Ti
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608   float     sum    16523    2.03    2.03  0e+00    16525    2.03    2.03  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.03066 
#

Attached is the nvidia-bug-report.sh output file. I had to terminate the script after it hung for 15+ minutes, so the report may not be complete. I could not run simpleP2P before generating the bug report because it froze the machine and prevented me from logging in over SSH from another machine. I was, however, able to run p2pBandwidthLatencyTest and then successfully execute the nvidia-bug-report.sh script from a remote shell. To be clear, p2pBandwidthLatencyTest exhibits nearly the same behavior as simpleP2P; the only difference is that I could still access the machine remotely after running p2pBandwidthLatencyTest, but not after running simpleP2P.

I am happy to share any information that would help find a solution. Thanks for any and all help!

Update: I corrected this issue with a complete wipe of all CUDA and NVIDIA driver packages from the system. Once the system was fully purged, I rebooted and installed NVIDIA driver 418.87 and CUDA 10.1 (Update 2). After doing so, with IOMMU and ACS still disabled, I was able to successfully execute all P2P operations.
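For anyone hitting the same problem, a quick way to sanity-check that peer access works after the reinstall is a small round-trip copy between the two GPUs (a rough sketch along the lines of what the simpleP2P sample does, not the sample itself; error checking omitted):

#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;                   // 1 MiB test buffer
    std::vector<char> src(N, 0x5A), dst(N, 0);

    // Enable peer access in both directions (this is the call that used to hang).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Allocate a buffer on each GPU and stage the test pattern on GPU 0.
    void *d0 = nullptr, *d1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&d0, N);
    cudaMemcpy(d0, src.data(), N, cudaMemcpyHostToDevice);
    cudaSetDevice(1);
    cudaMalloc(&d1, N);

    // Direct GPU0 -> GPU1 copy over P2P, then read back and compare.
    cudaMemcpyPeer(d1, 1, d0, 0, N);
    cudaMemcpy(dst.data(), d1, N, cudaMemcpyDeviceToHost);
    printf("P2P copy %s\n", memcmp(src.data(), dst.data(), N) == 0 ? "OK" : "FAILED");

    cudaSetDevice(0);
    cudaFree(d0);
    cudaSetDevice(1);
    cudaFree(d1);
    return 0;
}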

I have removed the sensitive system information (i.e., the nvidia-bug-report and dmesg attachments) but will leave the rest of the details up for anyone facing a similar issue.