simpleP2P example and multi-GPU network training cause system freeze and ERR in nvidia-smi

Hi everyone,
I am running a dual Titan V setup with a Ryzen 2700X and 32GB DDR4-3200 on an ASUS ROG STRIX X470-F mainboard. My CUDA version is 10.0 and my driver is 410.79. The operating system is Ubuntu 18.04.
Training deep neural networks on this computer works very nicely as long as I am not distributing one network over both GPUs. Training 2 networks in parallel, each on its own GPU, is fine; training one network by splitting the minibatch between the GPUs is not. The latter results in either a system freeze or both GPUs reporting 100% GPU util in nvidia-smi while not a single batch is being processed.
More curiously, I get very consistent system freezes whenever I try to run simpleP2P from the CUDA samples. After a while, my ‘watch nvidia-smi’ terminal will show ERR! for both fan speed and wattage, but I am unable to do anything with the system except press the reboot button :-/
I am very confused as to why this happens. I reinstalled CUDA and the graphics driver trying to fix this problem, with no success!
I hope you can help me!
Best,
Fabian

Here is some debug information in case you need it:

device query

(dl_venv) fabian@Fabian-ubuntu:~/samples/bin/x86_64/linux/release$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "TITAN V"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 12034 MBytes (12618760192 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1455 MHz (1.46 GHz)
  Memory Clock rate:                             850 Mhz
  Memory Bus Width:                              3072-bit
  L2 Cache Size:                                 4718592 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 8 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "TITAN V"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 12037 MBytes (12621381632 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1455 MHz (1.46 GHz)
  Memory Clock rate:                             850 Mhz
  Memory Bus Width:                              3072-bit
  L2 Cache Size:                                 4718592 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 9 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from TITAN V (GPU0) -> TITAN V (GPU1) : Yes
> Peer access from TITAN V (GPU1) -> TITAN V (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 2
Result = PASS

nvidia-smi topo -m

(dl_venv) fabian@Fabian-ubuntu:~/samples/bin/x86_64/linux/release$ nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity
GPU0	 X 	PHB	0-15
GPU1	PHB	 X 	0-15

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
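
As a rough illustration of how the matrix above can be read, here is a minimal Python sketch that turns the pasted text into (source, destination) link types. The function name `parse_topo_matrix` is made up for this example, and the parsing assumes the tab-separated layout exactly as pasted above; other nvidia-smi versions may format the table differently.

```python
def parse_topo_matrix(text):
    """Parse the GPU-to-GPU part of an `nvidia-smi topo -m` matrix.

    Assumes tab-separated cells, with the GPU columns directly after
    the row label (as in the output pasted above).
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    header = [cell.strip() for cell in rows[0]]
    gpu_names = [cell for cell in header if cell.startswith("GPU")]
    links = {}
    for cells in rows[1:]:
        src = cells[0].strip()
        if src not in gpu_names:
            continue  # skip legend lines or anything that is not a GPU row
        for dst, cell in zip(gpu_names, cells[1:]):
            links[(src, dst)] = cell.strip()
    return links


# The matrix from this post, copied as text.
sample = (
    "\tGPU0\tGPU1\tCPU Affinity\n"
    "GPU0\t X \tPHB\t0-15\n"
    "GPU1\tPHB\t X \t0-15\n"
)

for (src, dst), link in parse_topo_matrix(sample).items():
    if src != dst:
        print(f"{src} -> {dst}: {link}")
```

For this system both directions come out as PHB, i.e. traffic between the two GPUs crosses the PCIe host bridge rather than a PCIe switch or NVLink.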

nvidia-smi before freeze

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:08:00.0  On |                  N/A |
| 29%   42C    P2    36W / 250W |    571MiB / 12034MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:09:00.0 Off |                  N/A |
| 30%   41C    P8    24W / 250W |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1288      G   /usr/lib/xorg/Xorg                            34MiB |
|    0      1326      G   /usr/bin/gnome-shell                          58MiB |
|    0      1631      G   /usr/lib/xorg/Xorg                           255MiB |
|    0      1765      G   /usr/bin/gnome-shell                         124MiB |
|    0      2132      G   ...uest-channel-token=14444829368892240909    96MiB |
+-----------------------------------------------------------------------------+

nvidia-smi after freeze
https://www.dropbox.com/s/oi04dqw3ca7j8u0/IMG_20190206_221744.jpg?dl=0

(can’t interact with the computer so I had to take a picture)

simpleP2P output
https://www.dropbox.com/s/r05pq2d17zkqngp/IMG_20190206_221737.jpg?dl=0

(Why is the cudaMemcpyPeer command so slow (sub 1 GB/s)?)

Make sure iommu is disabled.
https://devtalk.nvidia.com/default/topic/1038209/cuda-programming-and-performance/p2p-access-hangs-the-system-simplep2p-doesn-t-work-/
https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/1
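
If you want to check from within Linux whether an IOMMU appears to be active, something like the following Python sketch may help. It is only a heuristic under two assumptions: that the kernel exposes IOMMU groups under /sys/kernel/iommu_groups, and that any IOMMU-related boot options show up in /proc/cmdline. The helper names `iommu_cmdline_flags` and `count_iommu_groups` are made up for this example. The authoritative setting is still the one in the BIOS.

```python
from pathlib import Path
from typing import List


def iommu_cmdline_flags(cmdline: str) -> List[str]:
    """Return the kernel command-line tokens that mention the IOMMU,
    e.g. 'amd_iommu=off' or 'iommu=pt'."""
    return [tok for tok in cmdline.split() if "iommu" in tok.lower()]


def count_iommu_groups(root: str = "/sys/kernel/iommu_groups") -> int:
    """Count the IOMMU groups listed under `root`.

    Zero groups (or a missing directory) suggests that no IOMMU is
    active, but this is a hint, not proof.
    """
    path = Path(root)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_dir())


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    cmdline = cmdline_file.read_text() if cmdline_file.exists() else ""
    print("IOMMU-related kernel parameters:", iommu_cmdline_flags(cmdline) or "none")
    print("IOMMU groups found:", count_iommu_groups())
```

A non-zero group count before the BIOS change and zero afterwards would be consistent with the IOMMU having been switched off.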

Hi,
thank you very much! Disabling IOMMU in the BIOS resolved my issue. Could you please elaborate on how IOMMU interferes with GPU-GPU communication?
Best,
Fabian

The IOMMU provides device isolation and access control on the PCIe bus, so for two PCIe-connected devices to communicate P2P it has to be configured properly (including by the BIOS) or simply turned off.
Be aware that on some AMD chipsets/boards, turning off the IOMMU controller can affect how the USB ports function.

Thank you for this detailed reply! I am now getting just short of 4 GB/s P2P transfer speeds. That is quite a lot lower than I would expect (given that I get 10 GB/s on a dual TitanX PC, 20-30 GB/s on a DGX-1 and 130 GB/s on a DGX-2). The dual TitanXp workstation has both GPUs in PCIe x16 mode while my Titan Vs run in PCIe x8 (due to the limited PCIe lanes on my 2700X). Could that be the reason?

Yes, this is due to the lack of PCIe lanes on the one hand, and on the other the Titan V is special in that it doesn’t support NVLink bridges, which would greatly improve P2P speeds. I don’t know if you’re using NVLink bridges on the Titan X system, but more PCIe lanes alone already improve the situation. The DGX-1 uses NVLink2 and the DGX-2 uses NVSwitch, which vastly raises interconnect speeds.
So the board and CPU you’re using right now aren’t made for SLI-like setups, and the Titan V isn’t well suited to those kinds of use cases either.
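
To put the lane count into numbers, here is a small back-of-the-envelope Python sketch. It assumes both systems run PCIe 3.0 (8 GT/s per lane with 128b/130b encoding), which is not stated in this thread, and it only gives the theoretical per-direction ceiling. The function name `pcie_bandwidth_gb_s` is made up for this example.

```python
def pcie_bandwidth_gb_s(lanes, gt_per_s=8.0, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Defaults are the PCIe 3.0 figures: 8 GT/s per lane and
    128b/130b line encoding. Dividing by 8 converts bits to bytes.
    """
    return lanes * gt_per_s * encoding / 8.0


# Measured P2P figures quoted in this thread (GB/s), keyed by link width.
measured = {8: 4.0, 16: 10.0}

for lanes, observed in measured.items():
    peak = pcie_bandwidth_gb_s(lanes)
    print(f"PCIe 3.0 x{lanes}: peak {peak:.2f} GB/s, "
          f"observed ~{observed:.0f} GB/s ({observed / peak:.0%} of peak)")
```

Under that assumption, x8 tops out just under 8 GB/s and x16 just under 16 GB/s per direction, so halving the lanes halves the ceiling. Real P2P copies land below the ceiling because of protocol overhead and, on this board, the detour through the host bridge.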

I am aware that I will never get the speeds of the DGX-1/2 on PCIe; that was just to put it into perspective (and because I was curious what speeds they would be getting).
The TitanX setup is PCIe only, no other interconnect.
Thank you for your help! Very much appreciated!