Hi everyone,
I am running a dual Titan V setup with a Ryzen 2700X and 32GB DDR4-3200 on an ASUS ROG STRIX X470-F mainboard. My CUDA version is 10.0 and my driver is 410.79. The operating system is Ubuntu 18.04.
Training deep neural networks on this computer works very nicely as long as I am not distributing one network over both GPUs. Training two networks in parallel, each on its own GPU, is fine; training one network by splitting the minibatch between the GPUs is not. The latter results in either a complete system freeze, or both GPUs reporting 100% GPU-Util in nvidia-smi while not a single batch is being processed.
More curiously, I get very consistent system freezes whenever I try to run simpleP2P from the CUDA samples. After a while, my `watch nvidia-smi` terminal shows ERR! for both fan speed and wattage, and I cannot do anything with the system except press the reset button :-/
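For reference, the minimal peer-access sequence that triggers the problem should be roughly the following. This is only a sketch of what simpleP2P does internally based on my reading of the sample, not the sample itself; the device ordinals 0 and 1 match my deviceQuery output below.

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort with a message on any CUDA runtime error. */
#define CHECK(call) do {                                  \
    cudaError_t err = (call);                             \
    if (err != cudaSuccess) {                             \
        fprintf(stderr, "%s failed: %s\n", #call,         \
                cudaGetErrorString(err));                 \
        exit(1);                                          \
    }                                                     \
} while (0)

int main(void) {
    /* Check that each GPU can address the other's memory. */
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
    CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);
    if (!can01 || !can10) return 0;

    /* Enable peer access in both directions. */
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceEnablePeerAccess(0, 0));

    /* Allocate a buffer on each GPU and do a peer copy,
       i.e. the same kind of transfer simpleP2P times. */
    const size_t n = 64 << 20;  /* 64 MiB */
    void *buf0, *buf1;
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf0, n));
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&buf1, n));
    CHECK(cudaMemcpyPeer(buf1, 1, buf0, 0, n));
    CHECK(cudaDeviceSynchronize());
    printf("peer copy done\n");
    return 0;
}
```

If I understand the samples correctly, it is the cudaMemcpyPeer step where my machine locks up.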
I am very confused as to why this happens. I have reinstalled CUDA and the graphics driver to try to fix the problem, with no success.
I hope you can help me!
Best,
Fabian
Here is some debug information in case you need it:
deviceQuery output
(dl_venv) fabian@Fabian-ubuntu:~/samples/bin/x86_64/linux/release$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "TITAN V"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 12034 MBytes (12618760192 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1455 MHz (1.46 GHz)
Memory Clock rate: 850 Mhz
Memory Bus Width: 3072-bit
L2 Cache Size: 4718592 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 8 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "TITAN V"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 12037 MBytes (12621381632 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1455 MHz (1.46 GHz)
Memory Clock rate: 850 Mhz
Memory Bus Width: 3072-bit
L2 Cache Size: 4718592 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 9 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from TITAN V (GPU0) -> TITAN V (GPU1) : Yes
> Peer access from TITAN V (GPU1) -> TITAN V (GPU0) : Yes
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 2
Result = PASS
nvidia-smi topo -m
(dl_venv) fabian@Fabian-ubuntu:~/samples/bin/x86_64/linux/release$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity
GPU0 X PHB 0-15
GPU1 PHB X 0-15
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
nvidia-smi before freeze
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN V Off | 00000000:08:00.0 On | N/A |
| 29% 42C P2 36W / 250W | 571MiB / 12034MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN V Off | 00000000:09:00.0 Off | N/A |
| 30% 41C P8 24W / 250W | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1288 G /usr/lib/xorg/Xorg 34MiB |
| 0 1326 G /usr/bin/gnome-shell 58MiB |
| 0 1631 G /usr/lib/xorg/Xorg 255MiB |
| 0 1765 G /usr/bin/gnome-shell 124MiB |
| 0 2132 G ...uest-channel-token=14444829368892240909 96MiB |
+-----------------------------------------------------------------------------+
nvidia-smi after freeze
https://www.dropbox.com/s/oi04dqw3ca7j8u0/IMG_20190206_221744.jpg?dl=0
(I can’t interact with the computer, so I had to take a picture.)
simpleP2P output
https://www.dropbox.com/s/r05pq2d17zkqngp/IMG_20190206_221737.jpg?dl=0
(Also, why is the cudaMemcpyPeer transfer so slow, at under 1 GB/s?)