Multi-GPU performance incredibly slow

I got my hands on a system with two RTX 2080 Ti GPUs. For some reason, GPU performance is incredibly slow whenever I start any program that uses the GPUs.

For example, if I run deviceQuery, it takes around 49 seconds to show me the output. Is there any way I can make this faster? Am I doing something wrong?

(base) sayantan@cyan:deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce RTX 2080 Ti"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 11016 MBytes (11551440896 bytes)
  (68) Multiprocessors, ( 64) CUDA Cores/MP:     4352 CUDA Cores
  GPU Max Clock rate:                            1545 MHz (1.54 GHz)
  Memory Clock rate:                             7000 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 5767168 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 23 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce RTX 2080 Ti"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 11019 MBytes (11554717696 bytes)
  (68) Multiprocessors, ( 64) CUDA Cores/MP:     4352 CUDA Cores
  GPU Max Clock rate:                            1545 MHz (1.54 GHz)
  Memory Clock rate:                             7000 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 5767168 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 101 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : No
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS
(base) sayantan@cyan:deviceQuery$
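
(For reference, the ~49 seconds is just the wall-clock time of the run, measured with the shell's time builtin; almost all of it passes before the first line of output appears.)

# nearly all of the ~49 s elapses before any output, i.e. during CUDA initialization
time ./deviceQuery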

The result from

nvidia-smi topo -m

is as follows:

(base) sayantan@cyan:deviceQuery$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity
GPU0     X      SYS     0-15
GPU1    SYS      X      0-15

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
(base) sayantan@cyan:deviceQuery$

Edit 1:

Even TensorFlow is incredibly slow while I initialize my neural network; sometimes it takes a couple of minutes or even more. Similarly, starting LibreOffice on Ubuntu takes about 2-3 minutes and often crashes my DE altogether.

That would seem to indicate that the performance issues have nothing to do with the GPUs? What’s a DE, and what are the symptoms of “crashing” the DE?

I assume you have already excluded high system load, lack of available system memory (swapping), and slow disk access as potential root causes?

How much system memory does the host system have? Are you using the persistence daemon to keep the CUDA driver loaded (see: https://docs.nvidia.com/deploy/driver-persistence/index.html)?
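
A quick way to check whether persistence mode is currently enabled on each GPU is the persistence_mode query field (a minimal sketch; requires a reasonably recent driver):

# print the current persistence mode setting for every GPU
nvidia-smi --query-gpu=index,name,persistence_mode --format=csv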

Have you asked for help on an Ubuntu-specific forum? In these forums, I notice a specific pattern of issue reports for Linux systems that has been consistent for half a dozen years now: virtually all reports are from users with Ubuntu systems. I have to assume that is not just due to the popularity of this distro.

As it happens, LibreOffice makes use of the GPU for compute purposes; I don’t know exactly what it does with it.

I think DE refers to desktop environment, a way of referring to the specific desktop GUI, e.g. gdm.

DE stands for Desktop Environment. I shall edit my question to clear it up.

Yes, I have excluded all of those, mostly because when I remove either GPU from the two-GPU setup, things work perfectly. I am not particularly worried about my desktop environment crashing; what bothers me is that simple tasks such as deviceQuery, or GPU memory allocation when I initialize TensorFlow using

model.compile(optimizer=Adam(lr=1e-4), loss='mse')

take a considerably long time. I see no such performance issues when using just a single GPU.
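
To separate raw CUDA/driver initialization from the model-building code, one thing I could try is timing a bare session creation from the shell (a sketch, assuming a TF 1.x install to match the Adam(lr=...) syntax above):

# creating a Session makes TensorFlow initialize all visible GPUs,
# so this isolates CUDA startup cost from the rest of the model code
time python -c "import tensorflow as tf; tf.Session()"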

I shall try the persistence daemon and let you know if that helps.

That doesn’t make much sense to me. So each GPU works fine when installed by itself? And each works fine in either of the two PCIe slots? I am asking to establish whether there may be a problem that follows a particular GPU or a particular PCIe slot.

You would never want to run without the persistence daemon under Linux. It is a workaround for what I consider an OS design issue, as I do not see justification for any OS to dynamically load and unload GPU drivers.

Are both GPUs installed in PCIe gen3 x16 slots, and do they have a proper power supply (check the supplementary PCIe power connectors) and proper cooling?

Supplementary power to the GPUs should not use Y-splitters, 6-to-8-pin adapters, or daisy-chaining. What’s the nominal wattage of the power supply (PSU) in this system? What CPU is used and how much system memory is installed? I am wondering whether the PSU might have insufficient wattage.

The default cooling design of the GeForce RTX series is superior to NVIDIA’s earlier cooling solutions when the cards are used individually, but it can cause problems when multiple cards operate in the same system, in particular when they are in close spatial proximity (for a cogent discussion, see: https://www.pugetsystems.com/blog/2019/01/11/NVIDIA-RTX-Graphics-Card-Cooling-Issues-1326/ )
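
One way to sanity-check link width, power draw, and temperatures while the GPUs are under load is to poll nvidia-smi (a sketch using standard query fields; adjust the field list as needed):

# sample PCIe link gen/width, power draw vs. limit, and temperature once per second
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current,power.draw,power.limit,temperature.gpu --format=csv -l 1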

Yes, both GPUs are connected to Gen3 slots. The system has a 1500 W PSU, so I don’t think power should be the problem, although it would be interesting if I could check/confirm that power isn’t the issue.

The CPU is an 8-core Intel Core i9-9800X, which supports PCIe Gen3 and provides 44 PCIe lanes. There is also an NVMe SSD, but altogether everything should stay under the 44-lane limit. The machine has 64 GB of RAM, so I don’t think RAM is the bottleneck, at least not right away.

The cards I have are blower-style, and there is a one-slot gap between them.

I shall open up the computer once again to put both cards back in and try the persistence daemon. My hunch is that with two GPUs installed, one of them already has the X server running on it, so the driver stays loaded for that GPU, while for the other one the driver might not be loaded until I start code that needs the second GPU.
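
To check that hunch, the plain nvidia-smi process listing should show which GPU the X server (Xorg) is attached to:

# the process table at the bottom lists Xorg and the GPU index it is running on
nvidia-smi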

Based on the description, I don’t see any red flags in the hardware configuration, so it remains a mystery to me why the system works fine with one GPU but has issues with two of them. Given your system specifications, the power supply is sufficient. Make sure that each GPU is in an x16 slot, not an x4 slot.

Good luck!

I put both GPUs back in together, ran

sudo nvidia-smi -pm 1

and everything is now running as quickly as it should. It was just a driver persistence issue. Although the documentation says that without the persistence daemon the extra delay should only be a few seconds, for me it was several minutes. In any case, this works great. Thank you so much.
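
A note for anyone finding this later: persistence mode set with nvidia-smi -pm 1 does not survive a reboot. On systemd-based distros the NVIDIA driver packages usually ship the persistence daemon as a service that can be enabled at boot (the unit name below is an assumption; check your driver packaging):

# enable and start the persistence daemon so the driver stays initialized
# even when no client (X server, CUDA app) is holding a GPU open
sudo systemctl enable --now nvidia-persistenced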