Memory throughput problem between host and device with pinned memory

judits · April 24, 2015, 12:33pm

Hi,

I am running a tiled Cholesky application on a Linux system with 4 x K40 and I am experiencing really low memory throughput of host <–> device transfers, even using pinned memory with cudaMemcpyAsync and streams.

I use 3 streams for data transfers (one for D2H, one for H2D transfers and one for D2D transfers) and several other streams for kernel launches.

For every kernel, the data chunks it needs are first transferred asynchronously to the GPU and then, once the transfers have finished, the kernel is launched. All operations are asynchronous and I check for their completion with events, so I can overlap data transfers with computation.

Launching the application with nvprof and later importing the output file to nvvp, I see the following:

When running with only 1 GPU, everything works as expected: kernels overlap with data transfers, and the average throughput for host <--> device data transfers is around 10 GB/s for pinned chunks of 33.5 MB.

When I split the computation across 2 GPUs (keeping the same number of kernels, which means that additional device-to-device transfers must be issued; they are issued asynchronously in a separate stream), data transfers still overlap with kernels, but the average throughput of host <--> device transfers drops to only about 300 MB/s for pinned chunks of 33.5 MB. Some transfers (H2D, D2H or D2D) still reach around 8 GB/s, which is what I expect, but most get less than 400 MB/s. nvvp reports the memory as pinned and shows the transfers in a non-default stream (not stream 0), so I have no idea why this happens.
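
To put those numbers in perspective, here is a rough back-of-the-envelope calculation of how long one 33.5 MB chunk takes at each reported rate. It is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 GB/s = 1000 MB/s), which may not match how nvvp rounds its figures.

```python
def transfer_time_ms(chunk_mb: float, throughput_mb_per_s: float) -> float:
    """Time in milliseconds to move chunk_mb megabytes at the given rate."""
    return chunk_mb / throughput_mb_per_s * 1000.0


if __name__ == "__main__":
    chunk_mb = 33.5  # pinned chunk size reported above
    cases = [("1 GPU, ~10 GB/s", 10_000.0), ("2 GPUs, ~300 MB/s", 300.0)]
    for label, rate_mb_per_s in cases:
        ms = transfer_time_ms(chunk_mb, rate_mb_per_s)
        print(f"{label}: {ms:.2f} ms per chunk")
```

Under these assumptions, a chunk that takes about 3.35 ms at 10 GB/s takes roughly 112 ms at 300 MB/s, about 33 times longer.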

Any ideas about this throughput slowdown? Does using 2 GPUs have some bad influence on memory bandwidth?

Please, let me know if you need further information, it’s my first post… :-)

Thanks in advance!

little_jimmy · April 24, 2015, 2:45pm

“but the throughput for host <–> device data transfers slows down to only 300 MB/s on average for pinned chunks of 33.5 MB”

how do you know this?

perhaps show the profiler output chart showing the time-lines

judits · April 24, 2015, 4:39pm

nvvp tells me the throughput of each transfer as I pass the mouse over them. Also, I can click on the timeline to see the average transfer throughput, the number of transfers, …

I’ll try to attach a couple of screenshots for both 1 GPU and 2 GPU cases.

little_jimmy · April 24, 2015, 7:27pm

i probably need to think more on this, but a few dabs so long:

all pci-slots on the mother board are the same? (in a sense, i doubt)
i am generally referring to the lanes

how many host threads do you use when you switch from 1 gpu to 2?

how many contexts does the profiler report for the 2 gpu case? 1 or 2?

Robert_Crovella · April 24, 2015, 8:18pm

There may be a system topology issue. Is this a dual socket server?

tera · April 24, 2015, 10:09pm

As you have 8 devices in your system, I assume you are using more than 8 streams then? Have you set CUDA_DEVICE_MAX_CONNECTIONS to the total number of streams you are using?

judits · April 27, 2015, 8:32am

To little_jimmy:

Yes, all PCI slots are the same.
Every GPU has a “dedicated” host thread, so in the case of 2 GPUs, I’m using 2 host threads, each one managing its own GPU.
The profiler reports 2 contexts, one for each GPU.

To txbob:

Yes, the machine has 2 sockets. The 2 GPUs I’m using (#0 and #1) are attached to the same socket, and I have called cudaDeviceEnablePeerAccess() for both.

To tera:

I’m using 3 streams for data transfers only:
- 1 stream for HtD
- 1 stream for DtH
- 1 stream for PtoP
  And 8 additional streams for kernels.
I’m not using CUDA_DEVICE_MAX_CONNECTIONS, I will try and get back to you.

Many thanks to all of you for your help!

njuffa · April 27, 2015, 8:46am

Are you using numactl to control CPU and memory affinity such that each GPU always communicates with the “near” CPU and memory? This is important in dual socket systems as NUMA effects can be quite pronounced.

Robert_Crovella · April 27, 2015, 11:48am

Agree with njuffa. I suspect a topology issue. You can use numactl to pin your processes to sockets that are topologically “near” to the GPUs you want to use. You can also use taskset, which I find to be simpler semantically to quickly get a read on this.

If your GPUs 0 and 1 are not actually connected to the same socket, then even this won’t sort it out for you - you’ll need to be pretty confident of your understanding of your system topology. If you are absolutely certain they are connected to the same socket, then run your app with:

taskset -c 0 ./my_app

and modify the 0 above to different values up to the CPU core count of your system. You will find the mapping of logical cores to the physical socket that is “closest” to your GPUs.

But if your GPUs are not attached to the same socket, then the above will always yield a situation where one of the GPUs is favored.
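
As a complement to the taskset experiment, on Linux the mapping from logical cores to sockets can usually also be read from sysfs. The sketch below is an illustration and not part of the advice above: it assumes the `cpuN/topology/physical_package_id` files exist under `/sys/devices/system/cpu`, and the function name `cpu_to_socket` is hypothetical.

```python
import glob
import os
import re


def cpu_to_socket(sysfs_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each logical CPU number to its physical package (socket) id."""
    mapping = {}
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "topology", "physical_package_id"
    )
    for path in glob.glob(pattern):
        # Extract N from the ".../cpuN/topology/..." path component.
        match = re.search(r"cpu(\d+)[/\\]topology", path)
        if match is None:
            continue
        with open(path) as f:
            mapping[int(match.group(1))] = int(f.read().strip())
    return mapping


if __name__ == "__main__":
    # Prints nothing if the sysfs files are not available on this system.
    for cpu, socket in sorted(cpu_to_socket().items()):
        print(f"logical cpu {cpu} -> socket {socket}")
```

The root directory is a parameter only so that the function can be pointed at a copy of the tree for testing.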

judits · April 27, 2015, 2:19pm

From ‘lstopo’ command, I see the following:
Machine (63GB)
__NUMANode L#0 (P#0 31GB)
____Socket L#0 + L3 L#0 (15MB)
______[…]
____HostBridge L#0
______PCIBridge
________PCI 10de:1024
__________GPU L#0 “card0”
______PCIBridge
________PCI 10de:1024
__________GPU L#1 “card1”
__[…]
__NUMANode L#1 (P#1 31GB)
____Socket L#1 + L3 L#1 (15MB)
______[…]
____HostBridge L#6
______PCIBridge
________PCI 10de:1024
__________GPU L#6 “card2”
______PCIBridge
________PCI 10de:1024
__________GPU L#7 “card3”

Then, I think GPUs #0 and #1 are in the same socket, which is socket #0. Is this correct?

I’m using a custom library to create the threads and pin them to the appropriate core. But I just checked, and both threads are actually bound to cores belonging to socket #1, so this is completely wrong.

Since I’m using several threads and I don’t want to oversubscribe cores, I tried taskset to limit the execution of my application to socket #0 (taskset -c 0-11, as there are 12 cores per socket), instead of taskset -c 0 (it should have the same effect, right?), but I don’t see any memory bandwidth improvement.

Is it possible that the problem lies in the hardware (or some hardware-related issue) rather than in my application? I tried the same application on a similar machine (4 x K40s in the same socket) and the average memory throughput is around 8 GB/s with either 1 or 2 GPUs.

I really appreciate all your help!

njuffa · April 27, 2015, 6:05pm

Not sitting in front of the system, it is difficult to offer much more than speculation.

What kind of system platform is this? What CPUs are being used, what is the motherboard vendor? I am wondering whether there could be a basic hardware limitation, e.g. an insufficient number of PCIe lanes to feed two x16 interfaces per CPU, causing the links to be automatically downgraded when two GPUs are plugged in. Also, there have been issues in the past with various system BIOSes (SBIOS) when multiple Teslas are being used. Is the system running with the latest SBIOS available from the vendor? Have you checked the SBIOS configuration for any signs of possible misconfiguration?

If all efforts of trying to resolve this at the software level fail, you may want to contact your system vendor / system integrator to see if they can give advice on how to optimally configure it for four Teslas in the system.
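
One way an ordinary user might look for a downgraded link, sketched here under several assumptions: the kernel exposes `current_link_width`, `max_link_width` and `current_link_speed` under `/sys/bus/pci/devices/<address>/` (older kernels may not, in which case the list comes back empty), and NVIDIA devices are identified by vendor ID `0x10de`, as in the `10de:1024` entries of the lstopo output. The function name `nvidia_link_status` is hypothetical.

```python
import glob
import os


def _read(path: str) -> str:
    with open(path) as f:
        return f.read().strip()


def nvidia_link_status(pci_root: str = "/sys/bus/pci/devices") -> list:
    """Return (address, current_width, max_width, current_speed) per NVIDIA device."""
    results = []
    for dev in sorted(glob.glob(os.path.join(pci_root, "*"))):
        try:
            if _read(os.path.join(dev, "vendor")).lower() != "0x10de":
                continue
            results.append((
                os.path.basename(dev),
                int(_read(os.path.join(dev, "current_link_width"))),
                int(_read(os.path.join(dev, "max_link_width"))),
                _read(os.path.join(dev, "current_link_speed")),
            ))
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this device
    return results


if __name__ == "__main__":
    for addr, cur, mx, speed in nvidia_link_status():
        note = "  <-- narrower than the device maximum" if cur < mx else ""
        print(f"{addr}: x{cur} of x{mx}, {speed}{note}")
```

A current width below the maximum would be consistent with the lane-shortage hypothesis, though it would not prove it; the system vendor remains the authority on how the slots are wired.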

Robert_Crovella · April 27, 2015, 8:44pm

taskset -c 0-11 may not work the way you expect.

This assumes that the first 12 logical cores belong to the same physical CPU.

But many System BIOSes have settings to select an alternating arrangement:

-c 0 - socket 0 core 0
-c 1 - socket 1 core 0
-c 2 - socket 0 core 1
…

The instructions I gave, if you followed them carefully, should have identified if this were the case experimentally.
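
To illustrate why `0-11` can be the wrong range, the sketch below builds a `taskset -c` CPU list for one socket from a logical-CPU-to-socket mapping. Both mappings are made-up examples of a contiguous and an alternating numbering for 2 sockets with 12 cores each, not values read from the machine in this thread, and `cpu_list_for_socket` is a hypothetical helper.

```python
def cpu_list_for_socket(cpu_to_socket: dict, socket: int) -> str:
    """Return a comma-separated CPU list in the form accepted by `taskset -c`."""
    cpus = sorted(cpu for cpu, s in cpu_to_socket.items() if s == socket)
    return ",".join(str(cpu) for cpu in cpus)


# Hypothetical numbering schemes for 24 logical cores on 2 sockets.
contiguous = {cpu: cpu // 12 for cpu in range(24)}  # CPUs 0-11 on socket 0
alternating = {cpu: cpu % 2 for cpu in range(24)}   # even CPUs on socket 0

if __name__ == "__main__":
    print("contiguous :", cpu_list_for_socket(contiguous, 0))
    print("alternating:", cpu_list_for_socket(alternating, 0))
```

With the alternating scheme, `taskset -c 0-11` would cover six cores on each socket rather than the twelve cores of socket 0.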

judits · April 28, 2015, 8:13am

Oh, I see. I will rerun, making sure I’m using the cores of socket 0. Thanks for the observation!

judits · April 28, 2015, 8:15am

To njuffa:

Ok, I will contact the system administrator of the machine, as I’m just a user and I have no way to check everything you said :-( Your speculations are really welcome, thanks!
