Memory throughput problem between host and device with pinned memory


I am running a tiled Cholesky application on a Linux system with 4 x K40 and I am experiencing really low memory throughput of host <–> device transfers, even using pinned memory with cudaMemcpyAsync and streams.

I use 3 streams for data transfers (one for D2H, one for H2D transfers and one for D2D transfers) and several other streams for kernel launches.

For every kernel, the data chunks it needs are first transferred asynchronously to the GPU, and the kernel is launched once those transfers have finished. All operations are asynchronous and I check for their completion with events, so I can overlap data transfers with computation.

Launching the application with nvprof and later importing the output file to nvvp, I see the following:

  • When running with only 1 GPU, everything works as expected: kernels overlap with data transfers, and the average throughput for host <--> device data transfers is around 10 GB/s for pinned chunks of 33.5 MB.
  • When splitting the computation across 2 GPUs (I keep the same number of kernels, but additional device-to-device transfers must be issued; they are issued asynchronously in a separate stream): data transfers still overlap with kernels, but the throughput for host <--> device data transfers drops to only 300 MB/s on average for pinned chunks of 33.5 MB. In this case, some data transfers (H2D, D2H or D2D) still reach around 8 GB/s, which is what I expect, but most transfers get less than 400 MB/s. nvvp reports the memory as pinned and shows the transfers in a stream other than 0, so I have no idea why this happens.
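
To put the two figures side by side, here is a rough back-of-the-envelope sketch of what they mean per chunk. It only uses the numbers quoted above (33.5 MB chunks, about 10 GB/s versus about 300 MB/s) and assumes decimal units (1 GB = 1000 MB); the helper name transfer_time_ms is made up for illustration.

```python
def transfer_time_ms(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to move size_mb megabytes at the given throughput, in milliseconds."""
    return size_mb / throughput_mb_per_s * 1000.0


CHUNK_MB = 33.5  # chunk size quoted in the post

# Assumption: decimal units, so 10 GB/s is taken as 10,000 MB/s.
fast = transfer_time_ms(CHUNK_MB, 10_000.0)  # ~10 GB/s, the 1-GPU case
slow = transfer_time_ms(CHUNK_MB, 300.0)     # ~300 MB/s, the 2-GPU case

print(f"10 GB/s : {fast:.2f} ms per chunk")
print(f"300 MB/s: {slow:.2f} ms per chunk")
print(f"slowdown: {slow / fast:.1f}x")
```

So each chunk goes from a few milliseconds to over a hundred, roughly a 33x slowdown, which is far more than sharing one link between two GPUs would explain by itself.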

Any ideas about this throughput slowdown? Does using 2 GPUs have some bad influence on memory bandwidth?

Please, let me know if you need further information, it’s my first post… :-)

Thanks in advance!

“but the throughput for host <–> device data transfers slows down to only 300 MB/s on average for pinned chunks of 33.5 MB”

how do you know this?

perhaps show the profiler output chart showing the time-lines

nvvp tells me the throughput of each transfer as I pass the mouse over them. Also, I can click on the timeline to see the average transfer throughput, the number of transfers, …

I’ll try to attach a couple of screenshots for both 1 GPU and 2 GPU cases.

i probably need to think more on this, but a few dabs so long:

are all pci-slots on the motherboard really the same? (i somewhat doubt it; i am mainly referring to the number of lanes per slot)

how many host threads do you use when you switch from 1 gpu to 2?

how many contexts does the profiler report for the 2 gpu case? 1 or 2?

There may be a system topology issue. Is this a dual socket server?

As you have 8 devices in your system, I assume you are using more than 8 streams then? Have you set CUDA_DEVICE_MAX_CONNECTIONS to the total number of streams you are using?

To little_jimmy:

  • Yes, all PCI slots are the same.
  • Every GPU has a “dedicated” host thread, so in the case of 2 GPUs, I’m using 2 host threads, each one managing their own GPU.
  • The profiler reports 2 contexts, one for each GPU.

To txbob:

  • Yes, the machine has 2 sockets. The 2 GPUs I’m using (#0 and #1) are attached to the same socket, and I have called cudaDeviceEnablePeerAccess() for both.

To tera:

  • I’m using 3 streams for data transfers only:
    • 1 stream for HtD
    • 1 stream for DtH
    • 1 stream for PtoP
      And 8 additional streams for kernels.
  • I’m not using CUDA_DEVICE_MAX_CONNECTIONS, I will try and get back to you.

Many thanks to all of you for your help!

Are you using numactl to control CPU and memory affinity such that each GPU always communicates with the “near” CPU and memory? This is important in dual socket systems, as NUMA effects can be quite pronounced.

Agree with njuffa. I suspect a topology issue. You can use numactl to pin your processes to sockets that are topologically “near” to the GPUs you want to use. You can also use taskset, which I find to be simpler semantically to quickly get a read on this.

If your GPUs 0 and 1 are not actually connected to the same socket, then even this won’t sort it out for you - you’ll need to be pretty confident of your understanding of your system topology. If you are absolutely certain they are connected to the same socket, then run your app with:

taskset -c 0 ./my_app

and modify the 0 above to different values up to the CPU core count of your system. You will find the mapping of logical cores to the physical socket that is “closest” to your GPUs.

But if your GPUs are not attached to the same socket, then the above will always yield a situation where one of the GPUs is favored.

From ‘lstopo’ command, I see the following:
Machine (63GB)
__NUMANode L#0 (P#0 31GB)
____Socket L#0 + L3 L#0 (15MB)
____HostBridge L#0
________PCI 10de:1024
__________GPU L#0 “card0”
________PCI 10de:1024
__________GPU L#1 “card1”
__NUMANode L#1 (P#1 31GB)
____Socket L#1 + L3 L#1 (15MB)
____HostBridge L#6
________PCI 10de:1024
__________GPU L#6 “card2”
________PCI 10de:1024
__________GPU L#7 “card3”

Then, I think GPUs #0 and #1 are in the same socket, which is socket #0. Is this correct?

I’m using a custom library to create the threads and pin them to the appropriate core. But I just checked, and both threads are actually bound to cores belonging to socket #1, so this is completely wrong.
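
For anyone wanting to run the same kind of check, here is a minimal sketch of how it could be done from Python on Linux. It is not the custom library mentioned above. The core range 0-11 for socket 0 is only an assumption and has to be verified on the real machine, and the helper name cores_off_socket is made up for illustration.

```python
import os


def cores_off_socket(allowed, socket_cores):
    """Return the allowed cores that are NOT part of the desired socket."""
    return sorted(set(allowed) - set(socket_cores))


# Hypothetical: assumes socket 0 owns logical cores 0-11.
# The real numbering depends on the machine and its BIOS settings.
SOCKET0_CORES = range(0, 12)

if hasattr(os, "sched_getaffinity"):  # only available on some platforms (e.g. Linux)
    allowed = os.sched_getaffinity(0)  # 0 means "the caller"
    stray = cores_off_socket(allowed, SOCKET0_CORES)
    print("allowed cores:", sorted(allowed))
    print("cores outside the assumed socket 0:", stray)
```

An empty second list would mean the caller is restricted to the cores assumed to belong to socket 0; anything else means it can still be scheduled on the far socket.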

Since I’m using several threads and I don’t want to oversubscribe cores, I tried taskset to limit the execution of my application to socket #0 (taskset -c 0-11, as there are 12 cores per socket), instead of taskset -c 0 (it should have the same effect, right?), but I don’t see any memory bandwidth improvement.

Is it possible that the problem lies in the hardware (or something hardware-related) rather than in my application? I tried the same application on a similar machine (4 x K40s in the same socket) and the average memory throughput is around 8 GB/s with either 1 or 2 GPUs.

I really appreciate all your help!

Not sitting in front of the system, it is difficult to offer much more than speculation.

What kind of system platform is this? What CPUs are being used, and who is the motherboard vendor? I am wondering whether there could be a basic hardware limitation, e.g. an insufficient number of PCIe lanes to feed two x16 interfaces per CPU, causing the links to be automatically downgraded when two GPUs are plugged in. Also, there have been issues in the past with various system BIOSes (SBIOS) when multiple Teslas are being used. Is the system running the latest SBIOS available from the vendor? Have you checked the SBIOS configuration for any signs of possible misconfiguration?
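
One way to look for a downgraded link without root access might be to read the PCIe link attributes that Linux exposes in sysfs. The sketch below assumes a reasonably recent kernel that provides current_link_width, max_link_width, current_link_speed and max_link_speed under /sys/bus/pci/devices; the function names are made up for illustration. As far as I know, the reported speed can drop while a GPU is idle, so the width is probably the more telling value.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # same vendor id as the 10de:1024 devices in lstopo


def read_attr(dev: Path, name: str) -> str:
    """Read one sysfs attribute, returning '?' if it is missing or unreadable."""
    try:
        return (dev / name).read_text().strip()
    except OSError:
        return "?"


def nvidia_link_report(root="/sys/bus/pci/devices"):
    """Collect current vs. maximum PCIe link settings for each NVIDIA device."""
    report = []
    base = Path(root)
    if not base.is_dir():  # not Linux, or sysfs not mounted
        return report
    for dev in sorted(base.iterdir()):
        if read_attr(dev, "vendor") != NVIDIA_VENDOR_ID:
            continue
        report.append({
            "address": dev.name,
            "current_width": read_attr(dev, "current_link_width"),
            "max_width": read_attr(dev, "max_link_width"),
            "current_speed": read_attr(dev, "current_link_speed"),
            "max_speed": read_attr(dev, "max_link_speed"),
        })
    return report


for entry in nvidia_link_report():
    print(entry)
```

A device whose current width is below its maximum width (say 8 against 16) would be consistent with the lane shortage described above, although it would not prove it.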

If all efforts of trying to resolve this at the software level fail, you may want to contact your system vendor / system integrator to see if they can give advice on how to optimally configure it for four Teslas in the system.

taskset -c 0-11 may not work the way you expect.

this assumes that the first 12 logical cores belong to the same physical CPU.

But many System BIOSes have settings to select an alternating arrangement:

-c 0 - socket 0 core 0
-c 1 - socket 1 core 0
-c 2 - socket 0 core 1
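
Instead of guessing the numbering, the mapping can also be read from sysfs on Linux. This is only a sketch: it assumes the usual /sys/devices/system/cpu/cpuN/topology/physical_package_id files are present, the function names are made up, and ./my_app is the same placeholder as in the taskset command earlier in the thread.

```python
from collections import defaultdict
from pathlib import Path


def cpus_by_socket(root="/sys/devices/system/cpu"):
    """Map each physical package (socket) id to its sorted logical CPU numbers."""
    sockets = defaultdict(list)
    # "cpu[0-9]*" matches cpu0, cpu17, ... but not cpufreq or cpuidle.
    for pkg_file in Path(root).glob("cpu[0-9]*/topology/physical_package_id"):
        cpu = int(pkg_file.parent.parent.name[3:])  # "cpu17" -> 17
        sockets[int(pkg_file.read_text())].append(cpu)
    return {sock: sorted(cpus) for sock, cpus in sorted(sockets.items())}


def taskset_list(cpus):
    """Format CPU numbers as the comma-separated list that taskset -c accepts."""
    return ",".join(str(c) for c in cpus)


for sock, cpus in cpus_by_socket().items():
    print(f"socket {sock}: taskset -c {taskset_list(cpus)} ./my_app")
```

With the alternating arrangement, socket 0 would show up as 0,2,4,... rather than 0-11, which is exactly the case where taskset -c 0-11 spans both sockets.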

The instructions I gave, if you followed them carefully, should have identified if this were the case experimentally.

Oh, I see. I will rerun, making sure I’m using the cores of socket 0. Thanks for the observation!

Ok, I will contact the system administrator of the machine, as I’m just a user and I have no way to check everything you said :-( Your speculations are really welcome, thanks!