cudaMemcpyDeviceToHost - slow performance using pinned memory

Hi, I was trying to optimize an image processing algorithm on a multi-GPU system consisting of 2 x Tesla K40c. The images are huge (almost 10 GB each) and are subdivided into bursts of roughly 400 MB each. The algorithm can process two bursts in parallel, so each GPU has its own dedicated host thread.
To optimize data transfers between host and device memory I’ve introduced pinned memory: the input host buffer is allocated with cudaHostAlloc() using the flags cudaHostAllocPortable and cudaHostAllocWriteCombined, while the output host buffer is allocated with only the flag cudaHostAllocPortable.
Each of the input/output buffers in host memory is sized to hold two consecutive bursts, and memcopies are always performed on distinct address ranges.
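
For reference, here is a minimal sketch of the allocation scheme described above (the sizes and variable names are placeholders, not the actual code):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Placeholder sizes: each host buffer holds two ~400 MB bursts. */
    const size_t burstBytes = 400ull << 20;
    const size_t bufBytes   = 2 * burstBytes;

    void *hIn = NULL, *hOut = NULL;

    /* Input buffer: portable + write-combined (fast host->device transfers,
       but slow to read back from the host side). */
    cudaError_t err = cudaHostAlloc(&hIn, bufBytes,
                                    cudaHostAllocPortable | cudaHostAllocWriteCombined);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }

    /* Output buffer: portable only, so host reads of the results stay fast. */
    err = cudaHostAlloc(&hOut, bufBytes, cudaHostAllocPortable);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }

    /* ... per-GPU host threads: cudaSetDevice(), H2D copies, kernels, D2H copies ... */

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}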

The problem is that the throughput of cudaMemcpyHostToDevice is almost 10 GB/s, as expected, while that of cudaMemcpyDeviceToHost is only 330 MB/s.
I’ve run some tests (in single-GPU mode) using the bandwidthTest application from the NVIDIA samples, and the results are very irregular (sometimes I get 10 GB/s and sometimes 2 GB/s):

************** TEST 1 **************

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Tesla K40c
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10264.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1944.4

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 183856.9

Result = PASS

************** TEST 2 **************

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Tesla K40c
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10263.0

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10252.9

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 183492.8

Result = PASS


The machine I’m using runs 64-bit Linux. Since it is a NUMA architecture, I’ve tried forcing the kernel to always schedule the threads on the same CPU cores, but I got the same results.
I’ve also checked that the GPUs are mounted in PCIe 3.0 x16 slots.

Have any of you had similar issues, or do you have an idea how to solve this?

Many thanks

I forgot to mention that I’m using CUDA 7.0.

My first idea is that this measurement includes the time required to finish previous operations, in particular kernel executions.

A pinned memcpy is asynchronous, so you measure only its own speed. A non-pinned memcpy is synchronous, so you measure the time required to flush the command queue plus the memcpy itself.
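
To measure just the copy itself, one approach is to drain the queue first and then bracket only the copy with CUDA events, roughly like this (sizes and buffer names are placeholders):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 32 << 20;      /* 32 MB test transfer (placeholder) */
    void *hDst = NULL, *dSrc = NULL;
    cudaHostAlloc(&hDst, bytes, cudaHostAllocPortable);
    cudaMalloc(&dSrc, bytes);

    /* ... kernels launched earlier would be queued here ... */

    cudaDeviceSynchronize();            /* drain previously queued work first */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpyAsync(hDst, dSrc, bytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         /* wait for the copy alone */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dSrc);
    cudaFreeHost(hDst);
    return 0;
}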

Thanks for the reply.

In general I agree with you, and it is an issue I’ve now fixed (unfortunately with no benefit). In fact, I don’t think it explains such a slowdown in performance. Consider that I currently have no overlapping kernel executions in the same CUDA context, so a synchronous memcpy should only have to wait for the last kernel to finish (a single memcpy of 450 MB at 330 MB/s takes more than 1.3 seconds, which is really too much even if it has to wait for the end of the last kernel!).

Thanks again!

Moreover, consider that, to the best of my knowledge, a synchronous memcpy on pinned memory should perform at least as well as one on buffers allocated via malloc(), and with malloc() buffers I get a throughput of almost 2-3 GB/s.
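
To be concrete, this is the kind of pinned-vs-pageable comparison I mean (a rough sketch with placeholder sizes and names; both copies are synchronous, so they are timed with a host clock):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time a synchronous D2H copy with a host clock and return MB/s. */
static double d2hBandwidth(void *dst, const void *src, size_t bytes)
{
    struct timespec t0, t1;
    cudaDeviceSynchronize();             /* make sure nothing else is pending */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (bytes / (1024.0 * 1024.0)) / s;
}

int main(void)
{
    const size_t bytes = 32 << 20;       /* 32 MB (placeholder) */
    void *dSrc = NULL, *hPinned = NULL, *hPageable = malloc(bytes);
    cudaMalloc(&dSrc, bytes);
    cudaHostAlloc(&hPinned, bytes, cudaHostAllocPortable);

    printf("pageable D2H: %.1f MB/s\n", d2hBandwidth(hPageable, dSrc, bytes));
    printf("pinned   D2H: %.1f MB/s\n", d2hBandwidth(hPinned, dSrc, bytes));

    cudaFree(dSrc);
    cudaFreeHost(hPinned);
    free(hPageable);
    return 0;
}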

I don’t think you’re understanding the comment.

It depends on how you do the timing measurement and how long the kernel runs prior to the device-to-host copy, neither of which you’ve shown or indicated.

OTOH the bandwidthTest result showing a single run with 10 GB/s in one direction and 2 GB/s in the other direction is fairly odd.

My first guess would have been process pinning, but that would only make sense if both measurements were lower, although this is a fairly complex thing to sort out.

If you have a GPU/CPU core affinity issue, you should be able to witness this by using something like:

taskset -c xx ./bandwidthTest

and stepping xx through the available logical cores.

Another possibility that I have seen from time to time is systems where the PCIe link is aggressively power-managed, resulting in lower-than-optimal transfer speeds. It’s not clear to me that this fits the observation either; however, there may be system BIOS settings that can affect this.

Finally, some newer Haswell-based (and later) systems have processor FSB snoop settings in the BIOS that affect the performance of the PCIe link.

I don’t think there is enough information here to make useful comments, other than random speculation.

I’ve profiled the application using NVVP, so the timing measurement does not come from my own code.
The kernel runs for 13 ms before the memcpy call.

However, the taskset suggestion helped me find the CPU cores with the best affinity. Now I get 10 GB/s of throughput in both directions!

Thank you very very much!