cudaMemcpyDeviceToHost - slow performance using pinned memory

Hi, I was trying to optimize an image processing algorithm on a multi-GPU system consisting of 2 x Tesla K40c. The images are huge (almost 10 GB each) and are subdivided into bursts of roughly 400 MB each. The algorithm can process two bursts in parallel, so each GPU has its own dedicated host thread.
To optimize data transfers between host and device memory I’ve introduced pinned memory: the input host buffer is allocated with cudaHostAlloc() using the flags cudaHostAllocPortable and cudaHostAllocWriteCombined, while the output host buffer is allocated with only the flag cudaHostAllocPortable.
Each of the input/output buffers in host memory is sized to hold two consecutive bursts, and memcopies are always performed on distinct address ranges.
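
For reference, here is a minimal sketch of the allocation scheme described above (the sizes and variable names are placeholders, not the actual code):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    /* Placeholder sizes: each host buffer holds two ~400 MB bursts. */
    const size_t burstBytes = 400ull << 20;
    const size_t bufBytes   = 2 * burstBytes;

    void *hIn = NULL, *hOut = NULL;

    /* Input buffer: portable + write-combined (fast host->device transfers,
       but slow to read back from the host side). */
    cudaError_t err = cudaHostAlloc(&hIn, bufBytes,
                                    cudaHostAllocPortable | cudaHostAllocWriteCombined);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }

    /* Output buffer: portable only, so host reads of the results stay fast. */
    err = cudaHostAlloc(&hOut, bufBytes, cudaHostAllocPortable);
    if (err != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(err)); return 1; }

    /* ... per-GPU host threads: cudaSetDevice(), H2D copies, kernels, D2H copies ... */

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}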

The problem is that the throughput of cudaMemcpyHostToDevice is almost 10 GB/s, as expected, while that of cudaMemcpyDeviceToHost is only 330 MB/s.
I’ve run some tests (in single-GPU mode) using the bandwidthTest application from the NVIDIA samples, and the results are very irregular (sometimes I get 10 GB/s and sometimes 2 GB/s):

************** TEST 1 **************

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Tesla K40c
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10264.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1944.4

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 183856.9

Result = PASS

************** TEST 2 **************

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: Tesla K40c
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10263.0

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10252.9

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 183492.8

Result = PASS


The machine I’m using runs 64-bit Linux. Since it is a NUMA architecture, I’ve tried forcing the kernel to always schedule the threads on the same CPU cores, but I got the same results.
I’ve also checked that the GPUs are mounted in PCIe 3.0 x16 slots.

Have any of you had similar issues, or do you have an idea how to solve this?

Many thanks

I forgot to mention that I’m using CUDA 7.0.

My first idea is that this measurement includes the time required to finish previous operations, in particular kernel executions.

A pinned memcpy is asynchronous, so you measure only its own speed. A non-pinned memcpy is synchronous, so you measure the time required to flush the command queue plus the memcpy itself.
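
To measure just the copy itself, one approach is to drain the queue first and then bracket only the copy with CUDA events, roughly like this (sizes and buffer names are placeholders):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 32 << 20;      /* 32 MB test transfer (placeholder) */
    void *hDst = NULL, *dSrc = NULL;
    cudaHostAlloc(&hDst, bytes, cudaHostAllocPortable);
    cudaMalloc(&dSrc, bytes);

    /* ... kernels launched earlier would be queued here ... */

    cudaDeviceSynchronize();            /* drain previously queued work first */

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpyAsync(hDst, dSrc, bytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         /* wait for the copy alone */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.1f MB/s\n", (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dSrc);
    cudaFreeHost(hDst);
    return 0;
}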

Thanks for the reply.

In general I agree with you, and it is an issue I’ve now fixed (unfortunately with no benefit). In fact, I don’t think it explains such a slowdown in performance. Consider that I currently have no overlapping kernel executions in the same CUDA context, so a synchronous memcpy should only have to wait for the last kernel to finish (a single memcpy of 450 MB at 330 MB/s takes more than 1.3 seconds, which is really too much even if it has to wait for the end of the last kernel!).

Thanks again!

Moreover, consider that, to the best of my knowledge, a synchronous memcpy on pinned memory should perform at least as well as one on buffers allocated via malloc(), and with malloc() buffers I get a throughput of almost 2-3 GB/s.
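
To be concrete, this is the kind of pinned-vs-pageable comparison I mean (a rough sketch with placeholder sizes and names; both copies are synchronous, so they are timed with a host clock):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time a synchronous D2H copy with a host clock and return MB/s. */
static double d2hBandwidth(void *dst, const void *src, size_t bytes)
{
    struct timespec t0, t1;
    cudaDeviceSynchronize();             /* make sure nothing else is pending */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (bytes / (1024.0 * 1024.0)) / s;
}

int main(void)
{
    const size_t bytes = 32 << 20;       /* 32 MB (placeholder) */
    void *dSrc = NULL, *hPinned = NULL, *hPageable = malloc(bytes);
    cudaMalloc(&dSrc, bytes);
    cudaHostAlloc(&hPinned, bytes, cudaHostAllocPortable);

    printf("pageable D2H: %.1f MB/s\n", d2hBandwidth(hPageable, dSrc, bytes));
    printf("pinned   D2H: %.1f MB/s\n", d2hBandwidth(hPinned, dSrc, bytes));

    cudaFree(dSrc);
    cudaFreeHost(hPinned);
    free(hPageable);
    return 0;
}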

I don’t think you’re understanding the comment.

It depends on how you do the timing measurement and how long the kernel runs prior to the device-to-host copy, neither of which you’ve shown or indicated.

OTOH the bandwidthTest result showing a single run with 10 GB/s in one direction and 2 GB/s in the other direction is fairly odd.

My first guess would have been process pinning, but that would only make sense if both measurements were lower, although this is a fairly complex thing to sort out.

If you have a GPU/CPU core affinity issue, you should be able to witness this by using something like:

taskset -c xx ./bandwidthTest

and stepping xx through the available logical cores.

Another possibility that I have seen from time to time is systems where the PCIe link is aggressively power-managed, resulting in lower-than-optimal transfer speeds. It’s not clear to me that this fits the observation either; however, there may be system BIOS settings that can affect this.

Finally, some newer Haswell-based (and later) systems have processor FSB snoop settings in the BIOS that affect the performance of the PCIe link.

I don’t think there is enough information here to make useful comments, other than random speculation.

I’ve profiled the application using NVVP, so the timing measurement does not come from my own code.
The kernel runs for 13 ms before the memcpy call.

However, the taskset suggestion helped me find the CPU cores with the best affinity. Now I get 10 GB/s of throughput in both directions!

Thank you very very much!