State-of-the-art GPU -> CPU data transfer techniques

Hi all,

I was wondering if someone could point me to publications or other sources covering the most up-to-date GPU → CPU data transfer techniques currently available.

Is there anything else besides pinned memory and hiding latency by overlapping data transfers with computation?

Thank you so much

Are you tackling some specific use case, or is this a generic question?

Use of pinned host memory will maximize the speed of the physical data transfer, while overlapping copies with kernel execution (and overlapping upstream with downstream traffic; recall that PCIe is a full-duplex interconnect!) will hide the latency. In the ideal case, the copies are entirely overlapped with kernel work through appropriate use of streams, and I have seen multiple real-life applications that came close to that, meaning host/device data transport was a non-issue.
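To make the overlap concrete, here is a minimal sketch (the kernel, buffer names, and sizes are placeholders, not taken from any of the applications mentioned): one stream runs the kernel on one device buffer while another stream drains a previously filled device buffer into pinned host memory.

```
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real computation.
__global__ void process_chunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 24;
    const size_t bytes = N * sizeof(float);

    float *h_buf;                               // pinned (page-locked) host buffer
    cudaMallocHost((void **)&h_buf, bytes);     // required for truly asynchronous copies

    float *d_a, *d_b;                           // d_b holds results from a previous step (fill omitted)
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // The kernel works on d_a in one stream while d_b drains to the host
    // in another stream; the two operations overlap on the hardware.
    process_chunk<<<(N + 255) / 256, 256, 0, compute>>>(d_a, N);
    cudaMemcpyAsync(h_buf, d_b, bytes, cudaMemcpyDeviceToHost, copy);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(copy);

    cudaFreeHost(h_buf);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    return 0;
}
```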

For a practical use case, you would want to make sure that you

(1) are using the fastest interconnect you can afford (currently that would be NVlink on a P100 at up to 25 GB/sec in each direction, followed by PCIe gen3 x16, which provides 12 GB/sec in each direction)

(2) use a GPU with dual DMA engines if you have bi-directional traffic

(3) have the highest system memory performance available (around 60 GB/sec), in particular if you use two GPUs on CPUs with 40 PCIe lanes (going full speed in both directions, two GPUs with PCIe gen 3 x16 interfaces can saturate 50 GB/sec of system memory bandwidth)

(4) transfer data in as large chunks as is feasible, to minimize the fixed overhead of packetized data transport on the physical links
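As a rough illustration of item 4 (all names and sizes made up): if the individual results live in one contiguous device allocation, a single large copy replaces many small ones and the fixed per-transfer overhead is paid only once.

```
#include <cuda_runtime.h>

int main()
{
    const int num_arrays = 64;            // e.g. 64 small result arrays
    const size_t elems_per_array = 4096;  // 16 KB each
    const size_t bytes = num_arrays * elems_per_array * sizeof(float);

    float *d_results;                             // one contiguous device allocation;
    cudaMalloc((void **)&d_results, bytes);       // kernels write into slices of it (omitted)

    float *h_results;
    cudaMallocHost((void **)&h_results, bytes);   // pinned host staging buffer

    cudaStream_t s;
    cudaStreamCreate(&s);

    // One 1 MB transfer instead of 64 transfers of 16 KB each, so the fixed
    // per-transfer overhead is paid only once.
    cudaMemcpyAsync(h_results, d_results, bytes, cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    cudaFreeHost(h_results);
    cudaFree(d_results);
    cudaStreamDestroy(s);
    return 0;
}
```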

Just to add some notes to item 1 from njuffa:

NVLink is a “new” interconnect which is currently only available on NVIDIA P100 GPUs as well as certain versions of the IBM Power8 CPU.

NVLink is implemented (in both of those devices) as four “bricks” or “links”. A “brick” or “link” provides an aggregate peak theoretical bandwidth of 20GB/s in each direction, simultaneously.

From a hardware design perspective, NVLink is a point-to-point (i.e. not multidrop) bus that is hardware-design configurable to offer 1-4 links between any two points. Since the links are independent, I can have one device that has 4 links, with a hardware design that routes two links to another device A and two links to another device B. There are many other hardware design options possible.

njuffa mentioned 25 GB/s; this is reasonable based on another posting I made. However:

  1. That particular number is representative of 2 links, not 4. (due to HW design of IBM S822LC for HPC)
  2. For reasons not clear to me at the moment, that particular number is a little low. It should be about 32 GB/s per direction, simultaneously. We regularly witness these levels; there was something about the particular box I was running on that wasn’t working at full speed. The expected 32 GB/s figure comes about as a consequence of various overheads applied to the 40 GB/s peak theoretical rate for 2 links, i.e. roughly 80% efficiency, which is what we typically see on such a system.

I’d rather that the number 25GB/s not get needlessly propagated, and I guess it was my bad for posting that number. I was attempting to show that the software stack was working at that point, not that it was representative of a fully correct system.

It is more of a generic question about how, after a simulation that has been running on the GPU and fits entirely in GPU memory, the data can be moved back to an SSD drive.

So, ideally one would use:

  1. Pinned host memory to maximize the speed
  2. Asynchronous data transfers via streams

What about double buffering? Could double buffering help in these cases?

@txbob Thanks for the clarifications regarding NVlink.

I did not mean to put NVlink performance in a bad light, but due to the absence of published real-life performance data I used the measured performance data available from the forum post, fully realizing it might represent some sort of lower bound. I should have made that clear. I mainly wanted to avoid creating unrealistic performance expectations as to what can be achieved (better to understate and have the HW overdeliver, than the other way around).

In the future, I will point people at post #3 here if questions about NVlink performance arise. Maybe NVIDIA marketing could consider putting out actual performance numbers as opposed to advertising app-level speedups resulting from replacing PCIe with NVlink, which do not mean much technically. I don’t see how being secretive helps NVIDIA in this regard.

@luiceur If I understand the use case correctly, ping-ponging between two output buffers should do the trick: while the simulation fills result buffer A, previously filled result buffer B is transferred back to the host and written to the SSD. Then switch the roles of the buffers for the next step, rinse and repeat. As long as kernels execute in one non-null CUDA stream and the copies back to the host are issued in a different non-null CUDA stream, you should get perfect overlap.
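Something along these lines, as a rough sketch; the simulation kernel, buffer sizes, and step count are placeholders:

```
#include <cuda_runtime.h>

// Stand-in for the real simulation step.
__global__ void simulate_step(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 0.5f;
}

int main()
{
    const int N = 1 << 22;
    const size_t bytes = N * sizeof(float);
    const int steps = 10;

    float *d_buf[2], *h_buf[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void **)&d_buf[b], bytes);
        cudaMallocHost((void **)&h_buf[b], bytes);   // pinned, so the async copies really overlap
    }

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    for (int step = 0; step < steps; ++step) {
        int cur  = step & 1;      // buffer being filled this step
        int prev = cur ^ 1;       // buffer filled during the previous step

        // Fill buffer 'cur' while the previous result drains to the host.
        simulate_step<<<(N + 255) / 256, 256, 0, compute>>>(d_buf[cur], N);
        if (step > 0)
            cudaMemcpyAsync(h_buf[prev], d_buf[prev], bytes,
                            cudaMemcpyDeviceToHost, copy);

        // Both streams must finish before the roles swap, otherwise the next
        // kernel could overwrite a buffer that is still being copied.
        cudaStreamSynchronize(compute);
        cudaStreamSynchronize(copy);
        // ... h_buf[prev] can now be handed to a writer thread for the SSD ...
    }

    // Drain the buffer produced by the final step.
    int last = (steps - 1) & 1;
    cudaMemcpy(h_buf[last], d_buf[last], bytes, cudaMemcpyDeviceToHost);

    for (int b = 0; b < 2; ++b) {
        cudaFree(d_buf[b]);
        cudaFreeHost(h_buf[b]);
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    return 0;
}
```

The per-step synchronization is what keeps the next kernel launch from overwriting a buffer that is still in flight; CUDA events could be used instead for finer-grained control.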

The write speed of the SSD seems like it could be the biggest potential bottleneck for this use case , so probably best to use an enterprise-level model with write speeds in the GB/sec range.

If PCIe transfer speeds are a bottleneck, it may also be possible to reorganize the transferred data to reduce how much has to be moved in the first place.

Changing data representation from doubles to floats or even fp16 may help, even if you have to recast them back to wider formats for computation.
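For example (just a sketch; the kernel name and sizes are made up), the narrowing can be done on the GPU right before the copy so only half-sized data crosses the bus:

```
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Narrow float results to fp16 on the device so only half the bytes cross the bus.
__global__ void pack_to_half(const float *in, __half *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __float2half(in[i]);   // 4 bytes -> 2 bytes per element
}

int main()
{
    const int N = 1 << 24;

    float *d_results;
    __half *d_half, *h_half;
    cudaMalloc((void **)&d_results, N * sizeof(float));
    cudaMalloc((void **)&d_half, N * sizeof(__half));
    cudaMallocHost((void **)&h_half, N * sizeof(__half));   // pinned host buffer

    // ... simulation fills d_results (omitted) ...

    pack_to_half<<<(N + 255) / 256, 256>>>(d_results, d_half, N);
    cudaMemcpy(h_half, d_half, N * sizeof(__half), cudaMemcpyDeviceToHost);

    // On the host, __half2float() can widen the values again wherever full
    // float precision is needed downstream.

    cudaFree(d_results);
    cudaFree(d_half);
    cudaFreeHost(h_half);
    return 0;
}
```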

Compressing and decompressing data on either end may be worthwhile… even very simple run-length encoding is fast on both the CPU and the GPU, and it is especially good at shrinking many data types, like matrices with large regions of zero or constant values.
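A bare-bones host-side encoder might look like the following sketch; a real implementation would also need a matching decoder on the other end and should handle the pathological case where incompressible data grows.

```
#include <cstddef>
#include <vector>

struct RlePair { float value; size_t count; };

// Collapse consecutive runs of equal values into (value, run length) pairs.
std::vector<RlePair> rle_encode(const float *data, size_t n)
{
    std::vector<RlePair> out;
    size_t i = 0;
    while (i < n) {
        float v = data[i];
        size_t run = 1;
        while (i + run < n && data[i + run] == v) ++run;   // count the run
        out.push_back({v, run});
        i += run;
    }
    return out;   // a matrix dominated by zeros collapses to a handful of pairs
}
```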

Another general strategy is to avoid re-sending data the CPU or GPU already knows. For example, if a GPU modifies a large data set that has to go back to the CPU, the data could be flagged in coarse “blocks” according to whether each block has been modified or not, and only the modified blocks get sent back to the CPU.
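A coarse version of that idea might look like the sketch below (block size and the modification pattern are made up); the flag array is tiny, so fetching it first costs almost nothing.

```
#include <cuda_runtime.h>

#define BLOCK_ELEMS 4096   // granularity of the dirty flags (made up)

// Placeholder kernel: only the first eighth of the data set gets modified.
__global__ void modify_some(float *data, unsigned char *dirty, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 8) {
        data[i] += 1.0f;
        dirty[i / BLOCK_ELEMS] = 1;   // mark the enclosing block (benign race: all writers store 1)
    }
}

int main()
{
    const int N = 1 << 24;
    const int nblocks = (N + BLOCK_ELEMS - 1) / BLOCK_ELEMS;

    float *d_data, *h_data;
    unsigned char *d_dirty, *h_dirty;
    cudaMalloc((void **)&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));
    cudaMalloc((void **)&d_dirty, nblocks);
    cudaMemset(d_dirty, 0, nblocks);
    cudaMallocHost((void **)&h_data, N * sizeof(float));
    cudaMallocHost((void **)&h_dirty, nblocks);

    modify_some<<<(N + 255) / 256, 256>>>(d_data, d_dirty, N);

    // Fetch the small flag array first, then copy back only the modified blocks.
    cudaMemcpy(h_dirty, d_dirty, nblocks, cudaMemcpyDeviceToHost);
    for (int b = 0; b < nblocks; ++b) {
        if (h_dirty[b]) {
            size_t off = (size_t)b * BLOCK_ELEMS;
            size_t cnt = (off + BLOCK_ELEMS <= (size_t)N) ? BLOCK_ELEMS : N - off;
            cudaMemcpy(h_data + off, d_data + off,
                       cnt * sizeof(float), cudaMemcpyDeviceToHost);
        }
    }

    cudaFree(d_data);
    cudaFree(d_dirty);
    cudaFreeHost(h_data);
    cudaFreeHost(h_dirty);
    return 0;
}
```

The blocks should be chosen large enough that the fixed per-transfer overhead mentioned earlier stays negligible relative to the copy itself.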

Each one of these strategies involves the annoyance of extra coding on both the CPU and GPU sides, but if you’re truly limited by transfer speed, they may be useful approaches to reduce or even eliminate the bottleneck by replacing it with compute.

Many of these same techniques can also be useful when you are limited by SSD drive speeds, simply by reducing the amount of data that must be transferred to and from disk.