Highly variable memcpyAsync bandwidth on Tesla C2050 (pinned memory, async memcpy)

Hi, we’re seeing a really large variance in the bandwidth of memcpyAsync transfers on Tesla C2050 (CUDA 4.0, Ubuntu 10.10).

The application copies data between pinned host memory and the GPU over PCIe in both directions simultaneously, and the data rate of each transfer varies between 2 and 5.8 GB/sec. Typically, the first and the last transfer execute at 5-6 GB/sec and the rest much lower.

The same application on a different machine (GTX 580, Windows) shows a steady data rate (5.8 GB/sec), but the GTX 580 only has a single DMA engine, so the comparison is not entirely fair.
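For what it’s worth, the number of copy (DMA) engines can be queried at runtime via cudaDeviceProp; a tiny check like this should print 2 on the C2050 and 1 on the GTX 580:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // Number of asynchronous copy engines: 2 on Tesla C2050/C2070, 1 on GTX 580
        printf("asyncEngineCount = %d\n", prop.asyncEngineCount);
        return 0;
    }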

What could be the problem, and how can we fix it?

This is outside my area of expertise, but I am wondering whether the host system’s memory subsystem provides enough memory bandwidth to serve two PCIe streams of 5.8 GB/sec in addition to the memory bandwidth demands of the host code. What kind of host system is this? With a dual CPU setup I could imagine that NUMA properties may come into play, i.e. DMA transfer speed may differ depending on whether the transfer is to/from the memory coupled to the “near” or the “far” CPU.
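(On Linux, running numactl --hardware will show how many NUMA nodes the machine has and which memory is attached to which CPU, in case you want to rule that out.)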

By the way, are all the transfers of roughly the same size? Below a certain transfer size the steady-state transfer rate tends to fall below 5.8 GB/sec, which I would attribute to per-transfer overheads.
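As a rough model (made-up numbers, just to illustrate): effective rate ≈ transfer size / (fixed per-transfer overhead + transfer size / peak rate). Assuming something like 10 us of overhead per copy and a ~6 GB/sec peak, a 256 KB copy would come out at roughly 4.9 GB/sec, while anything in the multi-megabyte range stays essentially at peak.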

How can I check that?

Btw, we’ve got a relatively recent PC (2010) and the config is simple (1x CPU, 1x GPU):

1x AMD Phenom™ II X4 965 Processor

Asus M4A785TD-V EVO (PCIe 2.0 supported)

1x Tesla C2050

If I recall correctly, only the first H2D and the last D2H transfer (which are not overlapped) reach 5.8 GB/sec; everything in between runs at around 2-3 GB/sec. The data size is always the same. I played with different sizes from 1 to 32 MB, and that doesn’t seem to have much effect on the variance.

You could run the STREAM benchmark to determine the host’s system memory bandwidth. Some information I found on the internet suggests that your setup may be peaking at about 13 GB/sec of achievable system memory bandwidth, which would seem uncomfortably close to the bandwidth required to keep just the bi-directional PCIe DMA traffic going at 5.8 GB/sec per direction. As for the faster transfers at the start and end of the test, I would assume that is simply due to the fact that traffic is unidirectional at the “edges” of the test.

As I said, this is outside my area of expertise. I know from personal experience that the 5.8 GB/sec you are seeing for unidirectional traffic is in the expected range for a PCIe 2.0 based setup, but I have no experience with measuring bi-directional traffic.
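If you want a quick-and-dirty number before hunting down the real STREAM sources, a simple triad-style loop over large arrays gives a rough lower bound on sustainable host memory bandwidth. A minimal sketch (array size and repeat count are arbitrary, and this is single-threaded, so an actual multi-threaded STREAM run will usually report somewhat more):

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    int main() {
        const size_t n = 32 * 1024 * 1024;                 // 32M doubles = 256 MB per array
        double *a = (double*)malloc(n * sizeof(double));
        double *b = (double*)malloc(n * sizeof(double));
        double *c = (double*)malloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        const int reps = 10;
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++)
                a[i] = b[i] + 3.0 * c[i];                  // STREAM-style "triad"
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        // Three arrays are touched per element: read b, read c, write a.
        double bytes = (double)reps * 3.0 * n * sizeof(double);
        printf("checksum %f, ~%.2f GB/sec\n", a[n / 2], bytes / sec / 1e9);
        free(a); free(b); free(c);
        return 0;
    }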

Thanks for your input, njuffa. I’ll have a look at the benchmark - any ideas where to find a STREAM version for CUDA?

Here’s my mini test for simultaneous bi-directional transfers (attached). Could you (or any volunteers :) please run it?

Please compile it with the flag STREAMS set to:
a) STREAMS 0 (uni-directional transfers, no overlap)
b) STREAMS 1 (simultaneous bi-dir transfers, overlap of H2D and D2H expected if supported by the GPU)
and run it through the profiler. Please post your avg. memcpy time per transfer.
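For anyone who doesn’t want to open the attachment, the STREAMS=1 case boils down to the following pattern (a trimmed-down sketch, not the attached file verbatim; buffer names, chunk count and sizes are just placeholders):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const int    numChunks  = 4;
        const size_t chunkBytes = 4 * 1024 * 1024;   // 4 MB per transfer, as in the test
        const size_t totalBytes = numChunks * chunkBytes;

        char *h_in, *h_out, *d_in, *d_out;
        cudaHostAlloc((void**)&h_in,  totalBytes, cudaHostAllocDefault);  // pinned host memory
        cudaHostAlloc((void**)&h_out, totalBytes, cudaHostAllocDefault);
        cudaMalloc((void**)&d_in,  totalBytes);
        cudaMalloc((void**)&d_out, totalBytes);

        cudaStream_t upStream, downStream;
        cudaStreamCreate(&upStream);
        cudaStreamCreate(&downStream);

        for (int i = 0; i < numChunks; i++) {
            size_t off = (size_t)i * chunkBytes;
            // H2D copies go into one stream, D2H copies into the other, so the two
            // copy engines on a C2050 can run them concurrently.
            cudaMemcpyAsync(d_in + off,  h_in + off,  chunkBytes, cudaMemcpyHostToDevice, upStream);
            cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes, cudaMemcpyDeviceToHost, downStream);
        }
        cudaStreamSynchronize(upStream);
        cudaStreamSynchronize(downStream);

        printf("copied %zu bytes in each direction\n", totalBytes);
        return 0;
    }

In the STREAMS=0 case the same copies end up serialized (issued one after the other into a single stream), which is what case (a) measures.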

Results:
System: Tesla C2050 GPU (2 DMA engines) + PCIe 2.0
a) ~700 us per 4 MB transfer (~5.7 GB/sec)
b) ~1200 us per 4 MB transfer (~3.3 GB/sec)
dmatest.cu (7.1 KB)

I changed the test app slightly to get reliable results, and tried it on a three-year-old workstation (dual-core Xeon 5272) with a C2070:

[a] I changed NE from 4 to 2. The app only uses two streams (upStream, downStream) and uses one event per stream

[b] I increased numGrids from 4 to 50 so it measures mostly steady-state performance (minimizes the impact of startup/shutdown activity)

[c] I introduced an option to execute empty kernels instead of work kernels, so one can focus on the performance of the copies

[d] I added a counter for total bytes copied over PCIe, so we can compute aggregate PCIe bandwidth (a sketch of the resulting loop structure follows this list)
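Schematically, the resulting measurement loop looks something like this (a compact sketch of the structure, not the actual modified file; the EMPTY_KERNEL switch, kernel body, and error checking are simplified):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define EMPTY_KERNEL 0                              // 1 = launch the do-nothing kernel

    __global__ void workKernel(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];               // stand-in for the real per-grid work
    }
    __global__ void emptyKernel(float *, const float *, int) { }

    int main() {
        const int    numGrids  = 50;                    // mostly steady-state behaviour
        const int    gridElems = 1048576;               // 4 MB of floats per grid
        const size_t gridBytes = gridElems * sizeof(float);

        float *h_in, *h_out, *d_in, *d_out;
        cudaHostAlloc((void**)&h_in,  numGrids * gridBytes, cudaHostAllocDefault);
        cudaHostAlloc((void**)&h_out, numGrids * gridBytes, cudaHostAllocDefault);
        cudaMalloc((void**)&d_in,  numGrids * gridBytes);
        cudaMalloc((void**)&d_out, numGrids * gridBytes);

        cudaStream_t upStream, downStream;
        cudaStreamCreate(&upStream);
        cudaStreamCreate(&downStream);

        cudaEvent_t start, stop, kernelDone;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventCreate(&kernelDone);                   // lets downStream wait for the kernel

        size_t totalBytesCopied = 0;
        cudaEventRecord(start, 0);
        for (int g = 0; g < numGrids; g++) {
            size_t off = (size_t)g * gridElems;
            cudaMemcpyAsync(d_in + off, h_in + off, gridBytes,
                            cudaMemcpyHostToDevice, upStream);
    #if EMPTY_KERNEL
            emptyKernel<<<gridElems / 256, 256, 0, upStream>>>(d_out + off, d_in + off, gridElems);
    #else
            workKernel<<<gridElems / 256, 256, 0, upStream>>>(d_out + off, d_in + off, gridElems);
    #endif
            cudaEventRecord(kernelDone, upStream);
            cudaStreamWaitEvent(downStream, kernelDone, 0);
            cudaMemcpyAsync(h_out + off, d_out + off, gridBytes,
                            cudaMemcpyDeviceToHost, downStream);
            totalBytesCopied += 2 * gridBytes;          // one H2D plus one D2H per grid
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("totalBytesCopied=%zu bytes in %f ms -> %.3f GB/sec aggregate\n",
               totalBytesCopied, ms, totalBytesCopied / (ms * 1e6));
        return 0;
    }

The kernelDone event is what keeps each D2H copy from starting before its data is ready, while still letting the next grid’s H2D copy overlap with it.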

As you can see from the results below, with empty kernels and without streams, 419430400 bytes are transferred in 87.151169 ms, for a transfer rate of 4.812 GB/sec. With empty kernels plus streams, the same 419430400 bytes are transferred in 45.538528 ms, for a transfer rate of 9.210 GB/sec. So bi-directional transfer results in a 1.91x increase in aggregate PCIe bandwidth. Whether this is the best scaling achievable I do not know; it might be interesting to see data from a state-of-the-art host platform with much higher system memory bandwidth. There is probably also some residual impact on copy performance from launching any kernels at all, so adding a “no kernel” option might be something to try.

From the total run time with work kernels we see that we get overlap with the kernel executions as well, in addition to the overlapping copies. Without streams, the difference in runtime between running with empty kernels and work kernels is 4.605 ms, but with streams it is only 1.742 ms, meaning about 62% of the kernel execution time (1 - 1.742/4.605 ≈ 0.62) is overlapped with the bi-directional copies.

Overall the numbers look pretty much as I would expect, with very significant but not perfect overlap.

~ $ ./dmatest

Running 50 kernels on total of N=52428800 elements = 209715200 bytes

gridDataSize = 4194304 bytes

Options:

(PINNED)

(NO STREAMS)

(EMPTY KERNEL)

Elapsed time 260.771973 [ms]

(allocation of host memory)

Elapsed time 1717.254150 [ms]

(randomInit of host memory)

Elapsed time 14.932640 [ms]

(allocation & memset of device memory)

Elapsed time 87.151169 [ms]

(total time (copying + empty kernels))

totalBytesCopied=419430400 bytes

FAILED

~ $ ./dmatest

Running 50 kernels on total of N=52428800 elements = 209715200 bytes

gridDataSize = 4194304 bytes

Options:

(PINNED)

(NO STREAMS)

(WORK KERNEL)

Elapsed time 261.496399 [ms]

(allocation of host memory)

Elapsed time 1719.575439 [ms]

(randomInit of host memory)

Elapsed time 14.961920 [ms]

(allocation & memset of device memory)

Elapsed time 91.756516 [ms]

(total time (copying + kernels))

totalBytesCopied=419430400 bytes

PASSED

~ $ ./dmatest

Running 50 kernels on total of N=52428800 elements = 209715200 bytes

gridDataSize = 4194304 bytes

Options:

(PINNED)

(STREAMS)

(EMPTY KERNEL)

Elapsed time 260.222778 [ms]

(allocation of host memory)

Elapsed time 1717.904419 [ms]

(randomInit of host memory)

Elapsed time 14.954496 [ms]

(allocation & memset of device memory)

Elapsed time 45.538528 [ms]

(total time (copying + empty kernels) with streams)

totalBytesCopied=419430400 bytes

FAILED

~ $ ./dmatest

Running 50 kernels on total of N=52428800 elements = 209715200 bytes

gridDataSize = 4194304 bytes

Options:

(PINNED)

(STREAMS)

(WORK KERNEL)

Elapsed time 260.363464 [ms]

(allocation of host memory)

Elapsed time 1716.765747 [ms]

(randomInit of host memory)

Elapsed time 14.955552 [ms]

(allocation & memset of device memory)

Elapsed time 47.280704 [ms]

(total time (copying + kernels) with streams)

totalBytesCopied=419430400 bytes

PASSED

dmatest.cu (8.46 KB)