Device Memory Bandwidth

Hi, I’m trying to get a better understanding of how host-device transfers affect a kernel’s device memory traffic, and vice versa. For example, on the V100 with a theoretical bandwidth of 900 GB/s, I’m assuming that’s 450 GB/s in each direction. When I’m transferring between host and device it’s using about 10 GB/s in each direction.

  1. Does this come out of the 450 GB/s? When a kernel writes to or reads from device memory, is it affected by host-to-device transfers that are happening concurrently?

  2. With nvprof, if I use the dram_utilization, dram_read_throughput, and dram_write_throughput metrics, the numbers are per-kernel. Is there a way in nvprof to see how much bandwidth is being used by the host-device transfers?

  3. I’ve seen lots of benchmarks showing that bandwidthTest is a good proxy for checking the maximum PCIe transfer rate. It also lets you test varying transfer sizes, and obviously the smaller ones are worse. My question is: if you have multiple threads doing many small transfers, is the rate still poor? My instinct says yes, since it’s the small sizes that have a lot of setup time, but I wasn’t sure if I could get close to the theoretical bandwidth with something like a lot of 2^18-byte transfers.

  4. On the V100 I’m seeing significantly lower transfer rates than on the 1080 Ti:

Device 0: Tesla V100-PCIE-16GB
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 10968.5

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 7533.3

Device 1: GeForce GTX 1080 Ti
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12074.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12821.7

I tried disabling ECC using nvidia-smi -e 0 on the Tesla, but I still get the same results. How can I improve the V100 throughput?

Actually, you should be able to get nearly all of the available bandwidth in a single direction, and that is somewhat more efficient than bidirectional transfers, because bus turnaround costs something.

Yes.
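
A minimal kernel-side sketch of that effect (my own illustration, not from the thread; the array size and launch configuration are arbitrary choices): a write-only streaming kernel touches device memory in one direction only, while a copy kernel mixes reads and writes and so also pays for turnaround. How big the gap looks will depend on the GPU.

// Sketch: rough device-memory throughput of a write-only kernel vs. a
// read+write (copy) kernel. Array size and launch configuration are
// arbitrary assumptions for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void write_only(float *out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = 1.0f;                      // one direction: writes only
}

__global__ void copy(const float *in, float *out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];                     // mixed: reads and writes
}

int main()
{
    const size_t n = 1 << 28;               // 2^28 floats = 1 GiB per array
    float *a, *b;
    cudaMalloc((void **)&a, n * sizeof(float));
    cudaMalloc((void **)&b, n * sizeof(float));

    // Warm-up launch so one-time overhead is excluded from the timing.
    write_only<<<1024, 256>>>(a, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    write_only<<<1024, 256>>>(a, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("write-only : %.0f GB/s\n", n * sizeof(float) / (ms * 1e-3) / 1e9);

    cudaEventRecord(start);
    copy<<<1024, 256>>>(a, b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("read+write : %.0f GB/s\n", 2.0 * n * sizeof(float) / (ms * 1e-3) / 1e9);
    return 0;
}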

nvprof --print-gpu-trace lists the effective bandwidth achieved by an H->D or D->H transfer (although you won’t be able to collect that on the same run in which you are capturing metrics). Beyond that, I don’t understand the question.
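
If you need the number on the same run where nvprof is already collecting metrics, one workaround (just a sketch of mine; the 32 MiB size is picked to match bandwidthTest’s default) is to time the transfer yourself with CUDA events:

// Sketch: measure the effective bandwidth of a single H->D copy with CUDA
// events. The 32 MiB size is an arbitrary choice mirroring bandwidthTest.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;
    char *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);   // pinned, so this is a true DMA
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D: %.1f MB/s\n", bytes / (ms * 1e-3) / 1e6);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}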

A bunch of small transfers will still be less efficient.
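
To put a rough number on that for the 2^18 case mentioned in the question, here is a small sketch of the experiment (my own; pinned memory, a single stream, and the 256 MiB total are arbitrary assumptions) comparing one large copy with the same data moved in 2^18-byte chunks:

// Sketch: one 256 MiB H->D copy vs. the same bytes moved as 1024 chunks of
// 2^18 bytes each, using pinned memory and a single stream. Sizes are
// arbitrary assumptions; the point is to expose per-transfer overhead.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t chunk  = 1 << 18;           // 256 KiB per transfer
    const int    nchunk = 1024;
    const size_t total  = chunk * nchunk;    // 256 MiB overall

    char *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, total);  // pinned host buffer
    cudaMalloc((void **)&d_buf, total);
    cudaStream_t s;
    cudaStreamCreate(&s);

    using clk = std::chrono::steady_clock;

    // Warm-up so driver/context overhead is not part of either measurement.
    cudaMemcpyAsync(d_buf, h_buf, total, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    // One large transfer.
    auto t0 = clk::now();
    cudaMemcpyAsync(d_buf, h_buf, total, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);
    double big = std::chrono::duration<double>(clk::now() - t0).count();

    // Same bytes as many small transfers.
    t0 = clk::now();
    for (int i = 0; i < nchunk; ++i)
        cudaMemcpyAsync(d_buf + i * chunk, h_buf + i * chunk, chunk,
                        cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);
    double small = std::chrono::duration<double>(clk::now() - t0).count();

    printf("1 x %zu bytes    : %.1f GB/s\n", total, total / big / 1e9);
    printf("%d x %zu bytes : %.1f GB/s\n", nchunk, chunk, total / small / 1e9);
    return 0;
}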

11 GB/s bandwidth is pretty close to 12 GB/s bandwidth. But the 7.5 number looks low. What kind of system are these cards plugged into? Is the system certified by the OEM for use with V100 PCIE?

That 7.5 GB/sec number sure looks weird. If you run the app multiple times, do you always get this low result? Is the Tesla V100-PCIE definitely plugged into a PCIe gen3 x16 slot? Is this a properly configured server system with adequate airflow for the passively-cooled Tesla V100?

Assuming the Tesla V100-PCIE and the GTX 1080 Ti are plugged into the same machine, try physically swapping the cards in their slots to see whether that changes the results.

Thanks txbob and njuffa. I believe the 7 GB/s problem was caused by a NIC transferring data on the same PCIe switch as the V100. I disabled that and will run the test again, but I tried it on a different system and saw this:

Device 0: Tesla V100-PCIE-16GB
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12077.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12830.2

Thanks for the help!

njuffa or txbob, do you know if there are any tricks for combining data scattered in memory into a single copy? For example, if you have 1000 1024-byte buffers in different parts of memory, it would be nice to combine those into a single copy by index, instead of moving them all into a contiguous staging area. The cudaMemcpy functions don’t appear to have a way to do that, and maybe that’s because it’s just not possible. Or, possibly a more important question: will PCIe 4 be supported anytime soon in the CPU<->GPU direction? POWER9 looks promising with NVLINK, but it’s still pretty new and not a whole lot of information is out yet.

“Doctor, it hurts when I push here.” - “Don’t push.”

Avoid scattering data all over GPU memory. High memory throughput requires contiguous accesses, and even when you read through the texture path you want good 2D locality of the data. Build data structures accordingly. You can copy piece-wise contiguous memory with fixed stride using cudaMemcpy2D(), and the strides can be different for source and destination. While that can be a very convenient way of re-organizing data, it will likely be pretty slow: it doesn’t matter whether data is moved by DMA engine or by a program, the data access pattern is what makes or breaks performance.
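
To make the cudaMemcpy2D() point concrete, here is a minimal sketch (all sizes and pitches are made-up examples) that copies 1024 payload bytes out of each padded host row into a pitched device allocation, with different strides on the two sides:

// Sketch: strided copy with cudaMemcpy2D(), using a different pitch for the
// source (host) and destination (device). All sizes are made-up examples.
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t width   = 1024;   // payload bytes per row
    const size_t h_pitch = 1536;   // host rows padded to 1536 bytes
    const size_t rows    = 1000;

    char  *h_src = (char *)malloc(h_pitch * rows);
    char  *d_dst = NULL;
    size_t d_pitch = 0;

    // cudaMallocPitch() picks a device pitch that keeps rows well aligned.
    cudaMallocPitch((void **)&d_dst, &d_pitch, width, rows);

    // Copy 'width' bytes from each of 'rows' rows; source stride h_pitch,
    // destination stride d_pitch -- the two strides do not have to match.
    cudaMemcpy2D(d_dst, d_pitch, h_src, h_pitch,
                 width, rows, cudaMemcpyHostToDevice);

    cudaFree(d_dst);
    free(h_src);
    return 0;
}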

PCIe is a pipe. NVLINK is a much faster pipe. To first order, that is all a CUDA programmer needs to know. The throughput data for NVLINK (with Power 8) and NVLINK 2.0 (with Power 9) has been published, but I don’t have a link handy. PCIe gen4 will double throughput over PCIe gen3, but products (of any kind) that use it will probably not materialize until late in 2018 and it’s not clear when PCIe gen4 will appear in NVIDIA GPUs. By historical observation, NVIDIA never provides details about future products, except possibly under NDA.

In general, efficient high-performance computing requires minimizing data movement. Data movement of any kind is expensive (time consuming and energy consuming) compared to computation. Minimize it by whatever means available.

cudaMallocPitch() and cudaMemcpy2D()

That works for regular strides. I read the description as desiring randomly scattered storage accessed through an index vector.

Correct. The Pitch/2D versions wouldn’t help in this case.
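
For what it’s worth, the fallback people usually reach for in that situation (not something suggested above, and it still involves a staging pass, just one that runs at device-memory speed rather than over PCIe) is a small gather kernel that packs the scattered buffers through the pointer/index vector into one contiguous device buffer, followed by a single cudaMemcpy. A rough sketch, with all names and sizes made up:

// Sketch (hypothetical names throughout): pack 1000 scattered 1024-byte
// device buffers into one contiguous device staging buffer via a pointer
// table, then move the result to the host with a single cudaMemcpy.
#include <cuda_runtime.h>

#define NUM_BUFS  1000
#define BUF_BYTES 1024

__global__ void gather(char *dst, char *const *src_ptrs, int buf_bytes)
{
    int buf = blockIdx.x;                        // one block per buffer
    for (int i = threadIdx.x; i < buf_bytes; i += blockDim.x)
        dst[buf * buf_bytes + i] = src_ptrs[buf][i];
}

// d_ptrs: device array holding the scattered device pointers
// h_out:  pinned host buffer of NUM_BUFS * BUF_BYTES bytes
void copy_scattered_to_host(char *const *d_ptrs, char *h_out)
{
    char *d_staging;
    cudaMalloc((void **)&d_staging, (size_t)NUM_BUFS * BUF_BYTES);
    gather<<<NUM_BUFS, 256>>>(d_staging, d_ptrs, BUF_BYTES);
    cudaMemcpy(h_out, d_staging, (size_t)NUM_BUFS * BUF_BYTES,
               cudaMemcpyDeviceToHost);
    cudaFree(d_staging);
}

int main()
{
    // Set up NUM_BUFS scattered device buffers and the pointer table.
    char *h_ptrs[NUM_BUFS];
    for (int i = 0; i < NUM_BUFS; ++i)
        cudaMalloc((void **)&h_ptrs[i], BUF_BYTES);

    char **d_ptrs;
    cudaMalloc((void **)&d_ptrs, sizeof(h_ptrs));
    cudaMemcpy(d_ptrs, h_ptrs, sizeof(h_ptrs), cudaMemcpyHostToDevice);

    char *h_out;
    cudaMallocHost((void **)&h_out, (size_t)NUM_BUFS * BUF_BYTES);
    copy_scattered_to_host(d_ptrs, h_out);
    cudaDeviceSynchronize();
    return 0;
}

Whether that beats 1000 separate cudaMemcpy calls depends on the sizes involved, and it obviously does nothing to address njuffa’s broader point about avoiding the scattering in the first place.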

txbob and njuffa, do either of you understand why gdrcopy performs much worse than cudaMemcpy? I thought the idea behind it is you get lower latency/higher throughput by using the GPUDirect API, but I’m getting worse performance than simply using cudaMemcpy:

gdrcopy:

GPU id:0 name:Tesla V100-PCIE-16GB PCI domain: 0 bus: 97 device: 0
selecting device 0
testing size: 33554432
rounded size: 33554432
device ptr: 7fe377000000
bar_ptr: 0x7fe372fff000
info.va: 7fe377000000
info.mapped_size: 33554432
info.page_size: 65536
page offset: 0
user-space pointer:0x7fe372fff000
BAR writing test, size=33554432 offset=0 num_iters=10000
BAR1 write BW: 9744.98MB/s
BAR reading test, size=33554432 offset=0 num_iters=100
BAR1 read BW: 323.437MB/s

And cudaMemcpy:

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12078.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12827.8

The idea behind gdrcopy is to demonstrate data copying from a device that is not a GPU to a device that is a GPU. It is not intended to be a higher performance replacement for cudaMemcpy for host<->device transfers.

If it were simply higher performance across the board for that case, the CUDA architects would have switched cudaMemcpy to use that method.

Note what is said in the github introduction section:

While GPUDirect RDMA is meant for direct access to GPU memory from third-party devices, it is possible to use these same APIs to create perfectly valid CPU mappings of the GPU memory.

The advantage of a CPU driven copy is the very small overhead involved. That might be useful when low latencies are required.

Your measurements are all bandwidth measurements, not latency measurements.

The PCIE bus does not necessarily have uniform bandwidth for all 4 of these transfer scenarios:

  • H->D driven by host
  • H->D driven by device
  • D->H driven by host
  • D->H driven by device

when considering either a GPU as the “device” or an arbitrary “device”.

In general, cudaMemcpy uses the “by device” methods. gdrcopy uses the “by host” methods. They are not the same, and from a bandwidth perspective, I see no advertising that says gdrcopy is better than a device driven cudaMemcpy (again, if it were the case…)
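
For completeness, a latency-style measurement would look more like this sketch (a tiny pinned transfer timed on the CPU over many iterations; the 8-byte size and iteration count are arbitrary assumptions) than like the bulk-bandwidth numbers above:

// Sketch: average per-call latency of a tiny (8-byte) pinned H->D cudaMemcpy.
// This is the kind of number where a CPU-driven copy can win, unlike the
// bulk-bandwidth figures above. Size and iteration count are arbitrary.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int iters = 10000;
    char *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, 8);   // pinned
    cudaMalloc((void **)&d_buf, 8);

    // Warm up so one-time driver overhead is excluded from the measurement.
    cudaMemcpy(d_buf, h_buf, 8, cudaMemcpyHostToDevice);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, 8, cudaMemcpyHostToDevice);
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average H->D latency for an 8-byte copy: %.2f us\n", us / iters);
    return 0;
}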

Thanks! I interpreted that wrong and incorrectly assumed it had higher bandwidth as well.

txbob, does this also imply that the maximum bandwidth we can expect from a third-party device, such as an FPGA, to and from a GPU is what gdrcopy shows? I’ve seen a bunch of research papers from various sources trying to maximize throughput, but I never see them getting anywhere near the cudaMemcpy rates.

The author of GDRcopy is the expert in this area; I know relatively little about GPUdirect (and less than txbob, I am reasonably sure).

When you run the bandwidth test app, you are not measuring the throughput of the complete transfer chain between two PCIe devices, just PCIe device <-> system memory, which is not the same.

As txbob points out, GPUdirect is intended for situations where a non-GPU PCIe device communicates with a GPU PCIe device, and in my understanding in many cases the limit on maximum throughput between the devices will be the non-GPU PCIe device.

Compared to the “pedestrian” solution (non-GPU PCIe device -> system memory -> GPU PCIe device) GPUdirect offers better latency even where it would not improve maximum throughput, and that is often the advantage developers are after (I note that for many interfaces, lower latency helps improve effective throughput).

Thanks njuffa. I had just assumed that modern FPGAs with PCIe gen3 x16 interfaces should easily be able to hit the same rates that the GPU-to-GPU copies do. Most of the papers seem to indicate that there is something hidden going on where you can’t actually get close to the full rate because of throttling. The last one I read tried both the GPU as the master and the FPGA as the master, and the results were still fairly poor. I’ve also seen some tests where they do RDMA through Mellanox cards from GPU to GPU, and on a 100 Gbps adapter it saturates at 80 Gbps.

Once you have three entities in play (root complex in the CPU, two PCIe devices of different kind), I would imagine things become quite tricky: the combination of transfer sizes, amount of buffering in each component, plus communication latency make it challenging to reach the theoretical throughput.

Based on past experience working on embedded systems I know how difficult it can be to determine what’s actually limiting performance in such scenarios, even when you have a logic analyzer hooked up to the interconnect. I have not dealt with FPGAs (Xilinx, specifically) since the 1990s, so I cannot speak to interface issues with modern incarnations. You would really want to consult with an expert specialized in interconnects if you need to maximize PCIe throughput with currently available hardware.

You are quoting about 80 Gb/sec for “RDMA through Mellanox cards from GPU to GPU”. I take this to mean the even longer chain GPU #1 -> root complex CPU #1 -> IB adapter #1 -> IB adapter #2 -> root complex CPU #2 -> GPU #2. If that is what you meant, I would consider achieving an end-to-end throughput of 80 Gb/s a great result.

Hi njuffa, in this case there is a single Avago PCIe switch that both the Mellanox card and the GPU are on. The transfer should not be traversing the CPU at all. In the case you are talking about, where the GPU and NIC are hooked up directly to the CPU, I would also expect higher performance, since the CPU is technically not part of the transfer; it’s simply acting as a PCIe switch. But I don’t have enough experience with how the CPU’s PCIe capabilities would behave in that case. Certainly, if the data actually had to go through the CPU (out to memory or through the cache), I would expect lower performance.

I am not familiar with the internal architectures of PCIe switches, but presumably such a switch contains multiple root complexes, whereas a CPU typically contains just one? I am still not clear what kind of communication topology you are talking about; maybe draw an ASCII diagram of it?

In any event, this starts to go heavily into territory I know nothing about. I don’t know what end goal you are pursuing with this exploration, but it seems you should consult with an expert in modern interconnects.