Unstable PCIe bandwidth


We have code that performs several transfers across PCIe, and we noticed that one of the largest shows very unstable bandwidth. Why could this be happening?

Without any information about your system hardware and your application, one can only offer wild guesses. Off the top of my head:

(1) Transfer size. Because PCIe uses packetized transport with fixed overheads, larger block sizes will generally have higher throughput. Block sizes > 16 MB are typically required to achieve maximum throughput. For host->device transfers, small (I think <= 64 KB) transfers may be very fast since the data may be embedded into the GPU’s command buffer stream instead of being moved by a separate DMA command.
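To illustrate the size effect, here is a minimal sketch (assuming a single CUDA device and a pinned host buffer; the 64 KB to 64 MB range is an arbitrary choice) that times host-to-device copies at doubling block sizes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t maxBytes = 64u << 20;            // 64 MB upper bound (assumption)
    void *hBuf, *dBuf;
    cudaMallocHost(&hBuf, maxBytes);              // pinned host buffer
    cudaMalloc(&dBuf, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Double the transfer size from 64 KB up to 64 MB and report throughput.
    for (size_t bytes = 64u << 10; bytes <= maxBytes; bytes *= 2) {
        cudaEventRecord(start);
        cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%8zu KB : %8.2f MB/s\n", bytes >> 10,
               (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return 0;
}
```

On typical systems the reported throughput should climb with block size and flatten out somewhere in the multi-megabyte range.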

(2) The PCIe transfers are from/to pageable memory, rather than pinned, physically contiguous host-side buffers. This may mean that transfers must be broken up into small pieces. Each piece is moved between pageable memory and a pinned buffer maintained by the CUDA driver, in addition to the DMA transfers between the pinned buffer and the GPU. These host-side copies (within the system memory of the host) can vary greatly in speed.
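A quick way to see the pageable-vs-pinned difference is a side-by-side timing sketch like the one below (the 32 MB size is an assumption; error checking is omitted for brevity):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time a single host-to-device copy from the given host buffer.
static float timeCopy(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 32u << 20;               // 32 MB test size (assumption)
    void *dBuf, *pinned;
    void *pageable = malloc(bytes);               // ordinary pageable allocation
    cudaMalloc(&dBuf, bytes);
    cudaMallocHost(&pinned, bytes);               // page-locked allocation

    printf("pageable: %.2f ms\n", timeCopy(dBuf, pageable, bytes));
    printf("pinned  : %.2f ms\n", timeCopy(dBuf, pinned, bytes));

    cudaFreeHost(pinned);
    cudaFree(dBuf);
    free(pageable);
    return 0;
}
```

The pinned copy avoids the driver's internal staging step, so it is usually both faster and much more consistent run to run.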

(3) PCIe transfer throughput between GPU and host system can be influenced and limited by the performance of the system memory controller, for example if other applications are running and consuming system memory bandwidth. On some systems a PCIe link running at full speed can saturate a good portion of total system memory bandwidth.

(4) Multi-socket systems may show reduced throughput for transfers across PCIe if a GPU needs to retrieve/deposit data from/to the “far” CPU with its attached “far” memory. In this case, try to control CPU and memory affinity with a system utility such as numactl.


This is the configuration:

Windows 8 and 10 with WDDM (no TCC, since we also do OpenGL and need a screen plugged)
Visual Studio 2012
CUDA 7.0
Quadro K4200 and Quadro M4000
Compiling in both Debug and Release gives the same effect.

The transfer with unstable rates is part of a peer-to-peer CUDA call, which the runtime implements as two ordinary transfers through pinned memory. In fact, the Device-to-Host and Host-to-Device halves overlap very well and take almost the same time (when one takes longer, so does the other).

I don’t have hands-on experience with peer-to-peer transfers across PCIe. For the benefit of people who do and read this thread, could you state what the observed transfer sizes and their respective throughput rates are? What is your definition of “very unstable bandwidth”?

What do you mean by “part” of a peer-to-peer call? To my knowledge, an actual peer-to-peer access would directly transfer data from one GPU to the other; only where that isn’t possible does the data get copied via the host’s system memory, and in the process it would be subject to impact from items (3) and (4) of the list in my earlier post. If so, the first thing you would probably want to do is find out how to achieve direct GPU to GPU transfers.

About Peer to Peer:

We can’t use direct GPU to GPU transfers, because on Windows they require the TCC driver on both GPUs. And we can’t use TCC on both GPUs because we need OpenGL on at least one of them, which does not work under the TCC driver.

Therefore, the only behavior we can achieve with peer-to-peer CUDA calls is the one that stages through host memory. That is not so terrible, as long as the bandwidth is decent and the Device-to-Host and Host-to-Device halves overlap with a small delay.
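For reference, the selection between the two paths can be sketched as follows. `cudaMemcpyPeer` works either way: it takes the direct route when peer access is available and otherwise stages through pinned host memory, as described above. The device IDs 0 and 1 and the buffer size are assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);    // can device 0 reach device 1 directly?
    printf("direct P2P 0->1 available: %d\n", canAccess);

    const size_t bytes = 24u << 20;               // roughly the 23.7 MB transfer discussed
    void *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // direct GPU-to-GPU path (TCC on Windows)
    }
    // Without peer access (e.g. under WDDM), the runtime performs this copy
    // as a Device-to-Host plus Host-to-Device pair via pinned host memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```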

About “very unstable bandwidth”:

I consider a variation from 7 ms (or less) up to 100 ms for transferring 23.7 MB across PCIe to be very high: 23.7 MB in 7 ms is roughly 3.4 GB/s, while the same transfer in 100 ms is only about 0.24 GB/s. That is what I mean by very high instability in the observed bandwidth among different transfers of the same size.

In fact, the difference in bandwidth is measured between different iterations of the same peer-to-peer call.

About some extra info we gathered about the topic:

We found that after hibernating the machine under Windows, the behavior described in this post occurs severely and frequently. If the machine has not hibernated since the last fresh start, it happens only occasionally, at a level that is acceptable for us.

So in a way, we solved the problem by not hibernating our development machines.

My guess is that this has something to do with the GPU driver being unloaded and reloaded across hibernation.

Hope the information is useful. Could you confirm whether hibernation is a known issue?

The Windows systems I use generally never hibernate (even the one at home pretty much runs 24/7/365), so I do not have any first hand experience. I cannot think of any mechanism by which a preceding hibernation would have an impact on PCIe transfer rates in subsequently running applications, so I cannot even offer a working hypothesis.

If you can develop a short, simple, complete reproducer code, and you can demonstrate that it behaves correctly before a hibernate event and incorrectly after resuming from hibernate, then I would suggest that you file a bug at developer.nvidia.com.
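Along those lines, a starting point for such a reproducer might look like the sketch below (device IDs, the iteration count, and the ~24 MB size are assumptions): it times repeated peer copies and prints the spread, to be run once before and once after a hibernate cycle:

```cuda
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 24u << 20;               // ~23.7 MB, matching the reported transfer
    void *src, *dst;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time many iterations of the same staged peer-to-peer copy
    // and record the best and worst case.
    float minMs = FLT_MAX, maxMs = 0.0f;
    for (int i = 0; i < 100; i++) {
        cudaEventRecord(start);
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < minMs) minMs = ms;
        if (ms > maxMs) maxMs = ms;
    }
    printf("min %.2f ms, max %.2f ms\n", minMs, maxMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```

A large gap between the min and max timings after resuming from hibernation, but not before, would be exactly the kind of evidence worth attaching to a bug report.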