Question about P2P transfer bandwidth between two RTX 2080s

I have a workstation with two RTX 2080 GPUs running on Windows 10.
Initially, I used the CUDA bandwidthTest.exe program to conduct separate bandwidth tests on the GPUs. Here are the results:

Next, I used the simpleP2P.exe sample program to test the P2P functionality between the GPUs, without an NVLink connection. The test results are displayed below:

Lastly, I employed the p2pBandwidthLatencyTest.exe to conduct the following test:

Now, I have a few questions:

  1. The simpleP2P test results indicate that Peer-to-Peer access is not supported between the GPUs. So how is the data behind the GPU-to-GPU bandwidth reported by p2pBandwidthLatencyTest.exe actually being transferred? Is it through PCIe?

  2. If the transfer is indeed through PCIe and the GPUs do not support P2P, is the data transfer between them accomplished using cudaMemcpyPeerAsync, or is it staged through system memory, i.e., first Device-to-Host (D2H) and then Host-to-Device (H2D)? If so, does system memory performance also affect the transfer bandwidth?

  3. The p2pBandwidthLatencyTest.exe results show that the Unidirectional results are nearly identical to the Bidirectional results. Why isn’t the Bidirectional result roughly twice the Unidirectional result?

Yes, it is going over PCIe.
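
If you want to confirm what simpleP2P is reporting, you can query the runtime directly. Here is a minimal sketch, assuming the two RTX 2080s are device ordinals 0 and 1:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal check of peer accessibility between two GPUs, assuming they are
// device ordinals 0 and 1. This mirrors what simpleP2P reports.
int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 access device 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // can device 1 access device 0's memory?
    printf("GPU0 -> GPU1 peer access: %s\n", can01 ? "supported" : "not supported");
    printf("GPU1 -> GPU0 peer access: %s\n", can10 ? "supported" : "not supported");
    return 0;
}
```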

The test may use cudaMemcpyPeerAsync. That call still works in a non-peer setting; it just uses a non-peer transfer path.
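
To illustrate, here is a minimal sketch of that kind of call, with an arbitrary 64 MiB buffer and device ordinals 0 and 1 assumed. The copy succeeds whether or not peer access is available; without it, the driver falls back to the non-peer path:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;   // 64 MiB, arbitrary size for illustration
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);         // source buffer on GPU 0
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);         // destination buffer on GPU 1

    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // dst lives on device 1, src on device 0; no cudaDeviceEnablePeerAccess
    // call is made here, so on a system without peer support the driver
    // routes the copy through a non-peer path.
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
    cudaStreamSynchronize(stream);
    printf("Device-to-device copy complete\n");

    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```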

Yes, typically a non-peer device-to-device transfer flows through a system memory buffer.
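
Conceptually, the non-peer path is similar to doing the staging yourself through a pinned host buffer, roughly as in the sketch below. The driver’s actual implementation is internal and pipelines the transfer in chunks, so treat this only as an illustration of the D2H-then-H2D idea:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;       // 64 MiB, arbitrary size for illustration
    void *d_src = nullptr, *d_dst = nullptr, *h_staging = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d_src, bytes);           // source on GPU 0
    cudaSetDevice(1);
    cudaMalloc(&d_dst, bytes);           // destination on GPU 1

    // Pinned (page-locked) host buffer used as the staging area in system memory.
    cudaMallocHost(&h_staging, bytes);

    // Step 1: device-to-host copy from GPU 0 into system memory.
    cudaSetDevice(0);
    cudaMemcpy(h_staging, d_src, bytes, cudaMemcpyDeviceToHost);

    // Step 2: host-to-device copy from system memory into GPU 1.
    cudaSetDevice(1);
    cudaMemcpy(d_dst, h_staging, bytes, cudaMemcpyHostToDevice);

    printf("Staged copy through system memory complete\n");

    cudaFreeHost(h_staging);
    cudaFree(d_dst);
    cudaSetDevice(0);
    cudaFree(d_src);
    return 0;
}
```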

It can, but most modern systems have enough CPU memory bandwidth that the effect on a single transfer like this, with nothing else going on, is usually not very noticeable.

There could be a number of factors affecting this, including WDDM batching in the case of GPUs in WDDM mode. System topology could also be a factor. I won’t be able to offer a specific answer.