Confused about CUDA p2pBandwidthLatencyTest sample

Hi everyone,

I am a little confused about the output of the CUDA “p2pBandwidthLatencyTest” sample.

I installed 3 NVIDIA A40 GPUs without NVLink and ran this example.

The reported bandwidth from device 0 to device 0 is about 640 GB/s. Should this figure be compared against the “GPU Memory Bandwidth” entry in the official specifications?


GPU Memory: 48 GB GDDR6 with error-correcting code (ECC)
GPU Memory Bandwidth: 696 GB/s
Interconnect:
  • NVIDIA NVLink: 112.5 GB/s (bidirectional)
  • PCIe Gen4 x16: 31.5 GB/s (bidirectional)
NVLink: 2-way low profile (2-slot)
Display Ports: 3x DisplayPort 1.4*
Max Power Consumption: 300 W
Form Factor: 4.4" (H) x 10.5" (L), Dual Slot
Thermal: Passive
vGPU Software Support: NVIDIA vPC/vApps, NVIDIA RTX Virtual Workstation, NVIDIA Virtual Compute Server
vGPU Profiles Supported: See the Virtual GPU Licensing Guide
NVENC | NVDEC: 1x | 2x (includes AV1 decode)
Secure and Measured Boot with Hardware Root of Trust: Yes
NEBS Ready: Level 3
Power Connector: 8-pin CPU

In addition, which specification should I compare the “0 to 1” and “0 to 2” entries of the Bidirectional P2P Bandwidth matrix against?

Is it the “Interconnect” entry, PCIe Gen4 x16?

Yes, the device 0 to 0 figure is a measurement of device memory bandwidth, and should be roughly similar to the report provided by bandwidthTest (the 3rd value reported, device to device).
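
As a rough sanity check, the measured 0 to 0 value can be expressed as a fraction of the datasheet peak. This is only a sketch using the two numbers quoted in this thread; `fraction_of_peak` is a made-up helper name, not something from the CUDA samples.

```python
# Figures quoted in this thread (not measured here).
measured_gb_s = 640.0  # device 0 to device 0, as reported by the sample
spec_gb_s = 696.0      # "GPU Memory Bandwidth" from the A40 specifications


def fraction_of_peak(measured, peak):
    """Return the measured bandwidth as a fraction of the theoretical peak."""
    return measured / peak


ratio = fraction_of_peak(measured_gb_s, spec_gb_s)
print(f"{ratio:.1%} of peak")  # prints "92.0% of peak"
```

A measured value somewhat below the datasheet number is expected, since the specification is a theoretical peak.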

Yes, the “0 to 1” and “0 to 2” entries measure the interconnect bandwidth (PCIe in your case, since no NVLink bridge is installed), with P2P enabled and transfers running simultaneously in both directions.
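
For the interconnect comparison, it may help to see where the 31.5 GB/s figure comes from. The sketch below assumes the standard PCIe Gen4 parameters (16 GT/s per lane, 128b/130b encoding) and reads the datasheet number as a per-direction rate; `pcie_gb_s_per_direction` is a hypothetical helper written for this post. It ignores packet and protocol overhead, so real measurements will come in lower.

```python
def pcie_gb_s_per_direction(gt_per_s, lanes, enc_payload=128, enc_total=130):
    """Raw link bandwidth in GB/s for one direction, after line encoding."""
    bits_per_s = gt_per_s * 1e9 * lanes * enc_payload / enc_total
    return bits_per_s / 8 / 1e9


# PCIe Gen4 x16: 16 GT/s per lane, 16 lanes.
one_way = pcie_gb_s_per_direction(16, 16)
both_ways = 2 * one_way  # upper bound when both directions are busy at once

print(f"one direction: {one_way:.1f} GB/s, both directions: {both_ways:.1f} GB/s")
# prints "one direction: 31.5 GB/s, both directions: 63.0 GB/s"
```

Under these assumptions, the unidirectional P2P matrix is bounded by roughly 31.5 GB/s and the bidirectional matrix by roughly 63 GB/s.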

BTW, you have the source code for this sample, so you can confirm both points yourself.