I’m using nvbandwidth to measure memcpy performance across NVLink-C2C on a GH200 (900 GB/s bidirectional).
Command: ./nvbandwidth --testcase 0 1 2 3 16 17 32 --testSamples 1 --useMean -b 20480
As I understand it, the SM copy tests should outperform the Copy Engine (CE) tests. However, I am seeing a drastic drop in performance for buffer sizes of 20 GB and above. What could be the reason for this?
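For reference, here is the arithmetic behind the run (a minimal sketch, not output from nvbandwidth itself): the `-b` flag is in MiB, and each copy test needs at least a source and a destination buffer, so a 20 GiB buffer implies a ~40 GiB footprint on a ~96 GiB device.

```python
# Sanity-check whether the buffers from the nvbandwidth command above fit in
# GPU memory. Assumes each copy test allocates one source and one destination
# buffer of the size given by -b (in MiB).

def copy_footprint_mib(buffer_mib: int, num_buffers: int = 2) -> int:
    """Total MiB needed for a copy using num_buffers buffers of buffer_mib each."""
    return buffer_mib * num_buffers

gpu_memory_mib = 97871   # from the nvidia-smi output below
buffer_mib = 20480       # the -b value used in the command (20 GiB)

needed = copy_footprint_mib(buffer_mib)
print(f"{needed} MiB needed of {gpu_memory_mib} MiB "
      f"({needed / gpu_memory_mib:.0%} of device memory)")
```

So the buffers themselves fit comfortably; the slowdown at 20 GB is not simple memory exhaustion, which is why I am asking what else could cause it.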
Hardware details:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   27C    P0             80W /  900W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+