Memcpy performance on GH200

I’m using nvbandwidth to understand memcpy performance across NVLink-C2C on a GH200 (900 GB/s bidirectional).

Command: ./nvbandwidth --testcase 0 1 2 3 16 17 32 --testSamples 1 --useMean -b 20480
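For context, here is a rough sanity check of what that command implies. It is only a sketch under two assumptions: that `-b` is given in MiB (as the nvbandwidth help text describes the buffer size option), and that the 900 GB/s bidirectional figure splits evenly into 450 GB/s per direction. The helper names are made up for illustration, and the result is an ideal lower bound on copy time, not a measurement.

```python
# Back-of-the-envelope numbers for the nvbandwidth run above.
# Assumptions (not measured): -b is in MiB, and the 900 GB/s
# bidirectional NVLink-C2C figure is 450 GB/s in each direction.

MIB = 1024 ** 2  # bytes per MiB


def buffer_bytes(buffer_mib: int) -> int:
    """Convert a buffer size in MiB to bytes."""
    return buffer_mib * MIB


def ideal_copy_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Time to move num_bytes at a given bandwidth (GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


buffer_mib = 20480                  # value passed to -b
c2c_bidirectional_gb_s = 900.0      # quoted NVLink-C2C figure
per_direction_gb_s = c2c_bidirectional_gb_s / 2
gpu_memory_mib = 97871              # total reported by nvidia-smi

size = buffer_bytes(buffer_mib)
print(f"buffer size: {size / 1024**3:.0f} GiB ({size / 1e9:.2f} GB)")
print(f"share of GPU memory per buffer: {buffer_mib / gpu_memory_mib:.1%}")
print(f"ideal one-way copy time: "
      f"{ideal_copy_seconds(size, per_direction_gb_s) * 1e3:.1f} ms")
```

With `--testSamples 1`, each reported number comes from a single copy of this size, so one slow run is not averaged out.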

As I understand it, the SM-based copies should outperform the Copy Engine (CE) copies. In this case, however, I am seeing a drastic drop in SM copy performance for buffer sizes of 20 GB and above. What could be the reason for this?

Hardware details:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   27C    P0             80W /  900W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |

You may want to post this over here, where the authors of this blog post may see it: