DGX-A100, Question of gdr_copy readBW / writeBW

Hello.

I’m doing DGX-A100 MPI performance evaluation by using OpenMPI 4.1.1 (with UCX 1.10.1).

  • Benchmark tool is osu_bw (5.7.1)
  • OS : Ubuntu 20.04
  • CUDA 11.2 (Driver version: 460.73.01)

As you know, UCX supports multiple types of device-to-device communication for NVIDIA GPUs, such as cuda_ipc, cuda_copy, and gdr_copy.
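For reference, I select the transports at run time roughly like this (a sketch of my command line; the osu_bw path is a placeholder for my installation):

```shell
# Restrict UCX to specific transports when running osu_bw with device buffers.
# gdr_copy requires the gdrdrv kernel module to be loaded.
mpirun -np 2 \
  -x UCX_TLS=sm,cuda_copy,gdr_copy \
  ./osu_bw -d cuda D D
```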

I have a question regarding gdr_copy performance on the DGX-A100 server.

In my understanding, the asymmetric gdr_copy bandwidth between D-H (read) and H-D (write) is mainly caused by the MMIO memory configuration of the installed system architecture.
However, on the DGX-A100 there are no uncacheable MTRR regions covering GPU memory, so the difference between read bandwidth and write bandwidth is not explainable to me.

Could you describe in detail where the bandwidth difference comes from?
Also, could you let me know when gdr_copy is beneficial compared to cudaMemcpy?
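For context, the gdr_copy path I am asking about is roughly the following (a minimal sketch, not my exact code; error handling is omitted, and it assumes the gdrcopy headers and gdrdrv module are installed):

```c
#include <gdrapi.h>
#include <stddef.h>

/* d_ptr is a CUDA device pointer value (as an address), size its length. */
void sketch(unsigned long d_ptr, const char *host_src, char *host_dst, size_t size)
{
    gdr_t g = gdr_open();                       /* open the gdrdrv device   */
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, size, 0, 0, &mh);  /* pin the GPU pages        */

    void *map = NULL;
    gdr_map(g, mh, &map, size);                 /* CPU mapping of BAR1 window */

    /* H->D: CPU stores into the mapping (the "write BW" in copybw) */
    gdr_copy_to_mapping(mh, map, host_src, size);

    /* D->H: CPU loads from the same mapping (the "read BW" in copybw) */
    gdr_copy_from_mapping(mh, host_dst, map, size);

    gdr_unmap(g, mh, map, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
}
```

The write path is CPU stores into the BAR mapping and the read path is CPU loads from it, which is why I am asking where the large read/write asymmetry comes from.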

Please leave a comment if you need additional information or if anything in my question is unclear.

copybw shows the following output:

GPU id:0; name: A100-SXM4-40GB; Bus id: 0000:07:00
GPU id:1; name: A100-SXM4-40GB; Bus id: 0000:0f:00
GPU id:2; name: A100-SXM4-40GB; Bus id: 0000:47:00
GPU id:3; name: A100-SXM4-40GB; Bus id: 0000:4e:00
GPU id:4; name: A100-SXM4-40GB; Bus id: 0000:87:00
GPU id:5; name: A100-SXM4-40GB; Bus id: 0000:90:00
GPU id:6; name: A100-SXM4-40GB; Bus id: 0000:b7:00
GPU id:7; name: A100-SXM4-40GB; Bus id: 0000:bd:00
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 7ffb43200000
map_d_ptr: 0x7ffb657a6000
info.va: 7ffb43200000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7ffb657a6000
writing test, size=131072 offset=0 num_iters=10000
write BW: 14567.1MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 416.767MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

The DGX-A100 server shows the following “cat /proc/mtrr” output:

reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size=  512MB, count=1: write-back
reg02: base=0x0a8000000 ( 2688MB), size=   64MB, count=1: write-back
reg03: base=0x0ff000000 ( 4080MB), size=   16MB, count=1: write-protect
reg04: base=0x0a0000000 ( 2560MB), size=  128MB, count=1: write-back
reg05: base=0x0753b0000 ( 1875MB), size=   64KB, count=1: uncachable

Result of nvidia-smi topo -m: