I am evaluating MPI performance on a DGX-A100 using Open MPI 4.1.1 (with UCX 1.10.1).
- Benchmark tool: osu_bw (OSU Micro-Benchmarks 5.7.1)
- OS: Ubuntu 20.04
- CUDA 11.2 (driver version: 460.73.01)
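For context, a typical osu_bw launch for this kind of device-to-device measurement looks roughly like the following (the `UCX_TLS` transport list is an illustrative assumption, not my exact command line):

```shell
# Illustrative only: run osu_bw between two CUDA device buffers (D D),
# restricting UCX to the GPU-related transports under discussion.
# The transport list shown here is an assumption, not the command actually used.
mpirun -np 2 \
    -x UCX_TLS=sm,cuda_copy,cuda_ipc,gdr_copy \
    ./osu_bw -d cuda D D
```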
As you know, UCX supports multiple transports for NVIDIA GPU device-to-device communication, such as cuda_ipc, cuda_copy, and gdr_copy.
I have a question regarding gdr_copy performance on the DGX-A100 server.
My understanding was that the asymmetric gdr_copy bandwidth between D→H (read) and H→D (write) is mainly caused by how the MMIO memory regions are configured on the installed system architecture.
However, on the DGX-A100 the MTRRs show no uncacheable regions covering GPU memory, so the difference between read bandwidth and write bandwidth is not explainable to me.
Could you explain in detail where the bandwidth difference comes from?
Also, could you let me know when gdr_copy is beneficial compared to using cudaMemcpy?
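For reference, the gdr_copy path that copybw exercises (pin, map, CPU stores for H→D, CPU loads for D→H) looks roughly like this minimal sketch of the gdrcopy user-space API; error handling is omitted and the wrapper function name is mine:

```c
// Minimal sketch of the gdrcopy API sequence behind copybw.
// Assumes d_ptr is a GPU buffer of `size` bytes already allocated
// with cuMemAlloc/cudaMalloc; all error checking is omitted.
#include <cuda.h>
#include <gdrapi.h>
#include <stddef.h>

void gdr_copy_sketch(CUdeviceptr d_ptr, void *h_buf, size_t size)
{
    gdr_t g = gdr_open();                       /* open /dev/gdrdrv          */

    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, size, 0, 0, &mh);  /* pin GPU pages via BAR1    */

    void *map_d_ptr = NULL;
    gdr_map(g, mh, &map_d_ptr, size);           /* CPU-visible mapping of GPU
                                                   memory (wc_mapping: 1 in
                                                   the copybw output below)  */

    /* H->D ("writing test"): CPU stores into the mapping */
    gdr_copy_to_mapping(mh, map_d_ptr, h_buf, size);

    /* D->H ("reading test"): CPU loads from the mapping */
    gdr_copy_from_mapping(mh, h_buf, map_d_ptr, size);

    gdr_unmap(g, mh, map_d_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);                               /* "closing gdrdrv"          */
}
```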
Please leave a comment if you need additional information or if anything in my question is unclear.
The gdrcopy copybw test shows the following output:
GPU id:0; name: A100-SXM4-40GB; Bus id: 0000:07:00
GPU id:1; name: A100-SXM4-40GB; Bus id: 0000:0f:00
GPU id:2; name: A100-SXM4-40GB; Bus id: 0000:47:00
GPU id:3; name: A100-SXM4-40GB; Bus id: 0000:4e:00
GPU id:4; name: A100-SXM4-40GB; Bus id: 0000:87:00
GPU id:5; name: A100-SXM4-40GB; Bus id: 0000:90:00
GPU id:6; name: A100-SXM4-40GB; Bus id: 0000:b7:00
GPU id:7; name: A100-SXM4-40GB; Bus id: 0000:bd:00
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 7ffb43200000
map_d_ptr: 0x7ffb657a6000
info.va: 7ffb43200000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7ffb657a6000
writing test, size=131072 offset=0 num_iters=10000
write BW: 14567.1MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 416.767MB/s
unmapping buffer
unpinning buffer
closing gdrdrv
The DGX-A100 server shows the following “cat /proc/mtrr” results:
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size=  512MB, count=1: write-back
reg02: base=0x0a8000000 ( 2688MB), size=   64MB, count=1: write-back
reg03: base=0x0ff000000 ( 4080MB), size=   16MB, count=1: write-protect
reg04: base=0x0a0000000 ( 2560MB), size=  128MB, count=1: write-back
reg05: base=0x0753b0000 ( 1875MB), size=   64KB, count=1: uncachable
Result of nvidia-smi topo -m: