DGX-A100, Question of gdr_copy readBW / writeBW

Hello.

I’m doing DGX-A100 MPI performance evaluation by using OpenMPI 4.1.1 (with UCX 1.10.1).

  • Benchmark tool is osu_bw (5.7.1)
  • OS : Ubuntu 20.04
  • CUDA 11.2 (Driver version: 460.73.01)

As you know, UCX supports multiple types of device-to-device communication for NVIDIA GPUs, such as cuda_ipc, cuda_copy, and gdr_copy.
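For reference, I select the transports at run time roughly like this (a sketch of my command line; the osu_bw path is a placeholder for my installation):

```shell
# Restrict UCX to specific transports when running osu_bw with device buffers.
# gdr_copy requires the gdrdrv kernel module to be loaded.
mpirun -np 2 \
  -x UCX_TLS=sm,cuda_copy,gdr_copy \
  ./osu_bw -d cuda D D
```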

I have a question regarding gdr_copy performance on the DGX-A100 server.

In my understanding, the asymmetric gdr_copy bandwidth between D-H (read) and H-D (write) is mainly caused by the MMIO memory configuration of the installed system architecture.
However, on the DGX-A100 there are no uncacheable MTRR regions covering GPU memory, so the difference between read bandwidth and write bandwidth is not explainable to me.

Could you describe in detail where the bandwidth difference comes from?
Also, could you let me know when gdr_copy is beneficial compared to cudaMemcpy?
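For context, the gdr_copy path I am asking about is roughly the following (a minimal sketch, not my exact code; error handling is omitted, and it assumes the gdrcopy headers and gdrdrv module are installed):

```c
#include <gdrapi.h>
#include <stddef.h>

/* d_ptr is a CUDA device pointer value (as an address), size its length. */
void sketch(unsigned long d_ptr, const char *host_src, char *host_dst, size_t size)
{
    gdr_t g = gdr_open();                       /* open the gdrdrv device   */
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, size, 0, 0, &mh);  /* pin the GPU pages        */

    void *map = NULL;
    gdr_map(g, mh, &map, size);                 /* CPU mapping of BAR1 window */

    /* H->D: CPU stores into the mapping (the "write BW" in copybw) */
    gdr_copy_to_mapping(mh, map, host_src, size);

    /* D->H: CPU loads from the same mapping (the "read BW" in copybw) */
    gdr_copy_from_mapping(mh, host_dst, map, size);

    gdr_unmap(g, mh, map, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
}
```

The write path is CPU stores into the BAR mapping and the read path is CPU loads from it, which is why I am asking where the large read/write asymmetry comes from.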

Please leave a comment if you need additional information or if anything in my question is unclear.

copybw shows the following output:

GPU id:0; name: A100-SXM4-40GB; Bus id: 0000:07:00
GPU id:1; name: A100-SXM4-40GB; Bus id: 0000:0f:00
GPU id:2; name: A100-SXM4-40GB; Bus id: 0000:47:00
GPU id:3; name: A100-SXM4-40GB; Bus id: 0000:4e:00
GPU id:4; name: A100-SXM4-40GB; Bus id: 0000:87:00
GPU id:5; name: A100-SXM4-40GB; Bus id: 0000:90:00
GPU id:6; name: A100-SXM4-40GB; Bus id: 0000:b7:00
GPU id:7; name: A100-SXM4-40GB; Bus id: 0000:bd:00
selecting device 0
testing size: 131072
rounded size: 131072
device ptr: 7ffb43200000
map_d_ptr: 0x7ffb657a6000
info.va: 7ffb43200000
info.mapped_size: 131072
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer:0x7ffb657a6000
writing test, size=131072 offset=0 num_iters=10000
write BW: 14567.1MB/s
reading test, size=131072 offset=0 num_iters=100
read BW: 416.767MB/s
unmapping buffer
unpinning buffer
closing gdrdrv

The DGX-A100 server shows the following “cat /proc/mtrr” output:

reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size=  512MB, count=1: write-back
reg02: base=0x0a8000000 ( 2688MB), size=   64MB, count=1: write-back
reg03: base=0x0ff000000 ( 4080MB), size=   16MB, count=1: write-protect
reg04: base=0x0a0000000 ( 2560MB), size=  128MB, count=1: write-back
reg05: base=0x0753b0000 ( 1875MB), size=   64KB, count=1: uncachable

Result of nvidia-smi topo -m: