How to measure bandwidth from pinned host memory to device memory on AWS A100 (p4d.24xlarge)?

I want to measure the bandwidth from pinned host memory to device memory on an NVIDIA A100. An AWS p4d.24xlarge instance provides 8 NVIDIA A100 GPUs with PCIe 4.0 x16 support, so the ideal bandwidth should be 31.5 GB/s. However, I only get about 13 GB/s (from pinned host memory) when running the code below from the NVIDIA developer blog.

code-samples/profile.cu at master · NVIDIA-developer-blog/code-samples · GitHub

Is there a problem with this code, or is there another reason why the speed cannot reach the ideal 31.5 GB/s?
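For reference, the 31.5 GB/s figure can be reproduced from the PCIe 4.0 signalling rate (16 GT/s per lane) and its 128b/130b line encoding. This is only the raw link rate for one direction; packet headers and flow control reduce what a copy can actually achieve, so measured values always sit somewhat below it.

```latex
\[
16\ \tfrac{\text{GT}}{\text{s}} \times \frac{128}{130} \times 16\ \text{lanes} \times \frac{1\ \text{byte}}{8\ \text{bits}} \approx 31.5\ \text{GB/s}
\]
```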

Hi there @15801191730 and welcome to the NVIDIA developer forums!

I am definitely not an expert on AWS capabilities, so a question to start with: does AWS guarantee full PCIe x16 support for all 8 GPUs? That would require 128 lanes, but if that AWS instance still uses Intel Xeon Platinum 8275CL CPUs, those only seem to support 48 lanes each, giving 96 in total.

I might be mistaken in my assessment here, but that would be worth checking I suppose.
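One way to check this is to ask the driver which link each GPU has negotiated. The following is a minimal sketch using NVML, assuming `nvml.h` and `libnvidia-ml` are available on the instance; the helper `theoreticalGBps` is hypothetical and only converts generation and width into a raw link rate. Whether a virtualized instance reports the physical link accurately is an assumption, and on some systems the current generation can drop while the GPU is idle, so it is best read while a transfer is running.

```cuda
#include <cstdio>
#include <nvml.h>

// Hypothetical helper: raw one-direction link rate in GB/s for a PCIe
// generation and lane count (8b/10b encoding for gen 1-2, 128b/130b for 3-5).
static double theoreticalGBps(unsigned int gen, unsigned int width) {
    double gtps, eff;
    switch (gen) {
        case 1: gtps = 2.5;  eff = 0.8;           break;
        case 2: gtps = 5.0;  eff = 0.8;           break;
        case 3: gtps = 8.0;  eff = 128.0 / 130.0; break;
        case 4: gtps = 16.0; eff = 128.0 / 130.0; break;
        case 5: gtps = 32.0; eff = 128.0 / 130.0; break;
        default: return 0.0;  // unknown generation
    }
    return gtps * eff * width / 8.0;
}

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    unsigned int count = 0;
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS) count = 0;

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;

        unsigned int curGen = 0, curWidth = 0, maxGen = 0, maxWidth = 0;
        // Link the GPU is currently running at.
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
        // Best link this GPU and slot combination could negotiate.
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);

        printf("GPU %u: current gen %u x%u (%.1f GB/s raw), max gen %u x%u\n",
               i, curGen, curWidth, theoreticalGBps(curGen, curWidth),
               maxGen, maxWidth);
    }
    nvmlShutdown();
    return 0;
}
```

This should build with something like `nvcc pcie_link.cu -lnvidia-ml` (the file name is arbitrary). If the reported link is narrower or of a lower generation than PCIe 4.0 x16, the 31.5 GB/s target does not apply to that GPU.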

Still, PCIe x8 (about 15.75 GB/s theoretical at PCIe 4.0) should yield around 2 GB/s more than what you measured, and I am not sure how much of that difference can be attributed to overhead. But then again, those code samples are 7 years old.
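For comparison, a pinned host-to-device measurement can be sketched with the CUDA runtime API as follows. This is not the blog's code, just a minimal sketch: the transfer size (256 MiB), the repetition count (20), and the helper `gbPerSec` are assumptions chosen for illustration. It times a batch of large copies with CUDA events after one warm-up copy, so that one-time setup cost and per-call latency weigh less on the result.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical helper: bytes moved in `ms` milliseconds, as GB/s (1 GB = 1e9 bytes).
static double gbPerSec(size_t bytes, float ms) {
    return (double)bytes / 1e9 / ((double)ms / 1e3);
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB per copy (assumed size)
    const int reps = 20;              // number of timed copies (assumed)

    void *hPinned = nullptr, *dBuf = nullptr;
    CHECK(cudaMallocHost(&hPinned, bytes));  // page-locked (pinned) host buffer
    CHECK(cudaMalloc(&dBuf, bytes));
    memset(hPinned, 0, bytes);               // touch the host pages once

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy, excluded from the timing.
    CHECK(cudaMemcpy(dBuf, hPinned, bytes, cudaMemcpyHostToDevice));

    // Time `reps` back-to-back copies on the default stream.
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpyAsync(dBuf, hPinned, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Pinned host-to-device: %.2f GB/s\n", gbPerSec(bytes * reps, ms));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dBuf));
    CHECK(cudaFreeHost(hPinned));
    return 0;
}
```

It should build with something like `nvcc -O2 h2d_bw.cu -o h2d_bw` (the file name is arbitrary). If a large-transfer measurement like this still lands near 13 GB/s, the limit is more likely the link or platform than the age of the sample code.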

If this does not help answer your question, then I think the people in the DGX user forum might be better suited to help.

The same goes for the CUDA team, who might be able to provide a more up-to-date code example for measuring pinned host memory bandwidth.

You can move your topic to those forum categories if you want.

Thanks!