cudaMemcpyPeerAsync behavior for different hardware

That’s abnormal.

That should not be necessary. The test is designed to run in less than a minute, as-is, on a proper modern platform (try it unmodified on your DGX A100 system if you like).

The results above marked with an asterisk, for example, are unexpected and point to a problem. I think the platform is suspect. If you ordered it configured exactly like this from a reputable OEM, then you should take these results to them and ask for resolution. That isn’t something we can do here, nor can NVIDIA fix your platform.

If this is a platform you built yourself, then I would suspect that you started with a motherboard setup that was not properly designed for this activity.

I don’t think I will be able to help further with a platform issue, but this topic may be of interest.