Different latency results for device-to-device memory copy with the SM and CE mechanisms

Hi everyone.
I used this project cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu at master · NVIDIA/cuda-samples · GitHub to measure device-to-device memory copy latency (with a single int4 element) and got different results for the SM (copyp2p) and CE (cudaMemcpyPeerAsync) mechanisms. Can someone help me understand why they take different amounts of time?
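For context, the CE measurement boils down to timing one tiny peer copy with events. A minimal sketch of that idea (simplified; the sample's actual harness repeats the copy many times and averages, and the device setup here is my own assumption):

// Minimal sketch of timing one 16-byte (int4) peer copy through the CE
// path. Simplified relative to the sample, which loops many times and
// averages; error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int4 *src = nullptr, *dst = nullptr;

  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 access GPU 1
  cudaMalloc((void **)&src, sizeof(int4));

  cudaSetDevice(1);
  cudaDeviceEnablePeerAccess(0, 0);   // let GPU 1 access GPU 0
  cudaMalloc((void **)&dst, sizeof(int4));

  cudaSetDevice(0);
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  // CE path: a Copy Engine (DMA unit) moves the data; no SM is involved.
  cudaMemcpyPeerAsync(dst, 1, src, 0, sizeof(int4));
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("one int4 peer copy: %.2f us\n", ms * 1000.0f);
  return 0;
}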
CE (latency in us):

GPU    0     1     2     3     4     5     6     7
  0  2.72  2.80  2.66  2.66  2.66  2.66  2.66  2.65
  1  3.07  2.39  2.66  2.66  2.66  2.66  2.66  2.66
  2  3.12  2.69  2.38  2.66  2.65  2.65  2.66  2.65
  3  2.99  2.67  2.74  2.38  2.66  2.66  2.66  2.66
  4  3.18  2.96  2.75  2.75  2.23  2.65  2.65  2.65
  5  3.02  2.74  2.75  2.74  2.83  2.21  2.75  2.75
  6  3.26  3.02  2.84  2.80  2.75  2.76  2.32  2.76
  7  3.14  2.76  2.76  2.76  2.76  2.76  2.84  2.33

SM (latency in us):

GPU    0     1     2     3     4     5     6     7
  0  2.88  4.50  4.39  4.38  4.39  4.39  4.38  4.39
  1  4.86  2.46  4.49  4.48  4.49  4.49  4.48  4.48
  2  4.66  4.39  2.35  4.38  4.37  4.38  4.38  4.36
  3  4.91  4.61  4.59  2.44  4.59  4.59  4.58  4.59
  4  4.76  4.34  4.33  4.33  2.21  4.33  4.34  4.33
  5  5.10  4.45  4.44  4.44  4.44  2.21  4.44  4.45
  6  4.96  4.46  4.46  4.46  4.46  4.46  2.23  4.45
  7  4.99  4.39  4.39  4.38  4.38  4.39  4.38  2.27

I hope to see your comments about device-to-device memcpy.
Thanks very much,

CE means it is using the Copy Engine with cudaMemcpyPeerAsync.

SM means it is using a normal kernel for copying.
The kernel has a comment:

// This kernel is for demonstration purposes only, not a performant kernel for
// p2p transfers.

So it was not meant to be fast.
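For reference, that kernel is essentially a grid-stride loop doing LD/ST through the SMs, along these lines (paraphrased from the sample; see the linked source for the exact code):

__global__ void copyp2p(int4 *__restrict__ dest, int4 const *__restrict__ src,
                        size_t num_elems) {
  // Grid-stride loop: every element is moved by explicit load/store
  // instructions executed on the SMs, unlike the CE/DMA path.
  size_t globalId = blockIdx.x * blockDim.x + threadIdx.x;
  size_t gridSize = blockDim.x * gridDim.x;
  for (size_t i = globalId; i < num_elems; i += gridSize) {
    dest[i] = src[i];
  }
}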
You could use Nsight Compute to find out what the bottleneck is.

Dear Curefab,
Thanks for your quick reply.
Yes, you're right, CE is indeed faster, but I don't understand the reason. Is the latency lower because CE uses DMA rather than the SM's LD/ST instructions, or is it because of the NVLink packet size?
The reason I paid attention to the SM copy is that I was studying NCCL communication performance and found that NCCL kernels use the SM mechanism.

I am not sure why CE is faster.

The likely explanation is that the kernel was not set up in an optimized way.
Perhaps it should use 128-bit instead of 32-bit accesses, or something else. That is why I mentioned checking with Nsight Compute.
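To illustrate the kind of difference I mean (a hypothetical comparison, not code from the sample):

// Hypothetical comparison, not taken from the sample: the same copy
// expressed with 32-bit versus 128-bit accesses. The wide version issues
// a quarter of the load/store instructions for the same byte count.
__global__ void copy32(int *dst, const int *src, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) dst[i] = src[i];    // one 32-bit load + one 32-bit store
}

__global__ void copy128(int4 *dst, const int4 *src, size_t n4) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n4) dst[i] = src[i];   // one 128-bit load + one 128-bit store
}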
