Hi everyone.
I used the p2pBandwidthLatencyTest sample from NVIDIA/cuda-samples (Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest.cu on master) to measure device-to-device memory copy latency for a single int4 element. I get different results for the SM mechanism (the copyp2p kernel) and the CE mechanism (cudaMemcpyPeerAsync). Can someone help explain why the two mechanisms show different latencies?
CE (latency in us):
GPU | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 2.72 | 2.8 | 2.66 | 2.66 | 2.66 | 2.66 | 2.66 | 2.65 |
1 | 3.07 | 2.39 | 2.66 | 2.66 | 2.66 | 2.66 | 2.66 | 2.66 |
2 | 3.12 | 2.69 | 2.38 | 2.66 | 2.65 | 2.65 | 2.66 | 2.65 |
3 | 2.99 | 2.67 | 2.74 | 2.38 | 2.66 | 2.66 | 2.66 | 2.66 |
4 | 3.18 | 2.96 | 2.75 | 2.75 | 2.23 | 2.65 | 2.65 | 2.65 |
5 | 3.02 | 2.74 | 2.75 | 2.74 | 2.83 | 2.21 | 2.75 | 2.75 |
6 | 3.26 | 3.02 | 2.84 | 2.8 | 2.75 | 2.76 | 2.32 | 2.76 |
7 | 3.14 | 2.76 | 2.76 | 2.76 | 2.76 | 2.76 | 2.84 | 2.33 |
SM (latency in us):
GPU | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 2.88 | 4.5 | 4.39 | 4.38 | 4.39 | 4.39 | 4.38 | 4.39 |
1 | 4.86 | 2.46 | 4.49 | 4.48 | 4.49 | 4.49 | 4.48 | 4.48 |
2 | 4.66 | 4.39 | 2.35 | 4.38 | 4.37 | 4.38 | 4.38 | 4.36 |
3 | 4.91 | 4.61 | 4.59 | 2.44 | 4.59 | 4.59 | 4.58 | 4.59 |
4 | 4.76 | 4.34 | 4.33 | 4.33 | 2.21 | 4.33 | 4.34 | 4.33 |
5 | 5.1 | 4.45 | 4.44 | 4.44 | 4.44 | 2.21 | 4.44 | 4.45 |
6 | 4.96 | 4.46 | 4.46 | 4.46 | 4.46 | 4.46 | 2.23 | 4.45 |
7 | 4.99 | 4.39 | 4.39 | 4.38 | 4.38 | 4.39 | 4.38 | 2.27 |
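
To put a number on the gap, the two matrices above can be summarized by averaging the diagonal (same-GPU) entries separately from the off-diagonal (GPU-to-GPU) entries. The script below is only a sketch: the values are copied from the tables, the unit is assumed to be microseconds as the sample prints it, and the helper name `split_means` is made up for illustration.

```python
# Latency matrices copied from the CE and SM tables above.
# Assumption: values are in microseconds, as printed by the sample.
CE = [
    [2.72, 2.80, 2.66, 2.66, 2.66, 2.66, 2.66, 2.65],
    [3.07, 2.39, 2.66, 2.66, 2.66, 2.66, 2.66, 2.66],
    [3.12, 2.69, 2.38, 2.66, 2.65, 2.65, 2.66, 2.65],
    [2.99, 2.67, 2.74, 2.38, 2.66, 2.66, 2.66, 2.66],
    [3.18, 2.96, 2.75, 2.75, 2.23, 2.65, 2.65, 2.65],
    [3.02, 2.74, 2.75, 2.74, 2.83, 2.21, 2.75, 2.75],
    [3.26, 3.02, 2.84, 2.80, 2.75, 2.76, 2.32, 2.76],
    [3.14, 2.76, 2.76, 2.76, 2.76, 2.76, 2.84, 2.33],
]
SM = [
    [2.88, 4.50, 4.39, 4.38, 4.39, 4.39, 4.38, 4.39],
    [4.86, 2.46, 4.49, 4.48, 4.49, 4.49, 4.48, 4.48],
    [4.66, 4.39, 2.35, 4.38, 4.37, 4.38, 4.38, 4.36],
    [4.91, 4.61, 4.59, 2.44, 4.59, 4.59, 4.58, 4.59],
    [4.76, 4.34, 4.33, 4.33, 2.21, 4.33, 4.34, 4.33],
    [5.10, 4.45, 4.44, 4.44, 4.44, 2.21, 4.44, 4.45],
    [4.96, 4.46, 4.46, 4.46, 4.46, 4.46, 2.23, 4.45],
    [4.99, 4.39, 4.39, 4.38, 4.38, 4.39, 4.38, 2.27],
]


def split_means(matrix):
    """Return (mean of diagonal entries, mean of off-diagonal entries)."""
    n = len(matrix)
    diag = [matrix[i][i] for i in range(n)]
    off = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diag) / len(diag), sum(off) / len(off)


ce_diag, ce_off = split_means(CE)
sm_diag, sm_off = split_means(SM)

print(f"CE: same-GPU {ce_diag:.2f}, GPU-to-GPU {ce_off:.2f}")
print(f"SM: same-GPU {sm_diag:.2f}, GPU-to-GPU {sm_off:.2f}")
print(f"SM minus CE on GPU-to-GPU copies: {sm_off - ce_off:.2f}")
```

On these numbers the same-GPU latencies are nearly identical for both mechanisms (roughly 2.4), while the GPU-to-GPU entries average roughly 2.8 for CE and 4.5 for SM. So the difference only shows up when the copy crosses GPUs.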
I'd appreciate any insight into what causes this difference in device-to-device memcpy latency.
Thanks very much.