I tested device-to-device memory bandwidth on a Tegra K1 (TK1) by calling cudaMemcpy(x, x, x, cudaMemcpyDeviceToDevice) 100 times for each buffer size, and the results are puzzling (theoretical peak bandwidth is ~17GB/s):
size=1MB bandwidth=1.168792GB/s
size=2MB bandwidth=1.089654GB/s
size=4MB bandwidth=1.593485GB/s
size=8MB bandwidth=2.786712GB/s
size=16MB bandwidth=2.833018GB/s
size=32MB bandwidth=2.849747GB/s
size=64MB bandwidth=10.942277GB/s
size=128MB bandwidth=9.162937GB/s
size=256MB bandwidth=7.034747GB/s
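The bandwidth is very low for buffers smaller than 64MB: it never exceeds ~2.85GB/s, which is about 17% of the theoretical peak. There is also a huge jump between size=32MB and size=64MB, where the bandwidth increases almost fourfold. Besides, the bandwidth is not even a monotonically increasing function of buffer size: it drops from 1MB to 2MB, and again from 64MB to 128MB and 256MB.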
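For anyone who wants to reproduce this, a measurement of this kind could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the code that produced the numbers above: the CUDA event timing, the single warm-up copy, the `CHECK` macro, and the choice to count every copied byte twice (one read plus one write) are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    const int kRepeats = 100;  // number of timed copies per size, as in the post

    for (size_t mb = 1; mb <= 256; mb *= 2) {
        const size_t bytes = mb << 20;
        void *src = nullptr, *dst = nullptr;
        CHECK(cudaMalloc(&src, bytes));
        CHECK(cudaMalloc(&dst, bytes));

        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));

        // Assumption: one untimed warm-up copy so one-time setup costs
        // do not land inside the timed region.
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));

        CHECK(cudaEventRecord(start));
        for (int i = 0; i < kRepeats; ++i) {
            CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
        }
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));  // wait until all copies have finished

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));

        // Assumption: each copy reads and writes `bytes`, so the memory
        // traffic is 2 * bytes per copy. Drop the factor of 2 to report
        // copied bytes per second instead.
        const double gbps = 2.0 * bytes * kRepeats / (ms * 1e-3) / 1e9;
        printf("size=%zuMB bandwidth=%fGB/s\n", mb, gbps);

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
        CHECK(cudaFree(src));
        CHECK(cudaFree(dst));
    }
    return 0;
}
```

Whether the factor of 2 is applied matters when comparing against a quoted peak figure, so it is worth stating which convention a given set of numbers uses.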
I also ran the same benchmark on a Titan Black (peak ~336GB/s), where the bandwidth rises smoothly with buffer size:
size=1MB bandwidth=145.536636GB/s
size=2MB bandwidth=177.412308GB/s
size=4MB bandwidth=197.868988GB/s
size=8MB bandwidth=209.278931GB/s
size=16MB bandwidth=215.464447GB/s
size=32MB bandwidth=218.609253GB/s
size=64MB bandwidth=220.606720GB/s
size=128MB bandwidth=221.471878GB/s
size=256MB bandwidth=222.568848GB/s
Why does the Tegra K1 exhibit such strange behavior, whereas the Titan Black does not?
Thanks in advance.