Jetson TK1: device-to-device memory copy performance


I tested device-to-device memory bandwidth on the Jetson TK1 by calling cudaMemcpy(x, x, x, cudaMemcpyDeviceToDevice) 100 times for each buffer size, and the results are quite interesting (theoretical peak bandwidth is ~17GB/s):
size=1MB bandwidth=1.168792GB/s
size=2MB bandwidth=1.089654GB/s
size=4MB bandwidth=1.593485GB/s
size=8MB bandwidth=2.786712GB/s
size=16MB bandwidth=2.833018GB/s
size=32MB bandwidth=2.849747GB/s
size=64MB bandwidth=10.942277GB/s
size=128MB bandwidth=9.162937GB/s
size=256MB bandwidth=7.034747GB/s

Bandwidth is very low for copies smaller than 64MB, and there is a huge jump between 32MB (about 2.8GB/s) and 64MB (about 10.9GB/s). Bandwidth is also not a monotonically increasing function of copy size: it drops from 1MB to 2MB, and again from 64MB to 128MB and 256MB.
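For anyone who wants to reproduce this, here is a minimal sketch of the kind of benchmark described above. It is not the original test program. The helper name d2dBandwidthGBs, the warm-up copy, the 2^30-byte definition of a GB, and the choice to count each copy as one read plus one write (2 * bytes) are all assumptions, so absolute numbers may not match the ones posted here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical helper: time `iters` device-to-device copies of `bytes` bytes
// and return the effective bandwidth in GB/s (1 GB = 2^30 bytes here).
// Assumption: each copy counts as one read plus one write, i.e. 2 * bytes.
double d2dBandwidthGBs(size_t bytes, int iters)
{
    void *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(src, 1, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time setup cost is not included in the timing.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaDeviceSynchronize());

    // Events in the default stream bracket all timed copies.
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));

    const double seconds = ms * 1e-3;
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return (2.0 * (double)bytes * iters) / seconds / gib;
}

int main()
{
    // Sweep 1MB to 256MB in powers of two, 100 copies per size.
    for (size_t mb = 1; mb <= 256; mb *= 2) {
        printf("size=%zuMB bandwidth=%fGB/s\n", mb,
               d2dBandwidthGBs(mb << 20, 100));
    }
    return 0;
}
```

This should build with something like nvcc -O2 bench.cu -o bench (the file name is arbitrary). If your own benchmark counts only bytes copied instead of read plus write traffic, drop the factor of 2.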

I ran the same benchmark on a Titan Black (peak = ~336GB/s):
size=1MB bandwidth=145.536636GB/s
size=2MB bandwidth=177.412308GB/s
size=4MB bandwidth=197.868988GB/s
size=8MB bandwidth=209.278931GB/s
size=16MB bandwidth=215.464447GB/s
size=32MB bandwidth=218.609253GB/s
size=64MB bandwidth=220.606720GB/s
size=128MB bandwidth=221.471878GB/s
size=256MB bandwidth=222.568848GB/s

Why does the Tegra K1 exhibit such strange performance, whereas the Titan Black does not?

Thanks in advance.

Are you running with max CPU, GPU and EMC clocks?

Yeah, setting the GPU clock manually solves the problem, thanks :)

What results did you get with max clocks? EMC is the memory clock, so maxing that out too might make sense (although maxing out the GPU clock might implicitly require the max EMC clock anyway…).

Surprisingly, you may be able to write a copy kernel that achieves higher memory bandwidth than the cudaMemcpy device-to-device API call.

This was at least true in the past.
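As a rough sketch of what such a kernel could look like, the code below copies a buffer with a grid-stride loop that moves 16 bytes (one int4) per thread per step. The names copyKernel and kernelCopyBandwidthGBs, the launch configuration (256 threads per block, 8 blocks per multiprocessor), and the read-plus-write bandwidth accounting are assumptions chosen for illustration. Whether it beats cudaMemcpy on a given device and driver is something you would have to measure.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Grid-stride copy kernel: each thread moves one int4 (16 bytes) per step.
__global__ void copyKernel(int4 *dst, const int4 *src, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += stride) {
        dst[i] = src[i];
    }
}

// Hypothetical helper: copy `bytes` bytes (a nonzero multiple of 16) `iters`
// times with copyKernel and return bandwidth in GB/s (2^30 bytes per GB,
// read plus write counted). Returns -1.0 if the copied data does not match.
double kernelCopyBandwidthGBs(size_t bytes, int iters)
{
    const size_t n = bytes / sizeof(int4);
    int4 *src = nullptr, *dst = nullptr;
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMalloc(&dst, bytes));
    CHECK(cudaMemset(src, 0x5A, bytes));
    CHECK(cudaMemset(dst, 0, bytes));

    // Assumed launch configuration; treat it as a starting point for tuning.
    int smCount = 0;
    CHECK(cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, 0));
    const int threads = 256;
    const int blocks = 8 * smCount;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up launch, also used to catch launch errors early.
    copyKernel<<<blocks, threads>>>(dst, src, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) {
        copyKernel<<<blocks, threads>>>(dst, src, n);
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    // Spot-check the last 16 bytes of the destination on the host.
    unsigned char tail[16];
    CHECK(cudaMemcpy(tail, (const char *)dst + bytes - 16, 16,
                     cudaMemcpyDeviceToHost));
    bool ok = true;
    for (int i = 0; i < 16; ++i) ok = ok && (tail[i] == 0x5A);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));

    if (!ok) return -1.0;
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return (2.0 * (double)bytes * iters) / (ms * 1e-3) / gib;
}

int main()
{
    for (size_t mb = 1; mb <= 256; mb *= 2) {
        printf("size=%zuMB kernel bandwidth=%fGB/s\n", mb,
               kernelCopyBandwidthGBs(mb << 20, 100));
    }
    return 0;
}
```

Running this next to a cudaMemcpy timing loop at the same clocks gives a like-for-like comparison, as long as both use the same bandwidth accounting.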