Jetson TK1: device to device memory copy performance

weiming · May 28, 2015, 10:26am

Hi,

I’ve tested device-to-device memory bandwidth on the TK1 by calling cudaMemcpy(x, x, x, cudaMemcpyDeviceToDevice) 100 times, and the results are quite interesting (theoretical peak bandwidth is ~17GB/s):
size=1MB bandwidth=1.168792GB/s
size=2MB bandwidth=1.089654GB/s
size=4MB bandwidth=1.593485GB/s
size=8MB bandwidth=2.786712GB/s
size=16MB bandwidth=2.833018GB/s
size=32MB bandwidth=2.849747GB/s
size=64MB bandwidth=10.942277GB/s
size=128MB bandwidth=9.162937GB/s
size=256MB bandwidth=7.034747GB/s

The memory bandwidth is very low (under 3GB/s) when the copy size is smaller than 64MB, and there is a huge performance gap between size=32MB (2.85GB/s) and size=64MB (10.94GB/s). Besides, the bandwidth is not even a monotonically increasing function of copy size: it drops from 1MB to 2MB and again at 128MB and 256MB.
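For anyone who wants to reproduce this, here is a minimal sketch of this kind of benchmark. The original code was not posted, so everything here is an assumption: the helper names `d2d_bandwidth_gbps` and `run_sweep` are made up, timing uses CUDA events, and each copy is counted as one read plus one write (2 × size bytes, with 1GB = 1e9 bytes), which may not match how the numbers above were computed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical helper: average bandwidth in GB/s over `iters`
// device-to-device copies of `bytes` bytes.
// Assumption: traffic per copy = 2 * bytes (read + write), 1 GB = 1e9 bytes.
double d2d_bandwidth_gbps(size_t bytes, int iters)
{
    void *src = NULL, *dst = NULL;
    CUDA_CHECK(cudaMalloc(&src, bytes));
    CUDA_CHECK(cudaMalloc(&dst, bytes));
    CUDA_CHECK(cudaMemset(src, 1, bytes));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // One untimed warm-up copy so one-time setup costs are not measured.
    CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CUDA_CHECK(cudaDeviceSynchronize());

    // Time all copies together using events on the default stream.
    CUDA_CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i)
        CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(src));
    CUDA_CHECK(cudaFree(dst));

    return (2.0 * (double)bytes * iters) / (ms * 1e-3) / 1e9;
}

// Sweep the same sizes as in the post: 1MB to 256MB, 100 copies each.
void run_sweep()
{
    for (size_t mb = 1; mb <= 256; mb *= 2)
        printf("size=%zuMB bandwidth=%fGB/s\n", mb,
               d2d_bandwidth_gbps(mb << 20, 100));
}
```

Calling `run_sweep()` from `main()` and compiling with `nvcc` prints one line per size. The absolute numbers depend on the board's clock settings, so treat them as relative measurements only.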

I’ve run the same benchmark on Titan Black (peak = ~336GB/s):
size=1MB bandwidth=145.536636GB/s
size=2MB bandwidth=177.412308GB/s
size=4MB bandwidth=197.868988GB/s
size=8MB bandwidth=209.278931GB/s
size=16MB bandwidth=215.464447GB/s
size=32MB bandwidth=218.609253GB/s
size=64MB bandwidth=220.606720GB/s
size=128MB bandwidth=221.471878GB/s
size=256MB bandwidth=222.568848GB/s

Why does Tegra K1 exhibit such strange performance, whereas Titan Black does not?

Thanks in advance.

kulve · May 28, 2015, 11:56am

Are you running with max CPU, GPU and EMC clocks?

http://elinux.org/Jetson/Performance

weiming · May 29, 2015, 9:18am

Yeah, setting the GPU clock manually solves the problem, thanks :)

kulve · May 29, 2015, 9:34am

What results did you get with max clocks? EMC is the memory clock so maxing that out too might make sense (although maxing out GPU clocks might need max EMC clock implicitly anyway…).

Jimmy_Pettersson · May 29, 2015, 8:29pm

Surprisingly, you may be able to write kernels that utilize the GPU memory bandwidth better than the cudaMemcpy D2D API call does.

This was at least true in the past.
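As a rough illustration of what such a kernel could look like, here is a minimal sketch of a grid-stride copy using 16-byte `int4` loads and stores. The names `copy_kernel` and `device_copy`, the block size of 256, and the cap of 1024 blocks are illustrative assumptions, not tuned values. Whether this beats cudaMemcpy on a given board and clock setting has to be measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride copy: each thread moves 16-byte int4 elements, so
// neighbouring threads access neighbouring addresses (coalesced).
__global__ void copy_kernel(const int4 *__restrict__ src,
                            int4 *__restrict__ dst, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        dst[i] = src[i];
}

// Hypothetical launcher. Assumptions: both pointers are 16-byte aligned
// (true for pointers returned by cudaMalloc) and `bytes` is a multiple
// of 16. The launch is asynchronous; synchronize before timing it or
// before reading the result on the host.
cudaError_t device_copy(void *dst, const void *src, size_t bytes)
{
    if (bytes % sizeof(int4) != 0)
        return cudaErrorInvalidValue;
    size_t n = bytes / sizeof(int4);
    if (n == 0)
        return cudaSuccess;

    const int threads = 256;                        // illustrative choice
    size_t wanted = (n + threads - 1) / threads;
    int blocks = (int)(wanted < 1024 ? wanted : 1024);  // illustrative cap

    copy_kernel<<<blocks, threads>>>(static_cast<const int4 *>(src),
                                     static_cast<int4 *>(dst), n);
    return cudaGetLastError();
}
```

To compare it against cudaMemcpy, time `device_copy` the same way as the API call (for example with CUDA events around a loop of launches) and use the same byte-counting convention for both.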
