Memcpy performance on Jetson AGX ORIN

Hello,

I am seeing lower performance in my modules compared to before.

So I ran a memcpy benchmark.

The results are below.

(https://gist.githubusercontent.com/lemire/a0cfdafda7448b98fd66f47086ac65f3/raw/2b460ea4cc26597c9c54d3182e689168ce056c7e/copybenchmark.c )

Build: gcc -o copybenchmark copybenchmark.c
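For readers without the gist handy, here is a minimal sketch of that kind of comparison: a NEON vst1q_u32 copy loop timed against memcpy() using getrusage(), as the linked benchmark does. The buffer size, loop structure, and names below are illustrative assumptions, not the gist's actual code, and it only builds on aarch64.

  /* minimal sketch (aarch64 only), not the gist itself */
  #include <arm_neon.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>

  #define N (500 * 1000 * 1000 / 4) /* ~500 MB worth of uint32_t */

  /* user CPU time in seconds, as the gist measures via getrusage() */
  static double user_seconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1000000.0;
  }

  int main(void) {
    uint32_t *src = malloc(N * sizeof(uint32_t));
    uint32_t *dst = malloc(N * sizeof(uint32_t));
    for (size_t i = 0; i < N; i++) src[i] = (uint32_t)i;

    /* NEON intrinsic copy: 4 uints per vst1q_u32 store */
    double start = user_seconds();
    for (size_t i = 0; i + 4 <= N; i += 4)
      vst1q_u32(dst + i, vld1q_u32(src + i));
    double t = user_seconds() - start;
    printf("time = %f  %f millions of uints/sec\n", t, N / (t * 1e6));

    /* plain memcpy of the same buffer */
    start = user_seconds();
    memcpy(dst, src, N * sizeof(uint32_t));
    double t2 = user_seconds() - start;
    printf("[memcpy] time = %f  %f millions of uints/sec\n", t2, N / (t2 * 1e6));

    free(src);
    free(dst);
    return 0;
  }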

Xavier:
root@zerofly-desktop:/mnt/nvidia/benchmark# ./copybenchmark
copying 1953 MB
time = 0.084000 1190.476207 millions of uints/sec
[memcpy] time = 0.084000 3124.999852 millions of uints/sec
root@zerofly-desktop:/mnt/nvidia/benchmark# ./copybenchmark
copying 1953 MB
time = 0.084000 1190.476207 millions of uints/sec
[memcpy] time = 0.084000 3571.428461 millions of uints/sec

Orin:
root@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark# ./copybenchmark
copying 1953 MB
time = 0.166181 601.753517 millions of uints/sec
[memcpy] time = 0.166181 3105.493661 millions of uints/sec
root@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark# ./copybenchmark
copying 1953 MB
time = 0.162941 613.719099 millions of uints/sec
[memcpy] time = 0.162941 3099.237739 millions of uints/sec

[Condition]

  • MAXN
  • jetson_clocks
  • Jetson AGX Orin (JetPack 5.0 DP)
  • cat /etc/nv_tegra_release

R34 (release), REVISION: 1.1, GCID: 30414990, BOARD: t186ref, EABI: aarch64, DATE: Tue May 17 04:20:55 UTC 2022

Is this normal in the current version (DP)?

Hi,
It is possible you are seeing an issue on 5.0.1 DP. Thanks for reporting it; we will check this further.


Hi,
There looks to be a typo in this print:

  float t2 = (after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1000000.0;
  /* the printed time here is t (the loop-copy timing) rather than t2 */
  printf("[memcpy] time = %f  %f millions of uints/sec\n", t,
         N * NTrials / (t2 * 1000.0 * 1000.0));

It should print t2 instead of t. Please help check and confirm this.
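For reference, and assuming the variable names from the snippet above, the corrected print would be:

  printf("[memcpy] time = %f  %f millions of uints/sec\n", t2,
         N * NTrials / (t2 * 1000.0 * 1000.0));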

Also, please execute sudo jetson_clocks before profiling.

Hi,

You are right, there is a bug in that open-source code.
I have fixed it.
The results after the fix are below.

Xavier:
espresso@zerofly-desktop:/mnt/nvidia/benchmark$ sudo jetson_clocks
espresso@zerofly-desktop:/mnt/nvidia/benchmark$ ./copybenchmark
copying 1953 MB
time = 0.080000 1250.000028 millions of uints/sec
[memcpy] time = 0.032000 3124.999852 millions of uints/sec
espresso@zerofly-desktop:/mnt/nvidia/benchmark

Orin:
espresso@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark$ sudo jetson_clocks
espresso@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark$ ./copybenchmark
copying 1953 MB
time = 0.163294 612.392363 millions of uints/sec
[memcpy] time = 0.024657 4055.643451 millions of uints/sec
espresso@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark$

So Orin’s NEON copy path (vst1q_u32) is slower than Xavier’s. Right?

Thanks.

Hi,
Yes, memcpy() is faster on Orin and vst1q_u32() is slower. We will check with our teams to clarify whether this is expected.

Hi,
Internally we have checked the performance of memcpy(), and it is expected that Orin has better performance. We did not check vst1q_u32() and don’t have much experience using it. It seems to be an experimental function per
vst1q_u32 in core::arch::arm - Rust

If calling memcpy() is fine in your use case, we would suggest calling it instead of vst1q_u32().
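As a rough illustration of that suggestion (buffer names and size below are placeholders, not from the original benchmark), the NEON loop could be replaced like this:

  /* NEON intrinsic copy, slower on Orin per the numbers above */
  for (size_t i = 0; i + 4 <= N; i += 4)
    vst1q_u32(dst + i, vld1q_u32(src + i));

  /* suggested alternative: let memcpy() pick the optimized copy routine */
  memcpy(dst, src, N * sizeof(uint32_t));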


Hi,

Okay, I got it.
Thanks.
