Memcpy performance on Jetson AGX ORIN

Hello,

I am seeing lower performance in my modules compared to before.

So I ran a memcpy benchmark.

The results are below.

(https://gist.githubusercontent.com/lemire/a0cfdafda7448b98fd66f47086ac65f3/raw/2b460ea4cc26597c9c54d3182e689168ce056c7e/copybenchmark.c )

Build: gcc -o copybenchmark copybenchmark.c
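For readers without the gist handy, here is a minimal sketch of that kind of comparison: a NEON vst1q_u32 copy loop timed against memcpy() using getrusage(), as the linked benchmark does. The buffer size, loop structure, and names below are illustrative assumptions, not the gist's actual code, and it only builds on aarch64.

  /* minimal sketch (aarch64 only), not the gist itself */
  #include <arm_neon.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>

  #define N (500 * 1000 * 1000 / 4) /* ~500 MB worth of uint32_t */

  /* user CPU time in seconds, as the gist measures via getrusage() */
  static double user_seconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1000000.0;
  }

  int main(void) {
    uint32_t *src = malloc(N * sizeof(uint32_t));
    uint32_t *dst = malloc(N * sizeof(uint32_t));
    for (size_t i = 0; i < N; i++) src[i] = (uint32_t)i;

    /* NEON intrinsic copy: 4 uints per vst1q_u32 store */
    double start = user_seconds();
    for (size_t i = 0; i + 4 <= N; i += 4)
      vst1q_u32(dst + i, vld1q_u32(src + i));
    double t = user_seconds() - start;
    printf("time = %f  %f millions of uints/sec\n", t, N / (t * 1e6));

    /* plain memcpy of the same buffer */
    start = user_seconds();
    memcpy(dst, src, N * sizeof(uint32_t));
    double t2 = user_seconds() - start;
    printf("[memcpy] time = %f  %f millions of uints/sec\n", t2, N / (t2 * 1e6));

    free(src);
    free(dst);
    return 0;
  }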

Xavier:
root@zerofly-desktop:/mnt/nvidia/benchmark# ./copybenchmark
copying 1953 MB
time = 0.084000 1190.476207 millions of uints/sec
[memcpy] time = 0.084000 3124.999852 millions of uints/sec
root@zerofly-desktop:/mnt/nvidia/benchmark# ./copybenchmark
copying 1953 MB
time = 0.084000 1190.476207 millions of uints/sec
[memcpy] time = 0.084000 3571.428461 millions of uints/sec

Orin:
root@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark# ./copybenchmark
copying 1953 MB
time = 0.166181 601.753517 millions of uints/sec
[memcpy] time = 0.166181 3105.493661 millions of uints/sec
root@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark# ./copybenchmark
copying 1953 MB
time = 0.162941 613.719099 millions of uints/sec
[memcpy] time = 0.162941 3099.237739 millions of uints/sec

[Condition]

  • MAXN
  • jetson_clocks
  • Jetson AGX Orin (JetPack 5.0 DP)
  • cat /etc/nv_tegra_release

R34 (release), REVISION: 1.1, GCID: 30414990, BOARD: t186ref, EABI: aarch64, DATE: Tue May 17 04:20:55 UTC 2022

Is this normal in the current version (DP)?

Hi,
It is possible you are seeing an issue on 5.0.1 DP. Thanks for reporting it; we will check this further.


Hi,
There looks to be a typo in this print:

  float t2 = (after.ru_utime.tv_usec - before.ru_utime.tv_usec) / 1000000.0;
  /* the printed time here is t (the loop-copy timing) rather than t2 */
  printf("[memcpy] time = %f  %f millions of uints/sec\n", t,
         N * NTrials / (t2 * 1000.0 * 1000.0));

It should print t2 instead of t. Please help check and confirm this.
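For reference, and assuming the variable names from the snippet above, the corrected print would be:

  printf("[memcpy] time = %f  %f millions of uints/sec\n", t2,
         N * NTrials / (t2 * 1000.0 * 1000.0));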

Also, please execute sudo jetson_clocks before profiling.

Hi,

You are right, there is a bug in that open-source code.
I have fixed it.
The results after the fix are below.

Xavier:
espresso@zerofly-desktop:/mnt/nvidia/benchmark$ sudo jetson_clocks
espresso@zerofly-desktop:/mnt/nvidia/benchmark$ ./copybenchmark
copying 1953 MB
time = 0.080000 1250.000028 millions of uints/sec
[memcpy] time = 0.032000 3124.999852 millions of uints/sec
espresso@zerofly-desktop:/mnt/nvidia/benchmark

Orin:
espresso@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark$ sudo jetson_clocks
espresso@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark$ ./copybenchmark
copying 1953 MB
time = 0.163294 612.392363 millions of uints/sec
[memcpy] time = 0.024657 4055.643451 millions of uints/sec
espresso@espresso-desktop:/mnt/nvidia/benchmark/copybenchmark$

So Orin’s NEON copy path (vst1q_u32) is slower than Xavier’s. Right?

Thanks.

Hi,
Yes, memcpy() is faster on Orin and vst1q_u32() is slower. We will check with our teams to clarify whether this is expected.

Hi,
Internally we have checked the performance of memcpy(), and it is expected that Orin has better performance. We did not check vst1q_u32() and don’t have much experience using it. It seems to be an experimental function per
vst1q_u32 in core::arch::arm - Rust

If calling memcpy() is fine in your use case, we would suggest calling it instead of vst1q_u32().
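As a rough illustration of that suggestion (buffer names and size below are placeholders, not from the original benchmark), the NEON loop could be replaced like this:

  /* NEON intrinsic copy, slower on Orin per the numbers above */
  for (size_t i = 0; i + 4 <= N; i += 4)
    vst1q_u32(dst + i, vld1q_u32(src + i));

  /* suggested alternative: let memcpy() pick the optimized copy routine */
  memcpy(dst, src, N * sizeof(uint32_t));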


Hi,

Okay, I got it.
Thanks.
