Slow write speed of Xavier

Hi,

The tech specs state a memory bandwidth of 137 GB/s, but my application writes to memory much more slowly: it took 5 ms to write 524288 bytes into a std::vector. Could anyone advise on how to increase the write speed from the CPU to memory?

Thanks so much!

Hi user_jay,

Have you maximized the device performance first?

sudo jetson_clocks

Thanks

Hi kayccc,

Tried that but I still get the same results. Thanks.

What does “write 524288 bytes of std::vector” mean?
Did the vector re-allocate?
Did you write streaming data or random access?
Do you understand in general how a computer RAM subsystem is measured and rated? (cache hierarchy, DRAM latencies, bus usage, lazy allocation, virtual memory, and so forth?)

How fast does this program write for you?

#include <vector>
#include <string.h>

int main() {
  std::vector<char> x;
  x.resize(1024*1024*256);
  // make sure linux VM manager isn't using lazy allocation
  memset(&x[0], 0xff, 1024*1024*256);

  // start timing here ->>
  //

  memset(&x[0], 0, 1024*1024*256);

  //
  // <<- stop timing here

  return 0;
}

Hi snarky,

What does “write 524288 bytes of std::vector” mean?

I meant a std::vector of size 1024*256. Sorry I didn’t make myself clear.

Did the vector re-allocate?

No

Did you write streaming data or random access?

Streaming data

I’ve tried the program you proposed in nvpmodel modes 0 and 2 (with sudo jetson_clocks), and the times taken are 10-15k ns and 25-30k ns respectively. Is there any way to write faster than nvpmodel mode 0 allows? My application requires a higher write speed.

Also, is it safe to leave the AGX Xavier on nvpmodel mode 0 & max freq permanently?

Thanks so much :)

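10 microseconds for writing 256 MB would work out to about 25 terabytes per second, which is more than the hardware can do, so I assume the unit is off and you measured 10-15 milliseconds. 10 milliseconds is alright for writing 256 MB. Multiply by 100 to get memory throughput per second. Seems like one core can write 25 gigabytes per second, according to that simple benchmark.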
Yes, it’s safe to use mode 0 at max frequency all the time. The fan will turn on when it’s warm, and the CPU will shut down if it somehow gets too hot (which has never happened to me).

Also, my desktop Threadripper only has 75 GB/s throughput to its memory subsystem, so getting to one-third of that with a few watts of power and an ARM core seems alright to me. A lot of the memory bandwidth is there to support the GPU, and I imagine you might be able to run more than one core flat-out to get more memory bandwidth out of the CPU (but I don’t know how wide the CPU memory interface is, separate from the GPU).