Dear NVIDIA dev team,
The memory controller section (page 12) of the TX2 datasheet v0.91 says that the memory controller delivers up to 59.7 GB/s peak bandwidth.
Could you tell me how to achieve this bandwidth?
No matter which method I use, the most I can get is about 23 GB/s (23676.94 MB/sec).
sysbench --test=memory --memory-block-size=8M --memory-total-size=128G --num-threads=6 run
Before running, please apply the settings below:
- Set the CPUs to max frequency
- Set the max EMC rate
- Disable Tegra CPU Quiet and set the current governor to runnable
- Enable cores 2, 3 and 4 if they are not already enabled
Once these are set, please run the test again. Thanks!
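To confirm that the CPU-side settings above took effect before rerunning, a read-only check along these lines may help. It is a minimal sketch that only reads the generic Linux cpufreq sysfs files (online, scaling_governor, scaling_cur_freq, cpuinfo_max_freq). SYSFS_CPU_ROOT and report_cpus are names made up for this example, and the EMC rate and Tegra CPU Quiet nodes are board- and release-specific, so they are left out.

```shell
#!/bin/sh
# Read-only check of per-core state via the generic Linux cpufreq sysfs files.
# SYSFS_CPU_ROOT is a hypothetical override so the script can be tried on a copy.
SYSFS_CPU_ROOT="${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}"

# Print one line per core: name, online flag, governor, current/max freq (kHz).
report_cpus() {
    for cpu in "$SYSFS_CPU_ROOT"/cpu[0-9]*; do
        [ -d "$cpu" ] || continue
        name=$(basename "$cpu")
        # cpu0 usually has no "online" file because it cannot be offlined.
        online=1
        [ -r "$cpu/online" ] && online=$(cat "$cpu/online")
        # An offline core has no cpufreq directory, so fall back to "n/a".
        gov=$(cat "$cpu/cpufreq/scaling_governor" 2>/dev/null || echo "n/a")
        cur=$(cat "$cpu/cpufreq/scaling_cur_freq" 2>/dev/null || echo "n/a")
        max=$(cat "$cpu/cpufreq/cpuinfo_max_freq" 2>/dev/null || echo "n/a")
        echo "$name online=$online governor=$gov cur_khz=$cur max_khz=$max"
    done
}

report_cpus
```

If a core shows online=0, or cur_khz well below max_khz, the benchmark was probably not running with the requested settings.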
I got this test result by running:
sudo nvpmodel -m 0
Is there anything else I missed?
We will check this issue internally and update you.
How does your sysbench test measure the bandwidth? Is it doing memory I/O?
In fact, many other MC (memory controller) clients are still active. As a result, the test may underestimate the bandwidth.
sysbench is a standard benchmark tool; you can install it with apt-get.
Yes, basically it is doing memory I/O.
But I cannot accept your suggestion that the bandwidth is underestimated because other MC clients are active. sysbench lets me specify the number of threads, and no matter how many threads it runs, 23 GB/s (23676.94 MB/sec) is the best result I can get. That is less than half the performance the manual lists for the MC.
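For reference, the gap can be worked out directly. The sketch below assumes a 128-bit LPDDR4 interface clocked at 1866 MHz, taken from the module's published specs rather than from this thread, and takes the MB/sec figure reported by sysbench at face value.

```shell
#!/bin/sh
# Assumptions (not from this thread): 128-bit LPDDR4 bus, 1866 MHz clock, double data rate.
BUS_BITS=128
CLOCK_MHZ=1866
MEASURED_MBS=23676.94   # sysbench result quoted above

# Theoretical peak in MB/s: bytes per transfer * mega-transfers per second.
peak_mbs=$(( BUS_BITS / 8 * CLOCK_MHZ * 2 ))
# Measured result as a percentage of that peak.
pct=$(awk -v m="$MEASURED_MBS" -v p="$peak_mbs" 'BEGIN { printf "%.1f", m / p * 100 }')

echo "theoretical peak: ${peak_mbs} MB/s"
echo "measured: ${MEASURED_MBS} MB/s (${pct}% of peak)"
```

Under these assumptions the peak comes to 59712 MB/s, which lines up with the 59.7 GB/s in the datasheet, and the sysbench figure is roughly 40% of it.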
We also see a 23 GB/s result internally, but that is the RAM read/write speed rather than the “memory controller” bandwidth. I therefore suspect your result only reflects the RAM read/write speed.
Another internal tool, which measures the speed of the memory controller, gives results of up to 50 GB/s, which is near the spec.
Some other things to check:
The Jetson only has 8 GB of RAM, some of which is locked by the kernel and GPU, yet the command asks for --memory-total-size=128G. Does that matter, or does this option just tell the tool how long to run?
Separately, is the tool reporting bandwidth for a read+write operation, or just for a read (or write) operation?
Does the tool get more bandwidth with only a single thread, or with only two threads?
Is the sysbench tool optimized for the AArch64 architecture, or is it actually CPU-bound rather than memory-bound?
Just a note: when testing single-thread memory performance, I found that changing the memory block size from 8M to 1M can increase performance by up to 20%. As the number of threads increases, the block size matters less.
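The thread-count, block-size and read/write questions above could be checked with a sweep like the one below. It is only a sketch: it reuses the option names from the sysbench command earlier in this thread, adds sysbench's --memory-oper option (read or write), and by default only prints the commands. DRY_RUN, build_cmd, sweep and the chosen value ranges are made up for this example.

```shell
#!/bin/sh
# Sweep sysbench memory runs over operation, block size and thread count.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Build one sysbench command line. Args: threads, block size, operation.
build_cmd() {
    echo "sysbench --test=memory --memory-block-size=$2 --memory-total-size=128G --memory-oper=$3 --num-threads=$1 run"
}

sweep() {
    for oper in read write; do
        for block in 1M 8M; do
            for threads in 1 2 4 6; do
                cmd=$(build_cmd "$threads" "$block" "$oper")
                if [ "$DRY_RUN" = "1" ]; then
                    echo "$cmd"
                else
                    echo "== $cmd"
                    # Keep only the throughput line from sysbench's report.
                    $cmd | grep -i "transferred"
                fi
            done
        done
    done
}

sweep
```

Comparing the read and write rows at the same thread count shows whether the reported figure depends on the operation type, and comparing 1M against 8M at one thread would reproduce the block-size effect described above.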
May I please know the conclusion of the above discussion?
Were you able to measure the DDR bandwidth actually being utilised under load?
I would like to measure it when a neural network inference is being calculated on a Xavier.
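One rough approach is to read the EMC_FREQ field that tegrastats prints, which reports the share of EMC bandwidth in use relative to the current EMC clock. The sketch below only parses such lines and scales the percentage by a theoretical peak, so it gives an estimate rather than a measurement. The sample line is invented for illustration, the 256-bit bus width is an assumption about the Xavier module, and emc_bandwidth_mbs is a made-up helper name. Lines without the @<MHz> part are skipped.

```shell
#!/bin/sh
# Estimate DDR traffic from tegrastats lines containing "EMC_FREQ <pct>%@<MHz>".
# BUS_BITS is an assumption (256-bit LPDDR4x); adjust it for the actual module.
BUS_BITS="${BUS_BITS:-256}"

# Reads tegrastats lines on stdin, prints one estimated MB/s value per matching line.
emc_bandwidth_mbs() {
    sed -n 's/.*EMC_FREQ \([0-9][0-9]*\)%@\([0-9][0-9]*\).*/\1 \2/p' |
    awk -v bits="$BUS_BITS" '{
        peak = bits / 8 * $2 * 2      # bytes per transfer * MT/s (double data rate)
        printf "%.0f\n", peak * $1 / 100
    }'
}

# Hypothetical sample line; on a device, pipe the output of "sudo tegrastats" in instead.
sample="RAM 4100/15692MB CPU [35%@2265,20%@2265] EMC_FREQ 25%@2133 GR3D_FREQ 80%@1377"
echo "$sample" | emc_bandwidth_mbs
```

Running this while the inference workload is active, and again while the system is idle, gives a rough before/after comparison of DDR traffic.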