Slow memory copy performance - how to set EMC clock?

I’d like to know: does setting MAX-N with nvpmodel also set the EMC clock to its max rate?

I’m porting a custom kernel driver from TX1 to TX2. One of its operations is to copy a 2.5MB frame from dma_alloc_coherent memory to user memory.

On the TX1 with the EMC clock set to max frequency, this copy takes 3.5ms. On the TX2 in MAX-N mode, the copy takes almost 11ms.

On the TX1, we set the EMC clock to max with these statements:

cat /sys/kernel/debug/clock/emc/max > /sys/kernel/debug/clock/override.emc/rate
echo 1 > /sys/kernel/debug/clock/override.emc/state

But this file path no longer exists on the TX2, so I’m not sure if I have its EMC clock at max. What is the correct way to max the EMC clock?

You may try the jetson_clocks.sh script, found in the ubuntu or nvidia user’s home directory:

sudo /home/ubuntu/jetson_clocks.sh --show         # show current clocks
sudo /home/ubuntu/jetson_clocks.sh                # boost clocks
sudo /home/ubuntu/jetson_clocks.sh --show         # show new clocks and check the changes
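
If you only want to pin the EMC clock, similar to the TX1 override.emc interface, the equivalent debugfs nodes on the TX2 live under the BPMP. These are the paths jetson_clocks.sh uses on L4T 28.x, so verify them on your release:

echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/mrq_rate_locked
cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate > /sys/kernel/debug/bpmp/debug/clk/emc/rate

Run these from a root shell (sudo -i), since the redirections won’t pass through plain sudo.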

Thanks, not sure how I missed that… Now the 2.5MB copy_to_user() from coherent memory executes in 3.5ms, the same as on the TX1.

I had expected/hoped this copy time would improve with the TX2, though, as the specs say it has higher memory bandwidth:

Memory
TX2 = 8 GB, 128-bit LPDDR4, 59.7 GB/s
TX1 = 4 GB, 64-bit LPDDR4, 25.6 GB/s
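
(For what it’s worth, if my arithmetic is right, 2.5 MB in 3.5 ms is only about 0.7 GB/s, or roughly 1.4 GB/s of combined read-plus-write traffic, which is a small fraction of either module’s theoretical peak.)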

Is it right to expect a memory copy to improve from TX1 to TX2?

Hi,

I used to have the same issue, and it was fixed after applying this change:
https://devtalk.nvidia.com/default/topic/1009011/jetson-tx2/kernel-4-4-drivers-platform-tegra-mc-isomgr-c-isomgr_init-fails-to-initialize/post/5151445/#5151445

Regards,

I have a version of the board with the UARTs severed, so the clocks are now all at max. Still, the memory copy performance is identical to the TX1’s. Is that to be expected? I thought the TX2 would be about twice as fast, based on the specs.

The operation is copy_to_user() in a kernel driver copying from coherent memory to user space.
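
For context, here is a minimal sketch of the pattern being timed (the names and frame size are illustrative, not the actual driver):

#include <linux/dma-mapping.h>
#include <linux/uaccess.h>

#define FRAME_BYTES (2560 * 1024)   /* ~2.5 MB frame (illustrative size) */

struct frame_dev {
        struct device *dev;
        void *vaddr;                /* CPU address from dma_alloc_coherent() */
        dma_addr_t bus;             /* device-visible address */
};

static int frame_dev_alloc(struct frame_dev *fd)
{
        fd->vaddr = dma_alloc_coherent(fd->dev, FRAME_BYTES, &fd->bus,
                                       GFP_KERNEL);
        return fd->vaddr ? 0 : -ENOMEM;
}

/* The operation being benchmarked: ~3.5 ms for the 2.5 MB frame. */
static long frame_dev_read(struct frame_dev *fd, void __user *ubuf)
{
        if (copy_to_user(ubuf, fd->vaddr, FRAME_BYTES))
                return -EFAULT;
        return 0;
}

Note that dma_alloc_coherent() memory is typically mapped uncached (or write-combined) on ARM SoCs without hardware coherency, which matters for the copy speed.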

I may be way off base here, but it looks as if the compiled memory operations are not taking advantage of the wider memory bus, or there is a choke point somewhere. The clock rates also look about the same between the TX1 and TX2.

Indeed, the reported max EMC clocks are similar:

TX1 = 1600000000 Hz (1.6 GHz)
TX2 = 1866000000 Hz (1.866 GHz)

It makes sense that the performance advantage would come from the 128-bit bus, if there is a way to take advantage of it.
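
As a sanity check, assuming the quoted values are the EMC clock rates and LPDDR4 transfers on both clock edges, the spec-sheet numbers fall straight out of clock × width:

TX1: 1.6 GHz × 2 (DDR) × 8 bytes (64-bit) = 25.6 GB/s
TX2: 1.866 GHz × 2 (DDR) × 16 bytes (128-bit) = 59.7 GB/s

So similar EMC clocks are expected, and essentially all of the TX2’s ~2.3× bandwidth headroom is in the bus width.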

Assuming you’re using cached memory, the transfer rate into and out of cache should be double with double the memory width.

If you are using non-cached memory, then each memory transaction takes a fixed amount of time, so using wider instructions (NEON vector instructions, for example) lets you make the most of what you have.
For example, NEON intrinsics let you issue 128-bit memory operations at a time:
https://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/ARM-NEON-Intrinsics.html
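
As a rough userspace sketch (not your driver code), a NEON copy loop that issues one 128-bit load and one 128-bit store per iteration might look like this; it assumes the size is a multiple of 16 bytes and leaves out head/tail handling:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

static void copy_neon_128(uint8_t *dst, const uint8_t *src, size_t n)
{
        for (size_t i = 0; i < n; i += 16) {
                uint8x16_t v = vld1q_u8(src + i);  /* one 128-bit load  */
                vst1q_u8(dst + i, v);              /* one 128-bit store */
        }
}

Keep in mind that using NEON inside a kernel driver on arm64 requires bracketing the code with kernel_neon_begin()/kernel_neon_end(), since the kernel does not preserve FP/SIMD state by default.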