Xavier memcpy speed is unstable

I have made sure that the Xavier CPU and DDR clocks are running at their maximum (DDR at 2133 MHz) using jetson_clocks and tegrastats. But when I copy 4 MB of data from one buffer to another in the kernel, the copy time is unstable: sometimes it is about 1 ms, and sometimes it is about 20 ms (too slow). How can I disable dynamic CPU and DDR clock adjustment (DVFS) completely? Thanks.

Sample code:

dma_virt[0] = dma_alloc_coherent(NULL, 4 * SZ_1M, &dma_phys[0], GFP_KERNEL);
test_virt = vmalloc(4 * SZ_1M);
test1_virt = vmalloc(4 * SZ_1M);
memset(test_virt, 0, 4 * SZ_1M);
memset(test1_virt, 0, 4 * SZ_1M);

for (i = 0; i < 360; i++) {
	/* non-cacheable (dma_alloc_coherent) source -> cacheable (vmalloc) destination */
	dma_start_time = ktime_get();
	memcpy(test_virt, dma_virt[0], 4 * SZ_1M);
	dma_end_time = ktime_get();

	/* cacheable (vmalloc) source -> cacheable (vmalloc) destination */
	dma1_start_time = ktime_get();
	memcpy(test_virt, test1_virt, 4 * SZ_1M);
	dma1_end_time = ktime_get();

	printk(KERN_ERR "length=%u iommu_to_buf=%lld, buf_to_buf=%lld\n", 4 * SZ_1M,
		ktime_to_ns(dma_end_time) - ktime_to_ns(dma_start_time),
		ktime_to_ns(dma1_end_time) - ktime_to_ns(dma1_start_time));
}

Output (the time unit is ns):
[11609.757225] length=4194304 iommu_to_buf=15583232, buf_to_buf=519904
[11609.773674] length=4194304 iommu_to_buf=15842048, buf_to_buf=471392
[11609.789595] length=4194304 iommu_to_buf=15266816, buf_to_buf=488000
[11609.806189] length=4194304 iommu_to_buf=15972224, buf_to_buf=483456
[11609.822609] length=4194304 iommu_to_buf=15805408, buf_to_buf=475168
[11609.839206] length=4194304 iommu_to_buf=15990144, buf_to_buf=472224
[11609.855744] length=4194304 iommu_to_buf=15926176, buf_to_buf=475840
[11609.872415] length=4194304 iommu_to_buf=16070528, buf_to_buf=466368
[11609.889057] length=4194304 iommu_to_buf=15971648, buf_to_buf=538080
[11609.903147] length=4194304 iommu_to_buf=13558144, buf_to_buf=396000
[11609.916641] length=4194304 iommu_to_buf=12885440, buf_to_buf=472992
[11609.932401] length=4194304 iommu_to_buf=15135456, buf_to_buf=476800
[11609.948966] length=4194304 iommu_to_buf=15903680, buf_to_buf=507552
[11609.965593] length=4194304 iommu_to_buf=15989568, buf_to_buf=470144
[11609.982254] length=4194304 iommu_to_buf=16034944, buf_to_buf=463840
[11609.998853] length=4194304 iommu_to_buf=16000000, buf_to_buf=465376
[11610.015526] length=4194304 iommu_to_buf=16066240, buf_to_buf=472928
[11610.032190] length=4194304 iommu_to_buf=16057664, buf_to_buf=469440
[11610.048737] length=4194304 iommu_to_buf=15946976, buf_to_buf=466816
[11610.062698] length=4194304 iommu_to_buf=13437376, buf_to_buf=391328

Please take a look at the Clock Frequency and Power Management section to see if it helps.

Hi vickyy,
It seems that the DDR clock is correct. But the speed of dma_buf -> vmalloc_buf is about 15 times slower than vmalloc_buf -> vmalloc_buf, and it is lower than single-channel DDR speed. How can this be explained? Thanks.

vmalloc() creates a cacheable mapping, but dma_alloc_coherent() creates a non-cacheable mapping. Reading from a non-cacheable mapping is slower than reading from a cacheable one.
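
One way to confirm that the penalty is on the read side is to time a read-only pass over each buffer. Below is a hypothetical sketch (it reuses dma_virt[0] and test_virt from the sample code above; the helper name timed_read is made up): the non-cacheable source should account for almost all of the difference.

/* Sketch: read-only pass to isolate the read cost of one mapping.
 * Call it once on dma_virt[0] (non-cacheable) and once on test_virt
 * (cacheable), then compare the reported times. */
static u64 timed_read(const u64 *buf, size_t len)
{
	ktime_t start = ktime_get();
	u64 sum = 0;
	size_t i;

	for (i = 0; i < len / sizeof(u64); i++)
		sum += READ_ONCE(buf[i]);	/* keep the loads from being optimized out */

	pr_info("read %zu bytes in %lld ns (sum=%llu)\n",
		len, ktime_to_ns(ktime_sub(ktime_get(), start)), sum);
	return sum;
}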

But the non-cacheable mapping is very slow (about 0.2 GB/s), which is slower than DDR4 single-channel bandwidth. How can this be explained? Thanks.

Referring to our internal results, the number seems reasonable.
Have you run the test on any other platform?

Correction: your number is quite low compared to our result.
We will check internally and update you here.

Hi vickyy,
I find that memory allocated by dma_alloc_coherent from CMA is very slow, and the speed is unstable.

Hi vickyy,
Is there any update?

Does disabling dynamic frequency scaling help?

Hi vickyy,
The CPU and EMC are running at their maximum frequency. The buffer from CMA is very slow, but the others are fine.

Hi 756948396,

This is still under internal check. We will update you once there is progress.

Your result for the bandwidth from a non-cacheable buffer to a cacheable buffer is not too surprising. Calling dma_alloc_coherent for a NULL device is discouraged; for a NULL device, the IOMMU will not kick in either.

What are you experimenting for? Explicitly invalidating the data before reading it from DRAM, and writing to a CPU-cached address, may speed up the “dma_addr” to “CPU addr” data transfer. Could you use dma_sync_single_for_cpu, even though it is meant for actual devices behind the IOMMU?
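
A minimal sketch of that suggestion, assuming a real device is available (the "pdev" below is hypothetical, standing in for the PCIe device): keep the buffer cacheable, map it with dma_map_single(), and call dma_sync_single_for_cpu() before the CPU reads it, so the copy runs at cacheable-read speed.

/* Hypothetical sketch: "pdev" is assumed to be the PCIe device.
 * A 4 MB contiguous kmalloc() is only for illustration and may fail;
 * alloc_pages() or smaller chunks would be more robust. */
void *buf = kmalloc(4 * SZ_1M, GFP_KERNEL);	/* cacheable kernel memory */
dma_addr_t handle;

handle = dma_map_single(&pdev->dev, buf, 4 * SZ_1M, DMA_FROM_DEVICE);
if (dma_mapping_error(&pdev->dev, handle))
	return -ENOMEM;

/* ... the device DMAs into the buffer via "handle" ... */

/* Hand ownership back to the CPU: stale cache lines are invalidated,
 * so the following copy reads through the cache. */
dma_sync_single_for_cpu(&pdev->dev, handle, 4 * SZ_1M, DMA_FROM_DEVICE);
memcpy(test_virt, buf, 4 * SZ_1M);

/* Return the buffer to the device for the next transfer. */
dma_sync_single_for_device(&pdev->dev, handle, 4 * SZ_1M, DMA_FROM_DEVICE);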

Hi vickyy,
The reason for using dma_alloc_coherent with a NULL device is that I want to get a physical address to map to the PCIe BAR0 with the IOMMU. A buffer allocated by dma_alloc_coherent with the PCIe device cannot be mapped by the IOMMU again.

The bandwidth of one Xavier DDR channel is about 16 GB/s. I think a non-cacheable buffer should reach that bandwidth, but it does not. How can this be explained? Thanks.

To read from non-cacheable memory at full speed, you have to issue streaming reads, which may be hard to do without inline assembly accessing the SIMD registers.
I wouldn’t assume that memcpy() knows about non-cacheable memory and its special needs.
(In fact, I don’t even know if the Carmel core has the necessary optimizations to stream full pages even when data is not cached – it would be nice if it did, but there are no guarantees.)
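
For reference, here is a rough sketch of what a streaming copy could look like on AArch64, using the non-temporal load/store pair instructions (LDNP/STNP) via inline assembly. It is only an illustration: it assumes dst and src are 16-byte aligned and len is a multiple of 64 bytes, and the non-temporal hint is only a hint, so there is no guarantee the Carmel core treats it specially. A variant using the SIMD registers (LDNP/STNP on Q registers) would move 32 bytes per instruction, but inside the kernel that also requires kernel_neon_begin()/kernel_neon_end().

/* Rough sketch of a non-temporal (streaming) copy on AArch64.
 * Assumes dst and src are 16-byte aligned and len is a multiple of 64.
 * LDNP/STNP only provide a hint; the behaviour is implementation-defined. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	size_t off;

	for (off = 0; off < len; off += 64) {
		asm volatile(
			"ldnp x4,  x5,  [%0, #0]\n"
			"ldnp x6,  x7,  [%0, #16]\n"
			"ldnp x8,  x9,  [%0, #32]\n"
			"ldnp x10, x11, [%0, #48]\n"
			"stnp x4,  x5,  [%1, #0]\n"
			"stnp x6,  x7,  [%1, #16]\n"
			"stnp x8,  x9,  [%1, #32]\n"
			"stnp x10, x11, [%1, #48]\n"
			:
			: "r"((const char *)src + off), "r"((char *)dst + off)
			: "x4", "x5", "x6", "x7", "x8", "x9",
			  "x10", "x11", "memory");
	}
}

In the test loop above, this could be called as copy_nontemporal(test_virt, dma_virt[0], 4 * SZ_1M) in place of memcpy() to see whether streaming accesses change the numbers.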