[Jetson TX2] unexpected delay during memory comparison of uncached buffers

Hi,

We are facing an issue with buffers allocated using dma_alloc_coherent.
We do a loopback test on a custom PCIe hardware module.
At the end of the loopback test we do a memcmp of the src and dst buffers.
We find that the time taken by memcmp sometimes increases 8-fold.
Comparing 8 MB usually takes around 24 ms.
But intermittently this shoots up to around 160 ms.

We can reproduce this even if we don’t do any data transfer.

The sequence is as follows.

  1. Allocate src and dst buffers using the driver's mmap call
  2. Fill in the data
  3. Do data comparison
  4. Free the mmap memory
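
For anyone who wants to try the sequence above, here is a minimal user-space sketch of one allocate / fill / compare / free cycle with only the memcmp timed. Note the assumptions: an anonymous mmap stands in for the driver's mmap (so the memory here is ordinary cached memory, not the uncached dma_alloc_coherent buffers), and the names run_cycle_ms and count_slow_cycles are made up for illustration. To test the real case, the two mmap calls would be replaced by mmap on the driver's device node.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/* Elapsed milliseconds between two CLOCK_MONOTONIC timestamps. */
static double elapsed_ms(const struct timespec *t0, const struct timespec *t1)
{
    return (t1->tv_sec - t0->tv_sec) * 1e3 + (t1->tv_nsec - t0->tv_nsec) / 1e6;
}

/*
 * One allocate / fill / compare / free cycle.
 * Returns the time spent in memcmp alone (ms), or -1.0 if mapping failed.
 * ASSUMPTION: anonymous mmap is a stand-in for the driver's mmap call.
 */
double run_cycle_ms(size_t len, int *cmp_result)
{
    assert(len > 0 && cmp_result != NULL);

    void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return -1.0;
    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) {
        munmap(src, len);
        return -1.0;
    }

    /* Step 2: fill both buffers with the same pattern. */
    memset(src, 0xA5, len);
    memset(dst, 0xA5, len);

    /* Step 3: time the comparison only. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *cmp_result = memcmp(src, dst, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Step 4: free the mappings. */
    munmap(dst, len);
    munmap(src, len);
    return elapsed_ms(&t0, &t1);
}

/* Repeat the cycle and count iterations whose memcmp exceeded threshold_ms. */
int count_slow_cycles(int iterations, size_t len, double threshold_ms)
{
    int slow = 0;
    for (int i = 0; i < iterations; i++) {
        int result;
        double ms = run_cycle_ms(len, &result);
        if (ms > threshold_ms)
            slow++;
    }
    return slow;
}
```

Calling count_slow_cycles(100, 8u * 1024u * 1024u, 100.0) would mirror the 100-iteration run with 8 MB buffers; the 100 ms threshold is an arbitrary cut between the 24 ms and 160 ms cases.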

If we repeat the above for 100 iterations, the issue occurs about 10 times.

The driver's mmap is implemented as follows. The device is registered as a PCIe device.

  1. Allocate the buffer using dma_alloc_coherent as below.
    vaddr = dma_alloc_coherent(&dev, dma_alloc_size, &dma_addr, GFP_KERNEL);
  2. Map the memory as below.
    dma_mmap_coherent(&dev, vma, vaddr, dma_addr + mmap_pgoff, bytes);

The physical addresses allocated are 0x80000000 and 0x80900000.

This does not seem to happen if we do the allocation only once and run the memcmp multiple times.
It seems to happen only if the buffer is de-allocated and allocated again.

Can you please provide us some clue on what may be causing the additional delay?

One guess we have is that it is related to the DDR burst length used when loading from uncached space.
This could happen if the DDR burst length used for loads from uncached space is changed to 8x the normal value.
It looks like different parameters are being used for the load path from DDR to the processor registers when the issue happens.

Do you think this is a possibility?

We get the same physical addresses after every de-allocate and allocate cycle.

Another observation is that the slow case is always 8 times the normal value.
So it seems the additional delay is applied to every transaction within the 8 MB.
There is no randomness inside the 8 MB block.
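
For scale, and assuming MB here means MiB, 8 MB in 24 ms works out to about 333 MiB/s and 8 MB in 160 ms to 50 MiB/s. One way to check the claim that the slowdown is uniform across the block is to time the comparison chunk by chunk. The following is only a sketch of that idea; the function names and the chunked approach are illustrative and not what our test currently does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Milliseconds between two CLOCK_MONOTONIC timestamps. */
static double ts_diff_ms(struct timespec t0, struct timespec t1)
{
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Throughput in MiB/s for `bytes` processed in `ms` milliseconds. */
double throughput_mib_s(size_t bytes, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}

/*
 * Compare a and b in pieces of `chunk` bytes, recording the time of each
 * memcmp in out_ms[]. Stops after out_cap chunks. Returns the number of
 * chunks timed; *mismatch is set to 1 if any chunk differed, else 0.
 */
size_t time_chunks_ms(const unsigned char *a, const unsigned char *b,
                      size_t len, size_t chunk,
                      double *out_ms, size_t out_cap, int *mismatch)
{
    assert(chunk > 0 && out_ms != NULL && mismatch != NULL);

    size_t n = 0;
    *mismatch = 0;
    for (size_t off = 0; off < len && n < out_cap; off += chunk, n++) {
        size_t sz = (len - off < chunk) ? len - off : chunk;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int r = memcmp(a + off, b + off, sz);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (r != 0)
            *mismatch = 1;
        out_ms[n] = ts_diff_ms(t0, t1);
    }
    return n;
}
```

If every chunk is slower by about the same factor in a bad iteration, the delay is per access; if only a few chunks stand out, something else interrupted the comparison.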

The Jetson TX2 runs Ubuntu 16.04. The Jetson clocks script is run at the start.

Thank you
-abdul

Could it be that a context switch is taking place in between, causing this increase in time? Have you profiled and confirmed that the extra time is spent solely in memcmp()?
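
One possible way to check this from user space is to record wall-clock time, thread CPU time, and the context-switch counters around the memcmp. This is just a sketch: the struct and function names are made up, and getrusage(RUSAGE_SELF) counts switches for the whole process, so it is only meaningful if the test is single-threaded.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/resource.h>
#include <time.h>

struct cmp_profile {
    double wall_ms;              /* CLOCK_MONOTONIC time around memcmp      */
    double cpu_ms;               /* CPU time this thread actually consumed  */
    long voluntary_switches;     /* process-wide delta of ru_nvcsw          */
    long involuntary_switches;   /* process-wide delta of ru_nivcsw         */
    int result;                  /* memcmp return value                     */
};

/* Milliseconds between two timespec values taken from the same clock. */
static double span_ms(struct timespec t0, struct timespec t1)
{
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Profile a single memcmp. Returns 0 on success, -1 if getrusage failed. */
int profile_memcmp(const void *a, const void *b, size_t len,
                   struct cmp_profile *out)
{
    assert(out != NULL);

    struct rusage r0, r1;
    struct timespec w0, w1, c0, c1;

    if (getrusage(RUSAGE_SELF, &r0) != 0)
        return -1;
    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);

    out->result = memcmp(a, b, len);

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);
    if (getrusage(RUSAGE_SELF, &r1) != 0)
        return -1;

    out->wall_ms = span_ms(w0, w1);
    out->cpu_ms = span_ms(c0, c1);
    out->voluntary_switches = r1.ru_nvcsw - r0.ru_nvcsw;
    out->involuntary_switches = r1.ru_nivcsw - r0.ru_nivcsw;
    return 0;
}
```

If wall_ms is much larger than cpu_ms, or involuntary_switches is non-zero in the slow iterations, the thread was descheduled during the comparison. If cpu_ms itself grows in the slow case, the time really is being spent executing memcmp.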

Yes, the time we are talking about is the time taken by memcmp alone. Even if we do not do any DMA transfer with the buffers, we can reproduce the increase in time. We just have to do the mapping and follow the sequence mentioned above.

Hi Vidyas,

Do you have any other thoughts on this issue?

-abdul