Cache issue on DMA buffers

Hi,

While using coherent buffers, a lot of time is spent reading data from DDR to the CPU. The issue observed with coherent buffers is described in the link below.

So, to improve system performance, I have to use cached buffers for the DMA transfers.

The userspace application calls the driver's mmap to allocate large buffers (larger than 4 kB) and then issues sync_for_cpu/sync_for_device ioctls to maintain cache coherency.
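For reference, here is a minimal sketch of that userspace flow (the device node path and the SYNC_FOR_CPU/SYNC_FOR_DEVICE ioctl numbers are placeholders standing in for my driver's real definitions):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024)              /* > 4 kB frame buffer */
/* Placeholder ioctl numbers; the real ones come from the driver header. */
#define SYNC_FOR_CPU    _IO('d', 1)
#define SYNC_FOR_DEVICE _IO('d', 2)

int main(void)
{
    int fd = open("/dev/mydma0", O_RDWR); /* placeholder device node */
    if (fd < 0)
        return 1;

    /* The driver's mmap allocates the cached buffer and maps it here. */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* ... start the DMA transfer into the buffer ... */

    ioctl(fd, SYNC_FOR_CPU);              /* invalidate cache lines */
    /* ... read the frame data from buf ... */

    munmap(buf, BUF_SIZE);
    close(fd);
    return 0;
}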

The cached buffers are allocated in the driver's mmap handler using kmalloc, as shown below.

p_vaddr = kmalloc(buf_size, GFP_KERNEL);

Then dma_map_single() sets up any required IOMMU mapping and returns the DMA address.

p_dma_addr = dma_map_single(&mdev->pdev->dev, p_vaddr,
                            buf_size, DMA_BIDIRECTIONAL);
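One detail worth noting: dma_map_single() can fail, so the returned handle should be checked with dma_mapping_error() before use, for example:

if (dma_mapping_error(&mdev->pdev->dev, p_dma_addr)) {
    kfree(p_vaddr);
    return -ENOMEM;
}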

Then this kernel memory is mapped to userspace using remap_pfn_range.

rv = remap_pfn_range(vma, vma->vm_start,
                     PFN_DOWN(virt_to_phys(p_vaddr)) + vma->vm_pgoff,
                     buf_size, vma->vm_page_prot);

With the above mmap implementation, the userland application can allocate more than 4 kB of kernel memory and use the buffer to read the frame data.
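Putting the pieces together, the mmap handler looks roughly like this (a simplified sketch; mydrv_mmap, struct mydev, and the stashed fields are placeholder names, and locking is omitted):

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct mydev *mdev = filp->private_data;
    size_t buf_size = vma->vm_end - vma->vm_start;
    void *p_vaddr;
    dma_addr_t p_dma_addr;
    int rv;

    /* Cached, physically contiguous kernel buffer. */
    p_vaddr = kmalloc(buf_size, GFP_KERNEL);
    if (!p_vaddr)
        return -ENOMEM;

    /* Streaming DMA mapping (sets up the IOMMU mapping if present). */
    p_dma_addr = dma_map_single(&mdev->pdev->dev, p_vaddr,
                                buf_size, DMA_BIDIRECTIONAL);
    if (dma_mapping_error(&mdev->pdev->dev, p_dma_addr)) {
        rv = -ENOMEM;
        goto err_free;
    }

    /* Map the same pages into the calling process. */
    rv = remap_pfn_range(vma, vma->vm_start,
                         PFN_DOWN(virt_to_phys(p_vaddr)) + vma->vm_pgoff,
                         buf_size, vma->vm_page_prot);
    if (rv)
        goto err_unmap;

    /* Stash the addresses for the sync ioctls. */
    mdev->p_vaddr = p_vaddr;
    mdev->p_dma_addr = p_dma_addr;
    return 0;

err_unmap:
    dma_unmap_single(&mdev->pdev->dev, p_dma_addr,
                     buf_size, DMA_BIDIRECTIONAL);
err_free:
    kfree(p_vaddr);
    return rv;
}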

Once everything is in place, the DMA engine copies frame data to the DMA address (p_dma_addr), and after the transfer the sync_for_cpu ioctl is called from userspace to invalidate the cache.

The implementation of the sync_for_cpu ioctl is given below:

dma_sync_single_for_cpu(&mdev->pdev->dev, p_dma_addr,
                        buf_size, DMA_FROM_DEVICE);

After invalidating the cache, the buffer is read by the userspace application.
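For completeness, the matching sync_for_device ioctl is the mirror image; it is called before handing the buffer back to the device for the next transfer (a sketch, using the same fields as above):

/* Return buffer ownership to the device before the next DMA
 * transfer into it. */
dma_sync_single_for_device(&mdev->pdev->dev, p_dma_addr,
                           buf_size, DMA_FROM_DEVICE);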

On the TX2, the frame data read from the cached buffer is incorrect (only partially correct); it looks like the cache is not being invalidated properly. The same implementation works fine on x86 systems and on other arm64-based boards such as the i.MX 8MQuad EVK.

Is any TX2-specific implementation needed?

Hi Flemin,
Please try pinning the user-space task to a specific core, running the test pinned to an Arm core and to a Denver core separately. Also, please run another test with CONFIG_ARM64_SW_TTBR0_PAN disabled.
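For the pinning, taskset works, or the test code can pin itself; a minimal sketch using sched_setaffinity() (the core index is just an example, and which indices are Arm vs. Denver cores depends on the TX2 CPU numbering):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling task to one CPU core. */
static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}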
Could you also share your test code to reproduce the issue, along with the reference numbers from the other boards?

Hi sumitg,

I tried pinning the user-space task to a specific core and ran the test pinned to an Arm core and a Denver core separately, but it didn't help.
The other test you suggested was disabling CONFIG_ARM64_SW_TTBR0_PAN, but I couldn't find that option in the L4T source code. I'm using JetPack 3.3 with L4T 28.2.1 on a Jetson TX2.

Shall I share the test code for reproducing the issue? Could you please share your email ID?

Hi Flemin,
You can attach the code to this forum post, or share it at sumitg@nvidia.com, so we can replicate the problem and check.
‘CONFIG_ARM64_SW_TTBR0_PAN’ is not present in R28, so there is no need to disable it. In general, it's better to use the latest R32 release.