With coherent buffers, a lot of time is spent reading data from DDR into the CPU. The issue observed with coherent buffers is described in the link below.
To improve system performance, I therefore have to use cached buffers for DMA transfers.
The userspace application calls the driver's mmap handler to allocate large buffers (larger than 4 kB) and also issues the sync_for_cpu/sync_for_device ioctl calls to keep the caches coherent.
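For reference, the userspace side of this flow can be sketched roughly as follows. This is only a minimal sketch: the ioctl numbers (FRAME_IOC_*) and the helper names are hypothetical placeholders, since the real driver defines its own, and the ioctls are assumed to take no argument.

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Hypothetical ioctl numbers; the real driver defines its own. */
#define FRAME_IOC_MAGIC           'F'
#define FRAME_IOC_SYNC_FOR_CPU    _IO(FRAME_IOC_MAGIC, 1)
#define FRAME_IOC_SYNC_FOR_DEVICE _IO(FRAME_IOC_MAGIC, 2)

/* Map one driver-allocated buffer into userspace; returns NULL on failure. */
static void *frame_buf_map(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Call after the DMA transfer, before the CPU reads the buffer.
 * Returns 0 on success, -1 on error. */
static int frame_buf_sync_for_cpu(int fd)
{
    return ioctl(fd, FRAME_IOC_SYNC_FOR_CPU) == 0 ? 0 : -1;
}

/* Call before handing the buffer back to the device.
 * Returns 0 on success, -1 on error. */
static int frame_buf_sync_for_device(int fd)
{
    return ioctl(fd, FRAME_IOC_SYNC_FOR_DEVICE) == 0 ? 0 : -1;
}
```

The intended order is frame_buf_map() once, then for each frame: frame_buf_sync_for_device(), wait for the DMA transfer to finish, frame_buf_sync_for_cpu(), and only then read the buffer.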
The cached buffers are allocated in the driver's mmap handler using kmalloc, as shown below.
p_vaddr = kmalloc(buf_size, GFP_KERNEL);
Then dma_map_single() sets up any required IOMMU mapping and returns the DMA address; the kernel virtual address is the one already returned by kmalloc.
p_dma_addr = dma_map_single(&mdev->pdev->dev, p_vaddr,
Then this kernel memory is mapped to userspace using remap_pfn_range.
rv = remap_pfn_range(vma, vma->vm_start,
PFN_DOWN(virt_to_phys(p_vaddr)) + vma->vm_pgoff,
With this mmap implementation, the userspace application can allocate buffers larger than 4 kB from the kernel and use them to read the frame data.
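Since the remap_pfn_range() call adds vma->vm_pgoff to the buffer's page frame number, it is worth checking that the requested range stays inside the allocation. The following is a userspace model of that arithmetic only, not driver code: the sketch_* names are made up for illustration, and a 4 kB page size is assumed.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u /* assumption: 4 kB pages */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Same arithmetic as the kernel's PFN_DOWN(): physical address -> page frame number. */
static uint64_t sketch_pfn_down(uint64_t phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

/* Round a byte count up to a whole number of pages. */
static uint64_t sketch_page_align(uint64_t len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Returns 1 if mapping map_len bytes, starting pgoff pages into a buffer of
 * buf_size bytes, stays inside the buffer; returns 0 otherwise. */
static int sketch_mmap_range_ok(uint64_t buf_size, uint64_t pgoff, uint64_t map_len)
{
    uint64_t start, span;

    if (map_len == 0 || map_len > buf_size)
        return 0;
    if (pgoff > (buf_size >> SKETCH_PAGE_SHIFT))
        return 0;

    start = pgoff << SKETCH_PAGE_SHIFT; /* byte offset of the first mapped page */
    span = sketch_page_align(map_len);  /* mappings cover whole pages */
    return span <= buf_size - start;
}
```

For example, with an 8 kB buffer a request at page offset 1 for 8 kB would run past the end of the allocation, while a 4 kB request at the same offset fits.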
Once everything is in place, the DMA engine copies frame data to the DMA address (p_dma_addr). After the DMA transfer completes, the sync_for_cpu ioctl is called from userspace to invalidate the cache.
The implementation of the sync_for_cpu ioctl is given below:
After the cache is invalidated, the userspace application reads the buffer.
On TX2, the frame data read from the cached buffer is only partially correct. It looks like the cache is not being invalidated properly. The same implementation works fine on x86 systems and on other arm64-based boards such as the i.MX 8MQuad EVK.
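One way to narrow this down is to compare a received frame against a known reference pattern, one cache-line-sized chunk at a time. This is a rough sketch and assumes the DMA source can produce such a pattern. The 64-byte line size is also an assumption and should be adjusted for the target CPU, and the sketch_* names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_LINE_SIZE 64u /* assumed cache-line size in bytes */

struct sketch_diff {
    size_t bad_lines; /* number of line-sized chunks that differ */
    size_t first_bad; /* byte offset of the first differing chunk, or len if none */
};

/* Compare the received data with the expected data chunk by chunk. */
static struct sketch_diff sketch_compare_lines(const unsigned char *got,
                                               const unsigned char *want,
                                               size_t len)
{
    struct sketch_diff d = { 0, len };
    size_t off;

    for (off = 0; off < len; off += SKETCH_LINE_SIZE) {
        /* The last chunk may be shorter than a full line. */
        size_t n = len - off < SKETCH_LINE_SIZE ? len - off : SKETCH_LINE_SIZE;

        if (memcmp(got + off, want + off, n) != 0) {
            if (d.bad_lines == 0)
                d.first_bad = off;
            d.bad_lines++;
        }
    }
    return d;
}
```

If the mismatches consistently cover whole, line-aligned chunks holding old data, that would be consistent with stale cache lines, although it does not prove that invalidation is the cause.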
Is any TX2-specific handling needed for this to work?