Cache issue on DMA buffers

Hi,

While using coherent buffers, a lot of time is spent reading data from DDR to the CPU. The issue observed with coherent buffers is described in the link below.

So, to improve system performance, I have to use cached buffers for the DMA transfers.

The userspace application calls the driver's mmap to allocate large buffers (larger than 4 kB) and then issues sync_for_cpu/sync_for_device ioctls to maintain cache coherency.
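For reference, here is a minimal sketch of that userspace flow (the device node path and the SYNC_FOR_CPU/SYNC_FOR_DEVICE ioctl numbers are placeholders standing in for my driver's real definitions):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024)              /* > 4 kB frame buffer */
/* Placeholder ioctl numbers; the real ones come from the driver header. */
#define SYNC_FOR_CPU    _IO('d', 1)
#define SYNC_FOR_DEVICE _IO('d', 2)

int main(void)
{
    int fd = open("/dev/mydma0", O_RDWR); /* placeholder device node */
    if (fd < 0)
        return 1;

    /* The driver's mmap allocates the cached buffer and maps it here. */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* ... start the DMA transfer into the buffer ... */

    ioctl(fd, SYNC_FOR_CPU);              /* invalidate cache lines */
    /* ... read the frame data from buf ... */

    munmap(buf, BUF_SIZE);
    close(fd);
    return 0;
}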

The cached buffers are allocated in the driver's mmap handler using kmalloc, as shown below.

p_vaddr = kmalloc(buf_size, GFP_KERNEL);

Then dma_map_single() sets up any required IOMMU mapping and returns the DMA address.

p_dma_addr = dma_map_single(&mdev->pdev->dev, p_vaddr,
                            buf_size, DMA_BIDIRECTIONAL);
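One detail worth noting: dma_map_single() can fail, so the returned handle should be checked with dma_mapping_error() before use, for example:

if (dma_mapping_error(&mdev->pdev->dev, p_dma_addr)) {
    kfree(p_vaddr);
    return -ENOMEM;
}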

Then this kernel memory is mapped to userspace using remap_pfn_range.

rv = remap_pfn_range(vma, vma->vm_start,
                     PFN_DOWN(virt_to_phys(p_vaddr)) + vma->vm_pgoff,
                     buf_size, vma->vm_page_prot);

With the above mmap implementation, the userland application can allocate more than 4 kB of kernel memory and use the buffer to read the frame data.
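Putting the pieces together, the mmap handler looks roughly like this (a simplified sketch; mydrv_mmap, struct mydev, and the stashed fields are placeholder names, and locking is omitted):

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct mydev *mdev = filp->private_data;
    size_t buf_size = vma->vm_end - vma->vm_start;
    void *p_vaddr;
    dma_addr_t p_dma_addr;
    int rv;

    /* Cached, physically contiguous kernel buffer. */
    p_vaddr = kmalloc(buf_size, GFP_KERNEL);
    if (!p_vaddr)
        return -ENOMEM;

    /* Streaming DMA mapping (sets up the IOMMU mapping if present). */
    p_dma_addr = dma_map_single(&mdev->pdev->dev, p_vaddr,
                                buf_size, DMA_BIDIRECTIONAL);
    if (dma_mapping_error(&mdev->pdev->dev, p_dma_addr)) {
        rv = -ENOMEM;
        goto err_free;
    }

    /* Map the same pages into the calling process. */
    rv = remap_pfn_range(vma, vma->vm_start,
                         PFN_DOWN(virt_to_phys(p_vaddr)) + vma->vm_pgoff,
                         buf_size, vma->vm_page_prot);
    if (rv)
        goto err_unmap;

    /* Stash the addresses for the sync ioctls. */
    mdev->p_vaddr = p_vaddr;
    mdev->p_dma_addr = p_dma_addr;
    return 0;

err_unmap:
    dma_unmap_single(&mdev->pdev->dev, p_dma_addr,
                     buf_size, DMA_BIDIRECTIONAL);
err_free:
    kfree(p_vaddr);
    return rv;
}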

Once everything is in place, the DMA engine copies frame data to the DMA address (p_dma_addr), and after the transfer the sync_for_cpu ioctl is called from userspace to invalidate the cache.

The implementation of the sync_for_cpu ioctl is given below:

dma_sync_single_for_cpu(&mdev->pdev->dev, p_dma_addr,
                        buf_size, DMA_FROM_DEVICE);

After invalidating the cache, the buffer is read by the userspace application.
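For completeness, the matching sync_for_device ioctl is the mirror image; it is called before handing the buffer back to the device for the next transfer (a sketch, using the same fields as above):

/* Return buffer ownership to the device before the next DMA
 * transfer into it. */
dma_sync_single_for_device(&mdev->pdev->dev, p_dma_addr,
                           buf_size, DMA_FROM_DEVICE);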

On the TX2, the frame data read from the cached buffer is incorrect (only partially correct); it looks like the cache is not being invalidated properly. The same implementation works fine on x86 systems and on other arm64-based boards such as the i.MX 8MQuad EVK.

Is any TX2-specific implementation needed?

Hi Flemin,
Please try pinning the user-space task to a specific core, running the test pinned to an Arm core and to a Denver core separately. Also, please run another test with CONFIG_ARM64_SW_TTBR0_PAN disabled.
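For the pinning, taskset works, or the test code can pin itself; a minimal sketch using sched_setaffinity() (the core index is just an example, and which indices are Arm vs. Denver cores depends on the TX2 CPU numbering):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling task to one CPU core. */
static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}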
Could you also share your test code to reproduce the issue, along with the reference numbers from the other boards?

Hi sumitg,

I tried pinning the user-space task to a specific core and ran the test pinned to an Arm core and a Denver core separately, but it didn't help.
The other test you suggested was disabling CONFIG_ARM64_SW_TTBR0_PAN, but I couldn't find that option in the L4T source code. I'm using JetPack 3.3 with L4T 28.2.1 on a Jetson TX2.

Shall I share the test code for reproducing the issue? Could you please share your email ID?

Hi Flemin,
You can attach the code to this forum post, or share it at sumitg@nvidia.com, so we can replicate the problem and check.
‘CONFIG_ARM64_SW_TTBR0_PAN’ is not present in R28, so there is no need to disable it. In general, it's better to use the latest R32 release.