Poor read/write performance to/from mapped memory on Xavier AGX

Hi,

I have one Xavier AGX connected to a host x86 PC. They are connected with a PCIe connector and use the PCIe x16 external slot of Xavier AGX. The Xavier AGX is configured as an endpoint device (NVIDIA RAM Memory).

In the endpoint function driver (‘pci-epf-nv-test.c’, located in the kernel/nvidia/drivers/pci/endpoint/functions directory of the L4T kernel source), code has been added to allocate 256 MB of DMA memory with dma_alloc_coherent.

Information about the memory allocated with dma_alloc_coherent is exported and used by another PCIe driver. In this second driver, dma_mmap_coherent is used to map the memory allocated with dma_alloc_coherent in the ‘pci-epf-nv-test.c’ driver. The user can then access this memory by calling mmap on the character device created by the second driver.

Everything works: a user application can read from and write to this mmapped memory. However, copying 256 MB from this mmapped memory to a local buffer (allocated with malloc, for example) performs very poorly (around 78 MB/s). Writing from a local buffer (allocated with malloc) to the mmapped memory is also slow (around 1.5 GB/s). For comparison, copying between two local buffers runs at around 6 GB/s.
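For reference, a copy measurement of this kind could be sketched roughly as below. This is only a minimal sketch, not the actual test application: the function name copy_throughput_mbps and its parameters are hypothetical, and it assumes 1 MB = 10^6 bytes and a POSIX system providing clock_gettime.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not from the original test application).
 * Copies `len` bytes from `src` to `dst` `iters` times and returns the
 * average rate in MB/s (1 MB = 1e6 bytes), or -1.0 on error.
 */
double copy_throughput_mbps(void *dst, const void *src, size_t len, int iters)
{
    struct timespec t0, t1;
    double secs;
    int i;

    if (dst == NULL || src == NULL || len == 0 || iters <= 0)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    for (i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier so the copy cannot be optimized away. */
        __asm__ __volatile__("" ::: "memory");
    }

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;

    return ((double)len * (double)iters) / secs / 1e6;
}
```

To time the read direction, pass the pointer returned by mmap on the character device as src and a malloc'ed buffer as dst; swap the two for the write direction, and use two malloc'ed buffers for the local baseline.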

How can I improve the performance of reading from and writing to the mmapped memory?

Thanks!

Please, I need help. Any suggestions? Thanks.

Can you share your modified driver files and how you are measuring throughput?

Hi, I can share our modified driver files and our current test application with you privately. How can I send these files to you privately? Thanks

Hi @RokiaDiarr, You can send files through a private message. Click on the member avatar and hit the “message” button.

Best,
Tom

Hi @TomNVIDIA, @omp

Sorry for the delay, it was the weekend. I sent you our modified driver files a few minutes ago. Thanks

Hi @TomNVIDIA, @omp

Have you received the files I sent you?

Hi @RokiaDiarr,

Unfortunately I am not a technical resource, @omp will need to look at the file.

Best,
Tom

Hi @omp,

Any update? Thanks

Sorry for the delay in replying…
We are discussing this issue with our memory team.

Can you please try replacing dma_alloc_coherent() with dma_alloc_writecombined() and performing an explicit dsb() before letting userspace code access the data?

Apologies for the delay, we have just returned from a long weekend.
I’m not sure I understand: should I replace dma_alloc_coherent() with dma_alloc_writecombined()?
And how can I perform an explicit dsb()? (In which file is dsb() defined?) I searched on Google, but I didn’t find a clear answer. Thanks

dsb() can be found in the arch/arm64/include/asm/barrier.h file.

Using dma_alloc_writecombined with a subsequent call to dsb(sy) will improve the write performance only.
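As a rough illustration of this suggestion, the kernel-side change might look something like the sketch below. It assumes that "dma_alloc_writecombined" refers to the kernel's write-combine DMA allocator, which is named dma_alloc_wc() in recent kernels (dma_alloc_writecombine() is the older name). The struct shared_buf and the shared_buf_* helpers are hypothetical names, not part of pci-epf-nv-test.c, and the code only builds as part of a kernel module.

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <asm/barrier.h>	/* dsb() on arm64 */

/* Hypothetical container for the shared buffer; not from the posted driver. */
struct shared_buf {
	struct device *dev;
	void *cpu_addr;
	dma_addr_t dma_addr;
	size_t size;
};

/* Allocate the buffer write-combined instead of with dma_alloc_coherent(). */
static int shared_buf_alloc(struct shared_buf *b, struct device *dev,
			    size_t size)
{
	b->cpu_addr = dma_alloc_wc(dev, size, &b->dma_addr, GFP_KERNEL);
	if (!b->cpu_addr)
		return -ENOMEM;
	b->dev = dev;
	b->size = size;
	return 0;
}

/* Full-system barrier, issued before user space is allowed to read the data. */
static void shared_buf_publish(void)
{
	dsb(sy);
}

/* Body of the character device's mmap handler: map with matching attributes. */
static int shared_buf_mmap(struct shared_buf *b, struct vm_area_struct *vma)
{
	return dma_mmap_wc(b->dev, vma, b->cpu_addr, b->dma_addr, b->size);
}

/* Release the buffer with the free routine that pairs with dma_alloc_wc(). */
static void shared_buf_free(struct shared_buf *b)
{
	dma_free_wc(b->dev, b->size, b->cpu_addr, b->dma_addr);
}
```

The second driver would then call dma_mmap_wc() in place of dma_mmap_coherent(), so that the user mapping uses the same memory attributes as the allocation. As noted above, this is expected to help the write path only.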

Another option:
In the PCI device DT node, add the ‘dma-coherent’ property. As Xavier is an I/O-coherent SoC, no cache operations are needed if this property is set in the DT node.
Did you enable the IOMMU for the PCI device?