How to Carry out DMA transfer when sending data using PCIe from NVIDIA Root Port to a custom end point


I am attempting to write a linux kernel module, which will be used to send data from NVIDIA (root port) to a custom FPGA based end point.

So far, I have been writing data to BAR0 using memcpy() from user space to send data to the end point. However, this gives me very poor throughput. I created a forum post (PCIe Link Speed Issue) and was told to use DMA.

However, I currently have no idea how to do that. Do I use DMA to write the data to BAR0 (with BAR0 acting as the DMA destination address), or do I do something else? I have searched a lot of documentation, but so far I have been unable to find any APIs that will help me do this. (I am new to kernel module writing, so maybe I am missing something obvious here. If anyone could point me in the right direction, I’d be grateful.)

I have looked at dmatest.c, but couldn’t identify any APIs that would help me achieve what I want. I found this thread (How to Using GPC-DMA MEM2MEM Function), which gives a summarised version of performing MEM2MEM DMA, and I successfully managed to do MEM2MEM DMA using this example.
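For context, a MEM2MEM transfer of that kind normally goes through the kernel dmaengine client API. Below is a rough, untested sketch under that assumption; the `"memcpy"` channel name is hypothetical, and completion handling and channel release are only indicated in comments:

```c
#include <linux/dmaengine.h>

/* Hypothetical sketch of a dmaengine MEM2MEM (memcpy) transfer.
 * src_dma and dst_dma are DMA (bus) addresses already mapped for
 * the device; this is not a complete driver. */
static int sketch_memcpy(struct device *dev, dma_addr_t dst_dma,
                         dma_addr_t src_dma, size_t len)
{
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *tx;
	dma_cookie_t cookie;

	chan = dma_request_chan(dev, "memcpy");   /* hypothetical channel name */
	if (IS_ERR(chan))
		return PTR_ERR(chan);

	tx = dmaengine_prep_dma_memcpy(chan, dst_dma, src_dma, len,
				       DMA_PREP_INTERRUPT);
	if (!tx) {
		dma_release_channel(chan);
		return -EINVAL;
	}

	cookie = dmaengine_submit(tx);
	dma_async_issue_pending(chan);

	/* Wait for completion (tx->callback or dma_sync_wait()), then
	 * dma_release_channel(chan) -- omitted in this sketch. */
	return dma_submit_error(cookie) ? -EIO : 0;
}
```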

I then attempted to change the destination address in the MEM2MEM DMA example to the PCI BAR0 address. However, this results in a kernel panic (probably because I can’t use the BAR0 address directly for DMA? Not sure about this). Maybe I can use the BAR0 address as the DMA destination address in MEM2MEM if I disable SMMU?

Or is there some other API that will directly allow me to DMA into the PCI BAR region? (I found something regarding MEM2MMIO mode of GPC DMA in Xavier Technical Reference Manual, but have no idea how I can use it in a kernel module).

Note that I am talking about using DMA when sending data from NVIDIA root port to end point, and not about sending data from end point to root port.

Any help would be greatly appreciated.

Note: I am using Jetpack 5.1.

Sana Ur Rehman

Anyone? Any ideas?

Also, how can I disable SMMU (for PCIe Controller 5)? The device tree contains the following lines:

iommus = <&smmu TEGRA_SID_PCIE5>;
iommu-map = <0x0 &smmu TEGRA_SID_PCIE5 0x1000>;
iommu-map-mask = <0x0>;

Do I need to remove all 4 lines? Or only the first two, as mentioned here (How to disable SMMU on Xavier? - #3 by WayneWWW)? Is this the only change required to disable SMMU?
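For illustration, removing those properties could also be expressed as a device-tree overlay fragment instead of editing the source .dts (the `&pcie_c5` label is hypothetical; the real label for PCIe controller 5 depends on the Jetpack device tree — and whether this alone fully disables SMMU is exactly the open question here):

```dts
/* Hypothetical overlay fragment: delete the IOMMU-related
 * properties quoted above from the PCIe C5 controller node. */
&pcie_c5 {
	/delete-property/ iommus;
	/delete-property/ iommu-map;
	/delete-property/ iommu-map-mask;
};
```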

Some more info regarding my original query:

  1. I allocate a source buffer using kzalloc().
  2. I obtain DMA source address (bus address) by using dma_map_single() for this source buffer.
  3. I obtain destination buffer (PCI BAR0) by using pci_iomap().
  4. I use dma_map_single() on the BAR0 virtual address obtained in step3 to get DMA destination address.
  5. I perform mem2mem DMA using the method in How to Using GPC-DMA MEM2MEM Function

I run into a kernel panic when I run my module, probably because the BAR0 virtual address I obtained in step 3 isn’t DMA-capable. How can I get a DMA-capable address for BAR0? Will disabling SMMU fix the issue, since bus address = physical address without SMMU?
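For reference, the steps above can be sketched as below. This is an untested illustration, not a fix: one thing I am fairly sure of is that dma_map_single() is only valid for regular kernel RAM (e.g. from kzalloc()) and must not be passed the __iomem cookie returned by pci_iomap(), which would explain the panic in step 4. If the BAR can be a DMA target at all, the engine would be programmed with the BAR's bus/physical address (pci_resource_start()) directly:

```c
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Hypothetical sketch of steps 1-5; error unwinding and the actual
 * DMA-engine programming are omitted. */
static int example_setup(struct pci_dev *pdev, size_t len)
{
	void *src = kzalloc(len, GFP_KERNEL);          /* step 1 */
	dma_addr_t src_dma;
	phys_addr_t bar0_phys;

	if (!src)
		return -ENOMEM;

	/* step 2: map the RAM source buffer for the device */
	src_dma = dma_map_single(&pdev->dev, src, len, DMA_TO_DEVICE);
	if (dma_mapping_error(&pdev->dev, src_dma))
		return -EIO;

	/* steps 3-4: do NOT dma_map_single() the pci_iomap() cookie.
	 * The BAR0 bus/physical address is the candidate destination: */
	bar0_phys = pci_resource_start(pdev, 0);

	/* step 5: program the DMA engine with src_dma -> bar0_phys here
	 * (only if the hardware supports MMIO as a DMA destination). */
	return 0;
}
```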

Since there has been no reply so far, I decided to go ahead and try disabling SMMU. I removed the 4 lines mentioned in comment #2 above and rebuilt the kernel. I then verified that SMMU was indeed disabled for PCIe C5 by checking the iommu entries in /sys/kernel/debug: there were entries for the other PCIe controllers, but none for PCIe C5.

Next, I tested my kernel module. The module uses the code in this example (How to Using GPC-DMA MEM2MEM Function - #2 by ShaneCCC), and changes the DMA destination address to the BAR0 physical address.

However, when I load my module, I get the following errors:

[ 4395.312464] arm-smmu 12000000.iommu: Unhandled context fault: fsr=0x80000402, iova=0x1f40008000, fsynr=0x110011, cbfrsynra=0x820, cb=2
[ 4395.313449] mc-err: vpr base=0:ce000000, size=2a0, ctrl=1, override:(a01a8340, fcee10c1, 1, 0)
[ 4395.313749] mc-err: (255) csw_axisw: MC request violates VPR requirements
[ 4395.313937] mc-err: status = 0x0ff7408d; addr = 0x ffffffff00; hi_adr_reg=0x0
[ 4395.314123] mc-err: secure: yes, access-type: write

How can I resolve this error?

Sana Ur Rehman

On ARM platforms, disabling SMMU (i.e., the IOMMU) is not a solution, and the Linux community itself is against it. Hence, don’t disable SMMU.

Did you call ioremap() on the address before using it for DMA?

You can also use dma_alloc_coherent(), which allocates memory and provides two addresses for it: a virtual address to be used by the CPU, and a DMA address to be used by the DMA device.
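A minimal sketch of that call, assuming a PCI device `pdev` (the buffer size and variable names are illustrative only):

```c
#include <linux/dma-mapping.h>

/* Hypothetical sketch: dma_alloc_coherent() returns a CPU virtual
 * address and fills in a DMA (bus) address for the same buffer. */
dma_addr_t dma_handle;
void *cpu_addr;

cpu_addr = dma_alloc_coherent(&pdev->dev, SZ_4K, &dma_handle, GFP_KERNEL);
if (!cpu_addr)
	return -ENOMEM;

/* The CPU reads/writes through cpu_addr; the DMA engine is
 * programmed with dma_handle as its source or destination. */

dma_free_coherent(&pdev->dev, SZ_4K, cpu_addr, dma_handle);
```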

Thanks for the reply @b-sathishkumar .

I didn’t call ioremap(), but I did call pci_iomap(), which, as far as I know, does the same thing as ioremap(), but for PCI BAR.

However, calling pci_iomap() and then dma_map_single() or dma_alloc_coherent() results in a kernel panic.


Can anyone from NVIDIA comment on this please?


Custom end point hardware should have DMA support; get the DMA programming guidelines from the vendor. Then you have to write a client driver following those guidelines. This is not related to Tegra.


Hi @Manikanta , thanks for the reply.

My original question was regarding using DMA to write data to BAR0. Is this supported on Jetson AGX Xavier? This (DMA to PCIe BARs - #3 by vidyas) is for NX, and says that it isn’t supported. Is it the same for AGX Xavier? If it isn’t, what method does Xavier support to perform DMA when sending data from the AGX Xavier?

Also, how may I disable SMMU? (This is Tegra specific.) Is the method mentioned in comment #2 correct? Is disabling SMMU recommended by NVIDIA? I have seen several threads where disabling SMMU was marked as the solution.

I can perform DMA from end point to root port; that’s not the issue, and not what I’m asking about. My query is regarding using a DMA engine when sending data from the AGX Xavier to my end point. (I can send data easily using memcpy(), but would like to use the AGX Xavier’s DMA.)

I saw something about using ATU to do this, but ran into another issue, which I have asked support for, but got no reply yet (How to Access PCIe ATU Registers). I think the ATU issue is also tegra specific.

Kindly elaborate on the above please.

Sana Ur Rehman


To do a DMA transfer, the hardware should have a DMA engine. The thread below holds good only when you are using the Tegra PCIe controller in EP mode.

The PCIe spec doesn’t talk about a DMA engine, so the industry standard is to have the DMA engine as part of the EP.

If you are able to do DMA from EP to RP, then use the same DMA engine to transfer data from the Tegra RP to the EP using its DMA read functionality.

Disabling SMMU is not the solution here and we don’t recommend it. You need to implement DMA programming sequence.


Thanks for the clarification @Manikanta . Using the DMA read functionality will only allow us to run traffic in half-duplex mode. We would like to run full-duplex traffic if possible.

Is there no way to use NVIDIA’s DMA to write data to a PCIe BAR? Either directly, using the BAR address as the destination address, or by using the ATU to map the BAR region to a “dma-accessible” destination address? Is there some other DMA-supported way in Tegra to do a full-duplex transfer?

Also, I am confused by the sentence “PCIe root port has a DMA engine build into it”, which suggests there is an integrated DMA engine in the Tegra RP.

Sana Ur Rehman


There is no other way.

Why would EP run traffic in half duplex mode? Doesn’t it have separate channels for DMA write and read?


Thanks for the reply @Manikanta .

With regards to full duplex, I meant full duplex at the PCIe layer. A PCIe read generally consists of a read request, followed by the actual data, and then an acknowledgement packet. This means that while a read transaction is in progress, we can’t use the bus to generate another transaction, making the transfer effectively half duplex.

We were aiming to avoid this by generating only write transactions from both the root complex and the end point (by making the end point the bus master), thus allowing the PCIe bus to be used in full-duplex mode.

Can the above be done using Tegra RC?

Sana Ur Rehman


At the bus level, PCIe is full duplex: it has separate lanes for RX and TX, and in both directions we should hit line rate.


While this is common with mainstream off-the-shelf hardware, in some specialized scenarios it is not possible to rely on the presence of a DMA engine on the endpoint device. At least in the Orin Technical Reference Manual, the GPC-DMA section states that “GPC-DMA engine can copy data from any addressable memory to/from DRAM/SysRAM,” which a PCIe endpoint BAR fulfills as “any addressable memory.” If this doesn’t work as expected, the documentation should be updated to clarify the exclusions to “any addressable memory” vis-à-vis GPC-DMA.

Additionally, there is indeed a DMA engine inside the PCIe root port (see EDMA here), but it does not appear to have been made available via the main Linux DMA APIs and is used only for internal driver development and “root port DMA performance testing.”

There are a lot of questions on this community regarding DMAing data to/from a PCIe endpoint’s memory-mapped BAR using the Tegra/Orin SoC’s DMA facilities, and I really think it would benefit NVIDIA and the community at large to provide a working example using the root port’s EDMA facility.



Does the Xavier PCIe DMA have only two read channels and four write channels? How can I change the maximum number of channels?