IOMMU: Unhandled context fault

gilson.varghese · September 23, 2019, 5:28am

I have a TX2 board connected to a Xilinx device. I have written a DMA driver that sends a 32 MB buffer from user space using scatter-gather mapping. When I run the application to send the buffer, I get the following error:

arm-smmu 12000000.iommu: Unhandled context fault: iova=0x17fe0fe00, fsynr=0x220003, cb=21, sid=17(0x11 - AFI), pgd=25bec9003, pud=25bec9003, pmd=235086003, pte=0

I checked some previous posts, and it seems that the iova (0x17fe0fe00) is outside the allocated range. What could be the reason, and how do I resolve it?

vidyas · September 23, 2019, 6:37am

Please make sure that your driver adheres to the standard Linux PCIe device driver model. In particular, all DMA allocations and mappings must go through the DMA API.
Please refer to:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
https://www.kernel.org/doc/Documentation/DMA-API.txt

gilson.varghese · September 23, 2019, 9:09am

Hi Vidya,

Thank you for your reply. I followed the documentation you mentioned while developing the driver. I have experience in PCIe DMA driver development.

For your information, the driver works on an x86 PC running Ubuntu. It does not work on the TX2 board, probably because of the ARM64 architecture, the SMMU, etc. We have already solved some other ARM-related problems, such as cache coherency. Since this one seems to be related to the TX2 SMMU architecture, I request your guidance in solving it. I am also trying to debug it with the help of the TRM and instrumentation.

There are a few points I want to bring to your notice:

The problem goes away if I disable the SMMU. However, I want to get DMA working with the SMMU enabled.
The problem occurs randomly. Sometimes the context fault doesn't show up after I reboot the board. When the fault happens, the sent and received DMA buffers in my DMA loop-back program do not match.
The problem doesn't occur on x86, which suggests the issue is not with the Xilinx PCIe IP/device I am using.
I see the issue in many other posts. If possible, could you please try to replicate it on your side?
Could you please send the reference DMA code that you may have used for testing the SMMU with L4T 32+? That would be really helpful.

Thanks in advance,
Gilson Varghese

vidyas · September 23, 2019, 10:00am

So far, we have not seen this issue with any upstreamed drivers (xhci_hcd, nvme, r8169, igb, e1000e, ixgbe, etc.). Would it be possible to share the driver with us privately?

RogerE · September 27, 2019, 6:03pm

Hello Gilson,

I was wondering if you have learned anything about the “Unhandled context fault” problem that you can share? Our problem is similar in that our driver works with our hardware on x86, but faults on both the TX2 and Xavier (currently testing with JP 4.2.2 rev1). We have also disabled the SMMU as a workaround. Our situation differs in that we use a single, large, long-lived, circular, memory-mapped buffer to receive a constant flow of data via DMA.

Any tips on how to adapt to the Jetson environment would be greatly appreciated.

Thanks,
RogerE

gilson.varghese · October 1, 2019, 11:47am

Hi Vidya,

I have shared the code privately. I hope you get a chance to look at it.

Hi RogerE,

We were able to solve the issue by disabling ASPM. Add

pcie_aspm=off

to the kernel command line (the APPEND line) in /boot/extlinux/extlinux.conf, then reboot.

Please try this and check if it solves the issue.

Regards,
Gilson

RogerE · October 1, 2019, 3:59pm

Hi Gilson,

Thanks. Yes, we discovered it was necessary to disable ASPM. The PCIe card we’re using does not run reliably with ASPM enabled. The symptom we had was random “pcieport” errors in the dmesg log. I can’t find a copy of the specific message right now. Even after we disabled ASPM, we still got the unhandled context faults. We are currently running with the SMMU disabled to avoid the faults until we can fix our driver.

Thanks again,
RogerE

jeba.anandhan · January 24, 2020, 5:35pm

Do you still have the issue?

I don’t think it is a range issue. Basically, the context fault implies that the SMMU is unable to find the TLB context for that address. The SMMU translation logic tries to find a valid context before it dives into the translation stages.

FSYNR is a hint about the error; it gives you details about the SMMU failure.

RogerE · January 24, 2020, 6:09pm

Hello jeba.anandhan,

Yes, we still have the unhandled context fault if the SMMU is not disabled. I’ve been on another project and haven’t followed up recently. I’d really like to fix this eventually. It doesn’t make a good impression on our users when the PC installation is so simple compared to the Xavier one.

These two posts discuss the problems we’ve had:

https://devtalk.nvidia.com/default/topic/1049733/jetson-tx2/ubuntu-pcie-driver-port-to-l4t-gets-unhandled-context-fault/post/5327446/#5327446

https://devtalk.nvidia.com/default/topic/1060880/pcie-dma-driver-compatibility-with-xavier-smmu-iommu-/?offset=2#5373858

Any information related to these problems would be appreciated.

Thanks,
RogerE