Unhandled context fault when running fio over SSD7505 on Jetson AGX Xavier

lspci


0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:04.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:0c.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:1c.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:03:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:10.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:14.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:18.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:1c.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:09:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:0a:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:0b:00.0 RAID bus controller: HighPoint Technologies, Inc. Device 7505 (rev 01)

uname -a

Linux test-desktop 4.9.201-tegra #1 SMP PREEMPT Fri Jul 9 08:56:59 PDT 2021 aarch64 aarch64 aarch64 GNU/Linux

Issue:

When running fio over NVMe RAID(s) attached to SSD7505:

fio --filename=/dev/hptblock2n2p --direct=1 --rw=read --ioengine=libaio --bs=128k --iodepth=64 --numjobs=8 --runtime=4h --time_base=1 --group_reporting --name=read

The kernel occasionally reports the following messages:
[75223.684557] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4ece40000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=8543a4003, pte=700007ee920f47
[75223.691068] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f60a0000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=854328003, pte=700007e5ca0f47
[75224.403069] irq 102: nobody cared (try booting with the "irqpoll" option)
[75224.403219] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.9.201-tegra #1
[75224.403339] Hardware name: Jetson-AGX (DT)
[75224.403407] Call trace:
[75224.403458] [] dump_backtrace+0x0/0x198
[75224.403546] [] show_stack+0x24/0x30
[75224.403631] [] dump_stack+0xa0/0xc8
[75224.403720] [] __report_bad_irq+0x3c/0xf8
[75224.403811] [] note_interrupt+0x2c8/0x318
[75224.403902] [] handle_irq_event_percpu+0x50/0x60
[75224.404000] [] handle_irq_event+0x50/0x80
[75224.404090] [] handle_fasteoi_irq+0xd4/0x1c0
[75224.404184] [] generic_handle_irq+0x34/0x50
[75224.404593] [] __handle_domain_irq+0x68/0xc0
[75224.405052] [] gic_handle_irq+0x5c/0xb0
[75224.405471] [] el1_irq+0xe8/0x194
[75224.405843] [] irq_exit+0xd0/0x118
[75224.406219] [] __handle_domain_irq+0x6c/0xc0
[75224.411000] [] gic_handle_irq+0x5c/0xb0
[75224.416078] [] el1_irq+0xe8/0x194
[75224.421152] [] cpuidle_enter_state+0xb8/0x380
[75224.426753] [] cpuidle_enter+0x34/0x48
[75224.432267] [] call_cpuidle+0x44/0x70
[75224.437685] [] cpu_startup_entry+0x1b0/0x200
[75224.443296] [] rest_init+0x84/0x90
[75224.448106] [] start_kernel+0x370/0x384
[75224.453616] [] __primary_switched+0x80/0x94
[75224.459471] handlers:
[75224.461750] [] tegra_mcerr_hard_irq threaded [] tegra_mcerr_thread
[75224.471370] Disabling IRQ #102
[75224.474511] mc-err: vpr base=0:c6000000, size=20, ctrl=3, override:(a01a8340, fcee10c1, 1, 0)
[75224.482709] mc-err: (255) csw_pcie5w: MC request violates VPR requirements
[75224.489742] mc-err: status = 0x0ff740e3; addr = 0xffffffff00; hi_adr_reg=008
[75224.496956] mc-err: secure: yes, access-type: write
[84371.829025] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f2d47000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=84b6d1003, pte=700007cabf7f47
[84371.829770] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f24c0000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=801685003, pte=700007d28f0f47
[84371.830635] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4e9764000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=8007fc003, pte=700007cad14f47
[84371.831143] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f29a0000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=828bd1003, pte=700007cac30f47
[84371.833004] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4eca60000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=8543aa003, pte=70000806070f47
[84371.834663] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f4980000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=854312003, pte=700007de090f47

When these errors occur, the DMA transfer does not fail and fio keeps running normally.
When we previously developed the driver on x86-64 systems, the DMA operation would fail if the driver did not map a DMA address correctly.
Because the context faults are infrequent and the number of DMA operations is massive, it is very difficult to locate the issue in the driver.
So, is there any way to make DMA fail when an IOMMU fault occurs?

Sorry for the late response. Have you managed to get the issue resolved, or do you still need support? Thanks.

Hi, yes, we still need support.
Thanks.

No, this is ARM behavior. You can add a global flag and set it as soon as an SMMU fault is hit; in your driver, stop DMA submission when the flag is set.
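
For illustration, the flag idea could look roughly like the sketch below. It is a user-space model only, not code from the SSD7505 driver or the Tegra kernel: the names smmu_fault_seen, note_smmu_fault() and submit_dma() are hypothetical, and how the fault notification actually reaches the driver is not specified in this thread.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical global flag: false until an SMMU fault has been observed.
 * Static storage is zero-initialized, so it starts out as false. */
static atomic_bool smmu_fault_seen;

/* Hypothetical hook: call this from wherever the fault is detected. */
void note_smmu_fault(void)
{
    atomic_store(&smmu_fault_seen, true);
}

/* Hypothetical submission path: refuse new DMA once a fault was seen. */
int submit_dma(unsigned long long iova, unsigned int len)
{
    if (atomic_load(&smmu_fault_seen))
        return -EIO;   /* stop submitting so the failure becomes visible */

    /* A real driver would build and queue the DMA descriptor here. */
    (void)iova;
    (void)len;
    return 0;
}
```

With this model, submit_dma() returns 0 until note_smmu_fault() has been called and -EIO afterwards. In an actual kernel driver the C11 atomics would presumably be replaced by kernel primitives such as atomic_t or READ_ONCE()/WRITE_ONCE().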

Sorry for the delay.
The driver's I/O did not fail when the SMMU fault was hit, so it is not possible to set the flag. The SMMU fault was only found when running the "dmesg" command.
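
Since the faults are only visible in the kernel log, one possible way to narrow things down is to extract the faulting IOVA from each log line and compare it with the addresses the driver mapped. The sketch below is an assumption-laden helper, not part of any NVIDIA or HighPoint tool: struct smmu_fault and parse_smmu_fault() are hypothetical names, the field order is taken from the log lines quoted above, and the read/write decoding assumes the ARM SMMUv2 layout in which bit 4 of FSYNR0 is the write-not-read bit.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one "Unhandled context fault" log line. */
struct smmu_fault {
    unsigned long long iova;   /* faulting I/O virtual address */
    unsigned long fsynr;       /* fault syndrome register value */
    unsigned int cb;           /* context bank number */
    unsigned int sid;          /* stream ID (decimal in the log) */
    bool is_write;             /* assumed: FSYNR0 bit 4 = write-not-read */
};

/* Returns true and fills *out if the line is an SMMU context fault. */
bool parse_smmu_fault(const char *line, struct smmu_fault *out)
{
    const char *p = strstr(line, "Unhandled context fault:");
    if (!p)
        return false;

    p = strstr(p, "iova=");
    if (!p)
        return false;

    /* %llx and %lx accept the leading "0x" used in the log. */
    if (sscanf(p, "iova=%llx, fsynr=%lx, cb=%u, sid=%u",
               &out->iova, &out->fsynr, &out->cb, &out->sid) != 4)
        return false;

    out->is_write = (out->fsynr & (1UL << 4)) != 0;
    return true;
}
```

Fed the first fault line from the log above, this yields iova 0x4ece40000, cb 3 and sid 91, and flags the access as a write, which matches the "access-type: write" line in the mc-err output.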

Hi,

You have to explicitly stop the DMA when an SMMU fault is observed.

Thanks,
Manikanta
