lspci
…
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:04.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:0c.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:02:1c.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:03:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:10.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:14.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:18.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:04:1c.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:07:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
0005:09:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:0a:00.0 PCI bridge: LSI Logic / Symbios Logic Device c010 (rev b0)
0005:0b:00.0 RAID bus controller: HighPoint Technologies, Inc. Device 7505 (rev 01)
…
uname -a
Linux test-desktop 4.9.201-tegra #1 SMP PREEMPT Fri Jul 9 08:56:59 PDT 2021 aarch64 aarch64 aarch64 GNU/Linux
Issue:
When running fio against the NVMe RAID array(s) attached to the SSD7505:
fio --filename=/dev/hptblock2n2p --direct=1 --rw=read --ioengine=libaio --bs=128k --iodepth=64 --numjobs=8 --runtime=4h --time_base=1 --group_reporting --name=read
The kernel occasionally reports the following messages:
[75223.684557] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4ece40000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=8543a4003, pte=700007ee920f47
[75223.691068] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f60a0000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=854328003, pte=700007e5ca0f47
[75224.403069] irq 102: nobody cared (try booting with the "irqpoll" option)
[75224.403219] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 4.9.201-tegra #1
[75224.403339] Hardware name: Jetson-AGX (DT)
[75224.403407] Call trace:
[75224.403458] [] dump_backtrace+0x0/0x198
[75224.403546] [] show_stack+0x24/0x30
[75224.403631] [] dump_stack+0xa0/0xc8
[75224.403720] [] __report_bad_irq+0x3c/0xf8
[75224.403811] [] note_interrupt+0x2c8/0x318
[75224.403902] [] handle_irq_event_percpu+0x50/0x60
[75224.404000] [] handle_irq_event+0x50/0x80
[75224.404090] [] handle_fasteoi_irq+0xd4/0x1c0
[75224.404184] [] generic_handle_irq+0x34/0x50
[75224.404593] [] __handle_domain_irq+0x68/0xc0
[75224.405052] [] gic_handle_irq+0x5c/0xb0
[75224.405471] [] el1_irq+0xe8/0x194
[75224.405843] [] irq_exit+0xd0/0x118
[75224.406219] [] __handle_domain_irq+0x6c/0xc0
[75224.411000] [] gic_handle_irq+0x5c/0xb0
[75224.416078] [] el1_irq+0xe8/0x194
[75224.421152] [] cpuidle_enter_state+0xb8/0x380
[75224.426753] [] cpuidle_enter+0x34/0x48
[75224.432267] [] call_cpuidle+0x44/0x70
[75224.437685] [] cpu_startup_entry+0x1b0/0x200
[75224.443296] [] rest_init+0x84/0x90
[75224.448106] [] start_kernel+0x370/0x384
[75224.453616] [] __primary_switched+0x80/0x94
[75224.459471] handlers:
[75224.461750] [] tegra_mcerr_hard_irq threaded [] tegra_mcerr_thread
[75224.471370] Disabling IRQ #102
[75224.474511] mc-err: vpr base=0:c6000000, size=20, ctrl=3, override:(a01a8340, fcee10c1, 1, 0)
[75224.482709] mc-err: (255) csw_pcie5w: MC request violates VPR requirements
[75224.489742] mc-err: status = 0x0ff740e3; addr = 0xffffffff00; hi_adr_reg=008
[75224.496956] mc-err: secure: yes, access-type: write
[84371.829025] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f2d47000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=84b6d1003, pte=700007cabf7f47
[84371.829770] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f24c0000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=801685003, pte=700007d28f0f47
[84371.830635] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4e9764000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=8007fc003, pte=700007cad14f47
[84371.831143] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f29a0000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=828bd1003, pte=700007cac30f47
[84371.833004] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4eca60000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=8543aa003, pte=70000806070f47
[84371.834663] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0x4f4980000, fsynr=0x180013, cb=3, sid=91(0x5b - PCIE5), pgd=85569a003, pud=85569a003, pmd=854312003, pte=700007de090f47
When these errors occur, the DMA transfers do not fail and fio keeps running normally.
When we previously developed the driver on x86-64 systems, a DMA operation would fail if the driver did not map the DMA address correctly.
Because the context faults are infrequent and the number of DMA operations is massive, it is very difficult to locate the issue in the driver.
So, is there any way to make a DMA operation fail when the IOMMU reports a fault?
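In the meantime, to help correlate the faults with the driver's own mapping logs, the context-fault lines can be pulled out of a saved dmesg capture. The following is only a minimal sketch: the field layout is taken from the log lines above, the function name `parse_faults` is made up for illustration, and treating bit 4 of `fsynr` as the write-not-read flag is an assumption based on the ARM SMMUv2 FSYNR0 layout, not something confirmed for this platform.

```python
import re

# Assumption: bit 4 of FSYNR0 is WNR (write-not-read) as in ARM SMMUv2.
FSYNR0_WNR = 1 << 4

# Matches the "Unhandled context fault" lines in the format shown above.
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*?Unhandled context fault:.*?"
    r"iova=(?P<iova>0x[0-9a-fA-F]+),\s*"
    r"fsynr=(?P<fsynr>0x[0-9a-fA-F]+),\s*"
    r"cb=(?P<cb>\d+),\s*"
    r"sid=(?P<sid>\d+)"
)


def parse_faults(lines):
    """Return one dict per SMMU context-fault line; other lines are skipped."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m is None:
            continue
        fsynr = int(m.group("fsynr"), 16)
        faults.append({
            "ts": float(m.group("ts")),          # kernel timestamp in seconds
            "iova": int(m.group("iova"), 16),    # faulting I/O virtual address
            "write": bool(fsynr & FSYNR0_WNR),   # True if the access was a write
            "cb": int(m.group("cb")),            # context bank index
            "sid": int(m.group("sid")),          # stream ID (decimal form)
        })
    return faults
```

For example, `parse_faults(open("dmesg.txt"))` (file name hypothetical) yields the faulting IOVAs, which can then be compared against the addresses the driver obtained from its DMA mapping calls around the same timestamps.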