TX2 kernel 4.4.38 PCIe causing IRQ55 error, but not on TX1 with same kernel

Hello,

I am working with a PCIe video capture card that has an open source driver. I have been running this card and driver on TX1 with JetPack 3.3, kernel 4.4.38, with no issues. This week I switched to TX2 for more performance. Using the same driver I ran into SMMU errors, so I disabled the SMMU for PCIe by removing <&{/pcie-controller@10003000} TEGRA_SID_AFI> from the device tree, as recommended on this forum. I can now probe the card and communicate with it, but when I start streaming with DMA I get the following in dmesg, and the video either does not stream or the frame rate is extremely slow and the data is corrupt. All other dmesg output related to the driver and hardware matches the working TX1 system.

I learned from this forum that IRQ55 is a memory controller error, but I have no idea how to debug it. Can anyone help?

After searching the internet, I wonder if it's possible that TX1 and TX2 have different memory spaces for DMA, which could cause my older chip’s DMA controller to fail because of a 32-bit addressing limitation. Can I restrict the TX2’s memory space?
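
To make the 32-bit question concrete, here is a minimal user-space sketch of the check I have in mind. It is not driver code. The helper names dma_bit_mask and dma_range_fits are hypothetical; only the mask arithmetic is meant to mirror the kernel's DMA_BIT_MASK(n) macro. In the driver itself, the corresponding request would presumably be dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)) in the probe function, but I have not confirmed that this resolves the error on TX2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low `bits` bits set, the same arithmetic as the kernel's DMA_BIT_MASK(n). */
static uint64_t dma_bit_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/*
 * Hypothetical helper: true if the whole buffer [addr, addr + len)
 * can be addressed by a device limited to `bits` address bits.
 */
static bool dma_range_fits(uint64_t addr, uint64_t len, unsigned bits)
{
    uint64_t mask = dma_bit_mask(bits);

    if (len == 0 || addr > mask)
        return false;
    /* Written this way to avoid overflow in addr + len. */
    return (len - 1) <= (mask - addr);
}
```

For example, dma_range_fits(0x100000000ULL, 0x1000, 32) is false, because a buffer starting at 4 GiB cannot be reached by a 32-bit DMA engine, while a buffer at 0x80000000 of the same size passes. If the TX2 hands the driver buffers above 4 GiB and the card can only drive 32 address bits, that would be one possible explanation for the failing writes.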

[  157.199888] irq 55: nobody cared (try booting with the "irqpoll" option)
[  157.206587] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.38 #10
[  157.212582] Hardware name: quill (DT)
[  157.216234] Call trace:
[  157.218681] [<ffffffc000089860>] dump_backtrace+0x0/0x100
[  157.224069] [<ffffffc000089a28>] show_stack+0x14/0x1c
[  157.229112] [<ffffffc0003159f8>] dump_stack+0x98/0xc0
[  157.234154] [<ffffffc0000f7e4c>] __report_bad_irq+0x38/0xe0
[  157.239716] [<ffffffc0000f81a4>] note_interrupt+0x1f4/0x2b4
[  157.245278] [<ffffffc0000f56c8>] handle_irq_event_percpu+0xfc/0x290
[  157.251531] [<ffffffc0000f58a0>] handle_irq_event+0x44/0x74
[  157.257092] [<ffffffc0000f8ba8>] handle_fasteoi_irq+0xb4/0x188
[  157.262911] [<ffffffc0000f4c70>] generic_handle_irq+0x24/0x38
[  157.268644] [<ffffffc0000f4f78>] __handle_domain_irq+0x60/0xb4
[  157.274465] [<ffffffc000081774>] gic_handle_irq+0x5c/0xb4
[  157.279852] [<ffffffc000084740>] el1_irq+0x80/0xf8
[  157.284635] [<ffffffc0000a91b0>] irq_exit+0x84/0xdc
[  157.289502] [<ffffffc0000f4f84>] __handle_domain_irq+0x6c/0xb4
[  157.295322] [<ffffffc000081774>] gic_handle_irq+0x5c/0xb4
[  157.300708] [<ffffffc000084740>] el1_irq+0x80/0xf8
[  157.305491] [<ffffffc0007b8e4c>] cpuidle_enter+0x18/0x20
[  157.310793] [<ffffffc0000e8354>] call_cpuidle+0x28/0x50
[  157.316006] [<ffffffc0000e84f8>] cpu_startup_entry+0x17c/0x340
[  157.321827] [<ffffffc000a6a898>] rest_init+0x84/0x8c
[  157.326783] [<ffffffc000f02980>] start_kernel+0x3a0/0x3b4
[  157.332169] [<0000000080a71000>] 0x80a71000
[  157.336342] handlers:
[  157.338611] [<ffffffc00084e200>] tegra_mcerr_hard_irq threaded [<ffffffc00084e24c>] tegra_mcerr_thread
[  157.347918] Disabling IRQ #55
[  157.351022] (255) csw_afiw: EMEM address decode error
[  157.356115]   status = 0x20010031; addr = 0x00000000
[  157.361114]   secure: no, access-type: write

When I run sudo lspci -vvv on TX1 vs TX2, I see this difference:
TX1

Region 0: Memory at 13200000 (32-bit, prefetchable) 

TX2

Region 0: Memory at 50300000 (32-bit, prefetchable) 

Any input from the NVIDIA team?