R36.3 Patch to re-enable GICv2m for PCIe MSI interrupts and restore I/O performance

R36.x has a major performance regression relative to R35.x: all PCIe MSI interrupts are now routed to CPU 0, whereas they used to be distributed across all CPU cores. For any workload requiring high I/O bandwidth, e.g. multiple NVMe devices, this results in lower performance and system lockups because CPU 0 is overwhelmed with interrupts.
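As a quick check (a sketch, assuming a Linux shell on the Jetson; device names will vary per system), the per-CPU counters in /proc/interrupts make the imbalance visible:

```shell
# Show how MSI interrupts are spread across CPUs. On an affected R36.x
# system, every nvme row accumulates counts only in the CPU0 column.
grep -E 'CPU0|nvme' /proc/interrupts || true
```

With the interrupts distributed properly, the counts grow across all columns while I/O runs.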

There was an earlier thread on this issue but with no solution: [Orin NX R36.2 cannot change IRQ smp affinity - #7 by thegtx25]

I’ve ported the original code and adjusted it slightly so it doesn’t change kernel behaviour on other platforms. I also increased the number of available MSIs to the maximum of 352, which helps when dealing with large numbers of NVMe devices, since the NVMe driver allocates an MSI for each CPU core per device.
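To see why 352 vectors matter, a rough back-of-envelope check (a sketch; the drive count of 4 is an assumption, and nvme naming depends on your system):

```shell
# The Linux nvme driver creates roughly one I/O queue (and one MSI vector)
# per online CPU for each drive, plus an admin vector per drive.
cpus=$(nproc)
drives=4   # hypothetical: four NVMe devices attached
echo "approx MSI vectors needed: $(( cpus * drives + drives ))"

# Vectors currently allocated to NVMe on this system:
grep -c nvme /proc/interrupts || true
```

On a 12-core AGX Orin, four drives already want around 52 vectors for NVMe alone, so a shared pool of 70 across all PCIe endpoints is tight.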

I’m posting the patch here in case it’s useful for others, and in the hope that this patch or something based on it could be incorporated into future R36.x releases to address this problem.

It’s very easy to reproduce the issue on the AGX Orin Developer Kit. Simply attach two NVMe SSDs (I used Samsung 980 Pros, but any high-performance SSDs would do), one in M.2 slot C4 and the other in C5 with an adapter board.

Create fio configuration files for each device (replace nvme0 with nvme1 for the 2nd device):
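The job files themselves are attached rather than quoted; a sequential-read job along these lines should reproduce the workload (a sketch only - block size, queue depth and device path are my assumptions, not the exact attached configs):

```ini
; hypothetical fio job file - save as nvme0-read.fio and replace nvme0
; with nvme1 for the second device
[global]
ioengine=libaio
direct=1
rw=read
bs=1M
iodepth=32
numjobs=4
runtime=60
time_based=1
group_reporting=1

[nvme0-read]
filename=/dev/nvme0n1
```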



  1. Run fio to read from one device - performance was about 4GiB/s, with 80-90% of CPU 0’s time spent handling interrupts.
  2. Run a second instance of fio to read from the other device - CPU 0 locks up, ssh connections time out, the display freezes, GPU timeouts appear in the kernel logs and the system becomes unusable.
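CPU 0’s interrupt load during the test can be watched from /proc/stat (a sketch; field positions follow the standard procfs cpu-line layout):

```shell
# Print the time CPU 0 has spent servicing hardware interrupts, in USER_HZ
# ticks (field 7 of the cpu0 line: label user nice system idle iowait irq ...).
grep '^cpu0 ' /proc/stat | awk '{print "cpu0 hard-irq ticks:", $7}'
```

Sampling this a few seconds apart while fio runs shows the interrupt time climbing on CPU 0 only.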

With the patch, both fio processes read at 4GiB/s and the system is stable.

msi-patch-kernel.txt (15.7 KB)
msi-patch-dt.txt (4.4 KB)


Hi delwyn,

Do you mean that applying the 2 patches you shared could help with NVMe read performance?
Are you verifying this on the devkit or a custom board?

Hi Kevin,

Yes, the patches I shared fix this problem by re-introducing code from R35.5 that was removed in R36.x. In R36.x MSI interrupts are handled using the standard DesignWare support, but this means all PCIe MSI interrupts are mapped to a single SPI.

Unfortunately it isn’t as simple as just enabling kernel GICv2m support, because the Jetson has a non-standard MSI interrupt address translation as well. The original code in R35.x to handle the address translation was written in such a way that it would have broken other ARM platforms using GICv2m. In my patch I’ve modified it so it should only run when a second resource (for the MSI address translation) has been provided for GICv2m in the device tree, and the new Jetson-specific functions are wrapped in CONFIG_ARCH_TEGRA. Hopefully these changes will make the patch more acceptable upstream.
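For illustration, the device-tree shape the patch expects might look something like this (a sketch only - the node name, addresses and SPI base are placeholders, not values from the actual patch; only the 352 figure comes from the discussion above):

```dts
/* Standard GICv2m frame node, extended with a second reg entry that the
 * patched driver uses as the Jetson MSI address-translation window. */
v2m_0: v2m@1000 {
        compatible = "arm,gic-v2m-frame";
        msi-controller;
        reg = <0x0 0x1000 0x0 0x1000>,   /* v2m register frame (placeholder) */
              <0x0 0x2000 0x0 0x1000>;   /* translation target (placeholder) */
        arm,msi-base-spi = <32>;         /* placeholder */
        arm,msi-num-spis = <352>;
};
```

With only one reg entry present, the driver behaves as stock GICv2m, which is what keeps the change safe for other platforms.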

As I mentioned in my post, this can be reproduced and the fix verified on the AGX Orin Developer Kit with 2x NVMe SSDs.

Note this issue affects any PCIe device that generates a lot of interrupts (10GbE network cards, etc.), not just NVMe.

Hi delwyn,

Great to see you’ve managed to make a patch for this issue, which has kept us from upgrading to R36.x.

I’ll give it a try on my setup and keep you posted. Thank you!

Hi Kevin,

Any thoughts on the patches? What are the chances of the fix being incorporated in a future Jetson Linux release?


We need more time on this.

Tried the patches on Orin NX, which gave us back the performance we expect. Great work @delwyn, thank you!

However, we saw multiple instances of filesystem corruption while testing. We are unsure if this is related to the patches or to our SSDs failing. One major difference I could see is GIC_SPI_MSI_SIZE being increased from 70 to 352. Is this safe?

Hey @thegtx25 glad to hear you got the performance back!

Increasing GIC_SPI_MSI_SIZE is safe - the Jetson’s GIC-600AE implements 960 SPIs. The change is documented here:


We haven’t experienced the issue you describe, but it sounds worrying. Are you using the standard Linux NVMe driver? Any errors in the logs?