Jetson Orin NX: binding a NIC interrupt to another CPU fails

Problem description
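On our system, the interrupts of the Wi-Fi and of all the wired NICs currently land on CPU0. We want to bind the interrupt of one PCI wired NIC to another CPU, for example CPU1,
but doing so with echo 1 > /proc/irq/<irq number>/smp_affinity_list fails. Checking the chip type of the IRQ: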
~$ cat /sys/kernel/irq/252/chip_name
PCI-MSI
Given that the chip is PCI-MSI, we looked at drivers/pci/controller/dwc/pcie-designware-host.c, which contains:
static int dw_pci_msi_set_affinity(struct irq_data *d,
                                   const struct cpumask *mask, bool force)
{
        return -EINVAL;
}
This function is not implemented; it always returns -EINVAL.
Our questions:
~$ cat /proc/irq/252/smp_affinity_list
0-5
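The affinity list shows 0-5, but in practice the interrupt is never balanced to any other CPU, even under very high load. The effective affinity shows: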
/proc/irq/252$ cat effective_affinity
00
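Is tegra_msi_set_affinity not implemented because our Jetson Orin NX hardware does not support it, or for some other reason?
Our kernel is 5.15.148-rt-tegra #90 SMP PREEMPT_RT. Would doing this carry any risk on an RT kernel?
Is there another way on the Jetson Orin NX platform to bind this PCI NIC's interrupt to another CPU? Thanks.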

Error code
sudo bash -c "echo 1 > /proc/irq/252/smp_affinity_list"
bash: line 1: echo: write error: Invalid argument

Error log
Checking the PCI NIC's interrupt number. We want to move the interrupt of enP8p1s0-0 from CPU0 to another CPU:
~$ cat /proc/interrupts | grep -E 'enP8p1s0|CPU'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
252: 4509796 0 0 0 0 0 PCI-MSI 1074266112 Edge enP8p1s0-0
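The current affinity of IRQ 252 is 0-5. Although it says 0-5, the interrupt is never balanced to other CPUs, even under high load or when CPU0 is flooded with interrupts from the various NICs: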
~$ cat /proc/irq/252/smp_affinity_list
0-5
Trying to bind the interrupt to another CPU, for example CPU1, fails:
sudo bash -c "echo 1 > /proc/irq/252/smp_affinity_list"
bash: line 1: echo: write error: Invalid argument

*** Please note that this reply is generated by LLM automatically ***

The provided text appears to be a series of commands and outputs related to the configuration and management of interrupts on a Jetson Orin NX device. The commands are attempting to set the affinity of a specific interrupt (IRQ 252) to a particular CPU core.

Here’s a breakdown of the commands and their outputs:

  1. echo 1 > /proc/irq/252/smp_affinity_list: This command attempts to set the affinity of IRQ 252 to CPU 1. However, the output indicates that the command fails with an “Invalid argument” error.

  2. cat /proc/irq/252/smp_affinity_list: This command displays the current affinity of IRQ 252, which is 0-5, i.e. CPUs 0 through 5.

  3. cat /sys/kernel/irq/252/chip_name: This command displays the chip name associated with IRQ 252, which is “PCI-MSI”.

  4. cat /sys/kernel/irq/252/pcimsi/driver/pci/controller/pci-tegra-egr/pcie-tlp: This command displays information about the PCIe controller associated with IRQ 252.

  5. sudo bash -c 'echo 1 > /proc/irq/252/smp_affinity_list': This command attempts to set the affinity of IRQ 252 to CPU 1 using sudo. However, the output still indicates an “Invalid argument” error.

  6. cat /proc/interrupts | grep -E 'enP8p1s0|CPU': This command displays the interrupts associated with the enP8p1s0 device. The output shows that IRQ 252 belongs to this device and has so far been handled only by CPU 0.

  7. cat /proc/irq/252/smp_affinity_list: This command again displays the current affinity of IRQ 252, which is still 0-5 (CPUs 0 through 5).

The error message “Invalid argument” suggests that the command to set the affinity of IRQ 252 to CPU 1 is not valid or is not supported by the system.

Taken together, the outputs show that IRQ 252 is nominally allowed on all CPU cores (0-5), yet all of its interrupts have been handled on CPU 0, and attempts to restrict it to a single core (CPU 1) fail with an “Invalid argument” error.



PCIe MSI IRQ is disabled in JetPack 6. If you want to use it, you need to enable it by applying the patch below.

Many thanks for your answer. After applying the patch, we were able to set the IRQ affinity and have been testing it for several days under different scenarios. We still have some questions, though. We are on an RT kernel, version 5.15.148-rt-tegra PREEMPT_RT. Would this patch have any negative impact on the RT kernel? This code was present in R35 but has been removed starting from R36. Do you know why it was removed? Was it taken out because of any particular impact or issue?

However, one test scenario causes problems. We SSH into the system through the PCIe network port (the one whose interrupt parent was changed to gic_v2m). Before the modification, I could still work over SSH even while running the command
chrt -f 90 taskset -c 0 stress --cpu 1 --timeout 3600s, and I could interrupt it with Ctrl+C. After applying the patch, however, that command seems to hang all IRQs completely: SSH over the PCIe network port becomes totally unresponsive, and the serial console prints a call trace related to a Wi-Fi hang.

After applying the patch, the interrupt type changed from PCI MSI to MSI, and its parent was set to GIC-V2M. Could these changes have affected interrupt responsiveness? Thank you.

Is this still an issue that needs support? Are there any results you can share?

Hi Kayccc,

After applying the patch, we were able to set the IRQ affinity and have been testing it for several days under different scenarios. We still have some questions, though. We are on an RT kernel, version 5.15.148-rt-tegra PREEMPT_RT. Would this patch have any negative impact on the RT kernel? This code was present in R35 but has been removed starting from R36. Do you know why it was removed? Was it taken out because of any particular impact or issue? We are considering whether to apply this patch in our production environment and want to know whether it will have any other impact, so we would appreciate your suggestions on the questions above. Thank you.

Hi,
On JetPack 6, the kernel is aligned with upstream kernel 5.15, and these patches are not in upstream. So if you need this function, you will need to apply the patches manually.

It is expected to work the same way on the RT kernel, although we do not validate this setup in our SQA coverage. We suggest doing thorough verification on the RT kernel.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.