JP5.0.2: NVGPU error on system shutdown

Hi,

On JP5.0.2, the system reports the following error messages during shutdown:

Ubuntu 20.04.4 LTS test-desktop ttyTCU0

test-desktop login: [ 46.536227] Trying to unregister non-registered hwtime source
[ 49.779699] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gp10b_gr_init_wait_empty+0x168/0x2a0 [nvgpu]
[ 49.780110] nvgpu: 17000000.gv11b gp10b_gr_init_wait_empty:99 [ERR] timeout, ctxsw busy : 0, gr busy : 1, badf1301, badf1301, badf1301, badf1301
[ 49.780653] nvgpu: 17000000.gv11b nvgpu_quiesce:1298 [ERR] failed to prepare for poweroff, err=-11
[ 49.781015] arm-smmu 12000000.iommu: disabling translation
[ 49.781364] arm-smmu 10000000.iommu: disabling translation
[ 49.811239] migrate_one_irq: 8 callbacks suppressed
[ 49.811245] IRQ282: set affinity failed(-22).
[ 49.811473] IRQ283: set affinity failed(-22).
[ 49.811574] IRQ284: set affinity failed(-22).
[ 49.811675] IRQ285: set affinity failed(-22).
[ 49.811792] IRQ286: set affinity failed(-22).
[ 49.811897] IRQ287: set affinity failed(-22).
[ 49.812012] IRQ288: set affinity failed(-22).
[ 49.812112] IRQ289: set affinity failed(-22).
[ 49.813186] CPU1: shutdown
[ 49.831163] IRQ282: set affinity failed(-22).
[ 49.831308] IRQ283: set affinity failed(-22).
[ 49.831831] CPU2: shutdown
[ 49.851107] CPU3: shutdown
[ 49.853795] reboot: Power down
▒▒Shutdown state requested 0
Shutting down system …

Could you please help check what is causing the NVGPU error?

Is any application running on the device?
Is it a devkit or a custom carrier board?

Does this cause any fatal behavior?

It’s on our custom carrier board. I will try it on the devkit.

Hi Wayne

There’s no fatal error, and system shutdown works. The only symptom is the NVGPU error messages on the serial console during shutdown.

Hi Kayccc,

I have tried it on the devkit, and the same error messages appear there.

These are the messages on the Jetson Xavier NX devkit:
[ 403.439766] Trying to unregister non-registered hwtime source
[ 407.553809] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gp10b_gr_init_wait_empty+0x168/0x2a0 [nvgpu]
[ 407.554221] nvgpu: 17000000.gv11b gp10b_gr_init_wait_empty:99 [ERR] timeout, ctxsw busy : 0, gr busy : 1, badf1301, badf1301, badf1301, badf1301
[ 407.554761] nvgpu: 17000000.gv11b nvgpu_quiesce:1298 [ERR] failed to prepare for poweroff, err=-11
[ 407.555078] arm-smmu 12000000.iommu: disabling translation
[ 407.555428] arm-smmu 10000000.iommu: disabling translation
[ 407.586504] CPU1: shutdown
[ 407.606387] CPU2: shutdown
[ 407.626338] CPU3: shutdown
[ 407.630074] reboot: Power down
▒▒▒▒Shutdown state requested 0
Shutting down system …
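
For anyone comparing their own logs against the two captures above, here is a minimal Python sketch that pulls the nvgpu [ERR] lines out of a saved console log and decodes any err=<n> value. The function name, the regular expressions, and the small errno table are my own illustration, not part of any NVIDIA tool; the table only lists the two Linux errno numbers that show up in this thread (11 is EAGAIN, 22 is EINVAL).

```python
import re

# Linux errno numbers seen in this thread (generic errno-base.h numbering).
LINUX_ERRNO = {11: "EAGAIN", 22: "EINVAL"}

# Assumed line layout: "[ <time>] nvgpu: <device> <function>:<line> [ERR] <message>"
NVGPU_ERR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+(?P<dev>\S+)\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)"
)
ERR_CODE = re.compile(r"err=(-?\d+)")


def nvgpu_errors(log_text):
    """Return one dict per nvgpu [ERR] line, decoding err=<n> when present."""
    found = []
    for raw in log_text.splitlines():
        m = NVGPU_ERR.search(raw)
        if not m:
            continue
        entry = {"time": float(m["ts"]), "func": m["func"], "msg": m["msg"].strip()}
        code = ERR_CODE.search(m["msg"])
        if code:
            number = abs(int(code.group(1)))
            entry["errno"] = LINUX_ERRNO.get(number, str(number))
        found.append(entry)
    return found


# Lines copied from the devkit log above.
SAMPLE = """\
[ 407.553809] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gp10b_gr_init_wait_empty+0x168/0x2a0 [nvgpu]
[ 407.554221] nvgpu: 17000000.gv11b gp10b_gr_init_wait_empty:99 [ERR] timeout, ctxsw busy : 0, gr busy : 1, badf1301, badf1301, badf1301, badf1301
[ 407.554761] nvgpu: 17000000.gv11b nvgpu_quiesce:1298 [ERR] failed to prepare for poweroff, err=-11
[ 407.555078] arm-smmu 12000000.iommu: disabling translation
"""

for e in nvgpu_errors(SAMPLE):
    print(e["time"], e["func"], e.get("errno", "-"))
```

This only classifies the messages. It says nothing about why the GR engine still reports busy when the wait times out.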

I am curious if you’ve installed any realtime app or extension or patch? Or if you’ve manually set any kind of IRQ affinity?

Hi Linuxdev

No, no realtime patch is applied to the current kernel. Those IRQ numbers belong to PCI-MSI; according to /proc/interrupts, they are the IRQs of the Ethernet cards on our custom board. Why do you think it’s related to a realtime patch? Thanks.

Statistically, if someone is working with affinity, I’d say they’re likely trying to reduce latency. I didn’t know, it just brought up the possibility. The original log shows several of these:

[ 49.811245] IRQ282: set affinity failed(-22).

It does make me curious about whether the attempt to set affinity is “standard” for the driver, or if it was something specific to the Jetson? Partially I ask this because much of the hardware on a Jetson is only able to direct a hardware interrupt to the first CPU; many hardware devices (this does not apply to software drivers), if told to use a CPU other than the first CPU, will migrate back to the first CPU when they can’t reach the other CPU. Setting affinity of such a device to a non-first-CPU might either be ignored or else result in some unknown error in the kernel. Don’t know, but I do wonder about whether the affinity is trying to use an unavailable core.

Yes, hardware interrupts are normally bound to core 0 by default. On my board, IRQ282 is the I210 Ethernet RX-TX interrupt (an extra Ethernet port behind a PCIe hub), and it is bound to CPU0 as follows:

282: 14 0 0 0 0 0 PCI-MSI 538968066 Edge eth1-TxRx-1
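
In case it helps others check the same thing, a rough Python sketch for reading a /proc/interrupts row like the one above and listing which CPUs actually serviced the IRQ. The function names are hypothetical, and the parser assumes the per-CPU counters are the run of purely numeric fields right after the IRQ number, ending at the chip name (here PCI-MSI).

```python
def parse_interrupt_line(line):
    """Split one /proc/interrupts row into IRQ, per-CPU counts, chip and name."""
    tokens = line.split()
    irq = tokens[0].rstrip(":")
    counts = []
    for tok in tokens[1:]:
        if not tok.isdigit():
            break  # first non-numeric field is assumed to be the chip name
        counts.append(int(tok))
    rest = tokens[1 + len(counts):]
    return {
        "irq": irq,
        "per_cpu": counts,
        "chip": rest[0] if rest else "",
        "name": rest[-1] if rest else "",
    }


def busy_cpus(parsed):
    """Indices of CPU columns with a non-zero interrupt count."""
    return [cpu for cpu, count in enumerate(parsed["per_cpu"]) if count > 0]


# Row copied from the post above.
LINE = "282: 14 0 0 0 0 0 PCI-MSI 538968066 Edge eth1-TxRx-1"

row = parse_interrupt_line(LINE)
print(row["irq"], row["name"], "serviced on CPUs", busy_cpus(row))
```

The counters only show where interrupts were delivered so far. The configured mask is a separate thing and can be read from /proc/irq/<n>/smp_affinity_list.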

I also found that the Xavier NX module’s default power mode on JP5.0.2 is “MODE 10W DESKTOP”, which enables only 4 cores. The “set affinity failed” messages appear when the kernel boots and tries to disable CPU4/CPU5:

[ 10.191497] IRQ221: set affinity failed(-22).
[ 10.191669] IRQ280: set affinity failed(-22).
[ 10.191780] IRQ281: set affinity failed(-22).
[ 10.191890] IRQ282: set affinity failed(-22).
[ 10.191992] IRQ283: set affinity failed(-22).
[ 10.192094] IRQ284: set affinity failed(-22).
[ 10.192196] IRQ285: set affinity failed(-22).
[ 10.192298] IRQ286: set affinity failed(-22).
[ 10.192400] IRQ287: set affinity failed(-22).
[ 10.192503] IRQ288: set affinity failed(-22).
[ 10.194014] CPU4: shutdown
[ 10.194404] psci: CPU4 killed (polled 0 ms)
[ 10.243597] CPU5: shutdown
[ 10.243766] psci: CPU5 killed (polled 0 ms)

If I change the power mode to “MODE 15W 6CORE”, all 6 cores are enabled, and no IRQ affinity errors are reported when the kernel boots.
@WayneWWW could you help check whether the Xavier NX module changes IRQ affinity on JP5.0.2, and why? Thanks.
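
To make the pattern in these logs easier to see, here is a small Python sketch that groups the "set affinity failed" lines by the CPU that goes offline right after them. The helper name and the grouping rule (attach each failure to the next "CPUn: shutdown" line) are assumptions of mine, based only on the ordering seen in the logs in this thread.

```python
import re

AFFINITY_FAIL = re.compile(r"IRQ(\d+): set affinity failed\((-?\d+)\)")
CPU_DOWN = re.compile(r"CPU(\d+): shutdown")


def failures_per_offlined_cpu(log_text):
    """Map each offlined CPU to the (irq, error) pairs logged just before it."""
    result = {}
    pending = []
    for line in log_text.splitlines():
        fail = AFFINITY_FAIL.search(line)
        if fail:
            pending.append((int(fail.group(1)), int(fail.group(2))))
            continue
        down = CPU_DOWN.search(line)
        if down:
            result[int(down.group(1))] = pending
            pending = []
    return result


# A subset of the boot log lines quoted above.
SAMPLE = """\
[ 10.191497] IRQ221: set affinity failed(-22).
[ 10.191669] IRQ280: set affinity failed(-22).
[ 10.191890] IRQ282: set affinity failed(-22).
[ 10.194014] CPU4: shutdown
[ 10.194404] psci: CPU4 killed (polled 0 ms)
[ 10.243597] CPU5: shutdown
[ 10.243766] psci: CPU5 killed (polled 0 ms)
"""

for cpu, fails in failures_per_offlined_cpu(SAMPLE).items():
    print(f"CPU{cpu}: {len(fails)} failed IRQ(s)", [irq for irq, _ in fails])
```

Note that the shutdown log in the first post contains "migrate_one_irq: 8 callbacks suppressed", so these messages are rate limited and an empty list for a CPU does not prove that every migration succeeded.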

For PCI-MSI, the 5.10 kernel differs from 4.9.
Kernel 5.10 adds the following in kernel-5.10/drivers/pci/controller/dwc/pcie-designware-host.c:

 static struct irq_chip dw_pci_msi_bottom_irq_chip = {
        .name = "DWPCI-MSI",
        .irq_ack = dw_pci_bottom_ack,
        .irq_compose_msi_msg = dw_pci_setup_msi_msg,
        .irq_set_affinity = dw_pci_msi_set_affinity,
        .irq_mask = dw_pci_bottom_mask,
        .irq_unmask = dw_pci_bottom_unmask,
};

But dw_pci_msi_set_affinity just returns -EINVAL.

Kernel 4.9 uses gic_set_affinity from kernel-4.9/drivers/irqchip/irq-gic.c,
which is not empty.

Thus, I think that is the main reason you see the IRQ affinity messages. This is not related to the i210 driver.
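
The explanation above can be pictured with a toy model in Python. This is not kernel code: the function names, the IRQ-to-callback table, and IRQ 30 are made up for illustration, and the model skips every check the kernel performs before it tries to move an interrupt. It only shows how a callback that always returns -EINVAL (-22) produces one failure message per IRQ when a CPU is taken offline, while a callback that accepts the new CPU set stays silent.

```python
EINVAL = 22  # Linux errno value for "Invalid argument"


def stub_set_affinity(irq, cpus):
    """Stands in for a callback that always refuses, as described above."""
    return -EINVAL


def working_set_affinity(irq, cpus):
    """Stands in for a callback that accepts any non-empty CPU set."""
    return 0 if cpus else -EINVAL


def offline_cpu(cpu, online, irq_callbacks):
    """Remove `cpu` from `online` and ask every IRQ to move to the remaining CPUs."""
    remaining = online - {cpu}
    messages = []
    for irq, set_affinity in sorted(irq_callbacks.items()):
        err = set_affinity(irq, remaining)
        if err:
            messages.append(f"IRQ{irq}: set affinity failed({err}).")
    return remaining, messages


# IRQ 282/283 use the refusing stub; IRQ 30 is a hypothetical IRQ on a working chip.
irqs = {282: stub_set_affinity, 283: stub_set_affinity, 30: working_set_affinity}
remaining, messages = offline_cpu(4, {0, 1, 2, 3, 4, 5}, irqs)
print("\n".join(messages))
```

In this model the failing IRQs simply stay where they were, which would be consistent with the earlier observation that IRQ282 is serviced on CPU0 throughout.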

Hi Wayne

Thanks for your update. I also found some discussion about this patch and fix:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2320789.html
https://lkml.kernel.org/lkml/1a1c41f1-4085-6b24-adea-d1e0867e7d9d@nvidia.com/T/

Please take a look at these.

Also, what about the NVGPU error message? Is there a plan to fix it? Thanks.

Looks like the nvgpu error log can be safely ignored when running the shutdown command. The board eventually shuts down and the PWR LED turns off.

But trying to reboot into force-recovery mode will fail. It actually stops the reboot process and I cannot manually put the board into recovery mode.

I tried this on my Jetson Xavier NX board (carrier board made by Seeed, supposedly identical to the devkit). I attached log files for the shutdown and reboot cases.

reboot.log (137.4 KB)
shutdown.log (137.3 KB)

@user93433 please open a new topic if it’s still an issue.
Thanks

I don’t see any reason why a new topic is needed?
It is the same issue, this topic wasn’t closed when I posted my comment, and there wasn’t any resolution.

Is there any update on this issue with nvgpu timing out during shutdown? I’ve been seeing the same thing on an AGX Xavier for a while now. This is more than a cosmetic issue for us, as we have to shut down quickly on backup power before total power loss.
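
Since shutdown time matters here, a quick way to see where the time goes is to look at the gaps between consecutive kernel timestamps in a saved console log. The Python sketch below does that; the function name is hypothetical, and it assumes each line starts with a "[ seconds.micros]" timestamp as in the logs earlier in this thread.

```python
import re

TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)")


def largest_gaps(log_text, top=3):
    """Return the `top` largest (gap_seconds, line_before, line_after) tuples."""
    stamped = []
    for line in log_text.splitlines():
        m = TIMESTAMP.match(line.strip())
        if m:
            stamped.append((float(m.group(1)), m.group(2)))
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(stamped, stamped[1:])]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Devkit shutdown log copied from earlier in this thread.
SAMPLE = """\
[ 403.439766] Trying to unregister non-registered hwtime source
[ 407.553809] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gp10b_gr_init_wait_empty+0x168/0x2a0 [nvgpu]
[ 407.554221] nvgpu: 17000000.gv11b gp10b_gr_init_wait_empty:99 [ERR] timeout, ctxsw busy : 0, gr busy : 1, badf1301, badf1301, badf1301, badf1301
[ 407.554761] nvgpu: 17000000.gv11b nvgpu_quiesce:1298 [ERR] failed to prepare for poweroff, err=-11
[ 407.555078] arm-smmu 12000000.iommu: disabling translation
[ 407.555428] arm-smmu 10000000.iommu: disabling translation
[ 407.586504] CPU1: shutdown
[ 407.606387] CPU2: shutdown
[ 407.626338] CPU3: shutdown
[ 407.630074] reboot: Power down
"""

for gap, before, after in largest_gaps(SAMPLE, top=1):
    print(f"{gap:.3f} s between '{before[:40]}' and '{after[:40]}'")
```

On the devkit log the largest gap sits right before the nvgpu timeout message. That does not prove the whole gap is spent waiting on the GPU, since other quiet shutdown work may happen in the same window, but it is a reasonable place to start measuring.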

Perhaps a hint at what might be happening: in my case (power mode MAXN), if I don’t run jetson_clocks at boot, nvgpu unloads cleanly at shutdown.