JP5.02 system shutdown NVGPU error

Hi,

On JP5.02, the system reports the following error messages during shutdown:

Ubuntu 20.04.4 LTS test-desktop ttyTCU0

test-desktop login: [ 46.536227] Trying to unregister non-registered hwtime source
[ 49.779699] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gp10b_gr_init_wait_empty+0x168/0x2a0 [nvgpu]
[ 49.780110] nvgpu: 17000000.gv11b gp10b_gr_init_wait_empty:99 [ERR] timeout, ctxsw busy : 0, gr busy : 1, badf1301, badf1301, badf1301, badf1301
[ 49.780653] nvgpu: 17000000.gv11b nvgpu_quiesce:1298 [ERR] failed to prepare for poweroff, err=-11
[ 49.781015] arm-smmu 12000000.iommu: disabling translation
[ 49.781364] arm-smmu 10000000.iommu: disabling translation
[ 49.811239] migrate_one_irq: 8 callbacks suppressed
[ 49.811245] IRQ282: set affinity failed(-22).
[ 49.811473] IRQ283: set affinity failed(-22).
[ 49.811574] IRQ284: set affinity failed(-22).
[ 49.811675] IRQ285: set affinity failed(-22).
[ 49.811792] IRQ286: set affinity failed(-22).
[ 49.811897] IRQ287: set affinity failed(-22).
[ 49.812012] IRQ288: set affinity failed(-22).
[ 49.812112] IRQ289: set affinity failed(-22).
[ 49.813186] CPU1: shutdown
[ 49.831163] IRQ282: set affinity failed(-22).
[ 49.831308] IRQ283: set affinity failed(-22).
[ 49.831831] CPU2: shutdown
[ 49.851107] CPU3: shutdown
[ 49.853795] reboot: Power down
▒▒Shutdown state requested 0
Shutting down system …

Could you please help check what is causing the NVGPU error?

Is any application running on the device?
Is it a devkit or a custom carrier board?

Does this cause any fatal behavior?

It’s on our custom carrier board. I will try it on the devkit.

Hi Wayne

There’s no fatal error, and the system still shuts down correctly. The only symptom is the NVGPU error messages printed on the serial port during shutdown.

Hi Kayccc,

I have tried it on the devkit, and the same error messages appear there as well.

The following messages are from the Jetson Xavier NX devkit:
[ 403.439766] Trying to unregister non-registered hwtime source
[ 407.553809] nvgpu: 17000000.gv11b nvgpu_timeout_expired_msg_cpu:94 [ERR] Timeout detected @ gp10b_gr_init_wait_empty+0x168/0x2a0 [nvgpu]
[ 407.554221] nvgpu: 17000000.gv11b gp10b_gr_init_wait_empty:99 [ERR] timeout, ctxsw busy : 0, gr busy : 1, badf1301, badf1301, badf1301, badf1301
[ 407.554761] nvgpu: 17000000.gv11b nvgpu_quiesce:1298 [ERR] failed to prepare for poweroff, err=-11
[ 407.555078] arm-smmu 12000000.iommu: disabling translation
[ 407.555428] arm-smmu 10000000.iommu: disabling translation
[ 407.586504] CPU1: shutdown
[ 407.606387] CPU2: shutdown
[ 407.626338] CPU3: shutdown
[ 407.630074] reboot: Power down
▒▒▒▒Shutdown state requested 0
Shutting down system …
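A side note for reading logs like the two above: the negative numbers (err=-11, failed(-22)) look like standard Linux errno values reported with the usual kernel sign convention. Below is a minimal sketch for decoding them, assuming that convention holds for these messages. The table is a small hand-copied subset of the Linux base errno values, and the names `LINUX_ERRNO`, `ERR_PATTERN`, and `decode_line` are made up for this example. Decoding only gives the generic meaning of the code; it does not say why nvgpu or the IRQ code returned it.

```python
import re

# Small subset of Linux base errno values (number -> (name, generic meaning)).
# Hand-copied for illustration; extend it if other codes show up in the log.
LINUX_ERRNO = {
    1: ("EPERM", "Operation not permitted"),
    2: ("ENOENT", "No such file or directory"),
    5: ("EIO", "I/O error"),
    11: ("EAGAIN", "Try again"),
    12: ("ENOMEM", "Out of memory"),
    16: ("EBUSY", "Device or resource busy"),
    19: ("ENODEV", "No such device"),
    22: ("EINVAL", "Invalid argument"),
}

# Matches the two spellings seen in this thread: "err=-11" and "failed(-22)".
ERR_PATTERN = re.compile(r"(?:err=|failed\()(-\d+)")


def decode_line(line):
    """Return (code, (name, meaning)) for a log line, or None if no code is found."""
    match = ERR_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, LINUX_ERRNO.get(-code, ("UNKNOWN", "not in this table"))
```

Feeding it the `nvgpu_quiesce` line gives EAGAIN, and the `set affinity failed` lines give EINVAL.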

I am curious if you’ve installed any realtime app or extension or patch? Or if you’ve manually set any kind of IRQ affinity?

Hi Linuxdev

No, there’s no realtime patch applied to the current kernel. Those IRQ numbers are PCI-MSI interrupts; according to /proc/interrupts, they belong to the Ethernet cards on our custom board. Why do you think it’s related to a realtime patch? Thanks.

Statistically, if someone is working with affinity, I’d say they’re likely trying to reduce latency. I didn’t know whether that applied here; the log just raised the possibility. The original log shows several of these:

[ 49.811245] IRQ282: set affinity failed(-22).

It does make me curious whether the attempt to set affinity is “standard” for the driver, or something specific to the Jetson. I ask partly because much of the hardware on a Jetson can only direct a hardware interrupt to the first CPU. Many hardware devices (this does not apply to software drivers), if told to use a CPU other than the first, will migrate back to the first CPU when they can’t reach the other one. Setting the affinity of such a device to a non-first CPU might either be ignored or result in some unknown error in the kernel. I don’t know for certain, but I do wonder whether the affinity is trying to use an unavailable core.
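If anyone wants to check this on a running board, here is a rough sketch of how the affinity question could be inspected from user space. It assumes the standard Linux procfs layout (`/proc/interrupts` and `/proc/irq/<N>/smp_affinity_list`). The function names `parse_interrupts` and `read_affinity` are made up for this example, and nothing here is specific to the Jetson or to nvgpu.

```python
from pathlib import Path


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq: (per_cpu_counts, description)}.

    Only numbered IRQ rows are kept; named rows (IPI, NMI, Err, ...) are skipped.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row lists one column per online CPU
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue
        fields = rest.split()
        counts = [int(c) for c in fields[:ncpu]]
        table[int(label)] = (counts, " ".join(fields[ncpu:]))
    return table


def read_affinity(irq, proc_root="/proc"):
    """Return the CPU list the IRQ is allowed to run on, or None if unreadable."""
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity_list"
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

On the board, something like `parse_interrupts(Path("/proc/interrupts").read_text())` followed by `read_affinity(282)` through `read_affinity(289)` would show which device each of the failing IRQs belongs to and which CPUs it is currently allowed on. If the counts only ever increase on CPU0, that would fit the idea that the interrupt cannot really be routed elsewhere, though it would not prove it.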