Drive AGX Xavier shuts down during operation (symptom 2/2), #19071102

Dear Manager,
Drive AGX Xavier (not Pegasus) shuts down while in operation.
This happens a few times a week.
We are running Drive Software 9.

Sent: Thursday, July 11, 2019 18:29

Thank you for the information. As I mentioned on the phone, let me share a log file we captured when Tegra B stalled.
We haven't investigated this issue in depth yet, but please let us know if you find any clues.

Regards,

[ 3472.632362] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3472.639438] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3482.843172] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3482.951237] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3482.958353] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3487.432801] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 3487.432817] 5-...: (2 GPs behind) idle=cb7/140000000000000/0 softirq=0/0 fqs=2420
[ 3487.432822] (detected by 1, t=5252 jiffies, g=223818, c=223817, q=17880)
[ 3487.500769] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3487.500780] 5-...: (1 GPs behind) idle=cb7/140000000000000/0 softirq=0/0 fqs=117
[ 3487.500785] (detected by 3, t=5252 jiffies, g=70923, c=70922, q=22)
[ 3487.500904] rcu_sched kthread starved for 4995 jiffies! g70923 c70922 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[ 3493.078005] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3493.218009] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3493.224389] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3503.313017] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3503.420960] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3503.427618] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3513.547805] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3513.659821] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3513.666827] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3523.782704] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3523.891040] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3523.897389] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3534.017634] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3534.125623] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3534.132696] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3544.252457] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3544.361283] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3544.368236] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3550.421307] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 3550.421390] 5-...: (2 GPs behind) idle=cb7/140000000000000/0 softirq=0/0 fqs=9646
[ 3550.421396] (detected by 2, t=21007 jiffies, g=223818, c=223817, q=25750)
[ 3550.489279] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3550.489295] 5-...: (1 GPs behind) idle=cb7/140000000000000/0 softirq=0/0 fqs=7412
[ 3550.489299] (detected by 2, t=21007 jiffies, g=70923, c=70922, q=22)
[ 3554.487347] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3554.627461] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3554.634869] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3564.722233] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3564.834286] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3564.840621] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3574.957099] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3575.065127] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3575.071211] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3585.191948] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3585.299978] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3585.306189] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3595.426844] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3595.534888] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3595.541080] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3605.661800] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3605.769773] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3605.776665] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255
[ 3613.409838] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 3613.409860] 5-...: (2 GPs behind) idle=cb7/140000000000000/0 softirq=0/0 fqs=16611
[ 3613.409866] (detected by 3, t=36762 jiffies, g=223818, c=223817, q=30512)
[ 3613.477780] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 3613.477792] 5-...: (1 GPs behind) idle=cb7/140000000000000/0 softirq=0/0 fqs=14355
[ 3613.477867] (detected by 6, t=36762 jiffies, g=70923, c=70922, q=22)
[ 3615.896677] gk20a 17810000.vgpu: tegra_gr_comm_recv: timeout for response!
[ 3616.028596] nvgpu: 17810000.vgpu gk20a_channel_timeout_handler:1434 [ERR] Job on channel 255 timed out
[ 3616.035081] nvgpu: 17810000.vgpu nvgpu_set_error_notifier_locked:135 [ERR] error notifier set to 8 for ch 255

Ubuntu 16.04.6 LTS tegra-b ttyS0

tegra-b login:

Ubuntu 16.04.6 LTS tegra-b ttyS0

tegra-b login:

Ubuntu 16.04.6 LTS tegra-b ttyS0

tegra-b login: nvidia

Password:
Last login: Tue Jul 2 11:21:47 JST 2019 from 10.42.0.28 on pts/4
Welcome to Ubuntu 16.04.6 LTS (GNU/Linux 4.9.111-rt76-tegra aarch64)

221 packages can be updated.
0 updates are security updates.

New release ‘18.04.2 LTS’ available.
Run ‘do-release-upgrade’ to upgrade to it.

*** /dev/vblkdev0p1 will be checked for errors at next reboot ***

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

[ 149.401703] NOHZ: local_softirq_pending 80
[ 150.384773] NOHZ: local_softirq_pending 80
[ 150.384819] NOHZ: local_softirq_pending 80
[ 151.272768] NOHZ: local_softirq_pending 80
[ 167.224754] NOHZ: local_softirq_pending 80
[ 168.823972] NOHZ: local_softirq_pending 80
[ 174.221285] NOHZ: local_softirq_pending 80
[ 177.355672] NOHZ: local_softirq_pending 80
[ 179.166754] NOHZ: local_softirq_pending 80
[ 194.047328] NOHZ: local_softirq_pending 80

Welcome to Drive Linux running on Ubuntu 16.04!

nvidia@tegra-b:~$
1907021043_serial.log (7.75 KB)

Dear khirose,

We have filed a bug for this symptom; please refer to Bug 200534781.
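
For anyone triaging a similar serial log, here is a minimal sketch of a script that summarizes the two message types seen above: nvgpu channel timeouts and RCU stall reports. It is not an NVIDIA tool. The function name `summarize` and the regular expressions are assumptions written against the message formats in this particular log, and they may need adjusting for other kernel or driver versions.

```python
import re
from collections import Counter

# Kernel timestamp prefix, e.g. "[ 3472.632362] ".
TS = r"\[\s*(\d+\.\d+)\]\s+"

# Patterns are assumptions based on the message formats in the log above.
GPU_TIMEOUT = re.compile(
    TS + r"nvgpu: \S+ gk20a_channel_timeout_handler:\d+ \[ERR\] "
    r"Job on channel (\d+) timed out"
)
RCU_STALL = re.compile(TS + r"INFO: (rcu_\w+) detected stalls on CPUs/tasks:")
# Per-CPU stall line, e.g. "5-...: (2 GPs behind) ...". The flag characters
# after the CPU number are matched loosely because pasted logs may mangle them.
STALLED_CPU = re.compile(TS + r"(\d+)-\S+: \(\d+ GPs behind\)")


def summarize(log_text):
    """Count GPU channel timeouts and RCU stalls in a kernel serial log."""
    gpu_times = []
    channels = Counter()
    rcu_flavors = Counter()
    stalled_cpus = Counter()

    for line in log_text.splitlines():
        m = GPU_TIMEOUT.search(line)
        if m:
            gpu_times.append(float(m.group(1)))
            channels[int(m.group(2))] += 1
            continue
        m = RCU_STALL.search(line)
        if m:
            rcu_flavors[m.group(2)] += 1
            continue
        m = STALLED_CPU.search(line)
        if m:
            stalled_cpus[int(m.group(2))] += 1

    # Seconds between consecutive channel timeouts, rounded to 0.1 s.
    gaps = [round(b - a, 1) for a, b in zip(gpu_times, gpu_times[1:])]

    return {
        "gpu_timeouts": len(gpu_times),
        "channels": dict(channels),
        "timeout_gaps_s": gaps,
        "rcu_stalls": dict(rcu_flavors),
        "stalled_cpus": dict(stalled_cpus),
    }
```

To use it, read the saved serial log into a string and pass it to `summarize`. In the log above, the timeouts are all on channel 255, recur roughly every 10 seconds, and every RCU stall report names CPU 5, so a summary like this may help when comparing occurrences.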