Jetson freezes up and emits CPU stack & call trace

Here I have one set of logs with a small stack trace and some details about the CPU/registers.

I have 9 identical devices, and this is the only one that does this. Do I need to replace it?

Thank you.
Alexis

[Sat Jan 23 15:29:44 2021] ------------[ cut here ]------------
[Sat Jan 23 15:29:44 2021] WARNING: CPU: 0 PID: 2548 at /dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/common/pmu/pmu_pg.c:275 nvgpu_pmu_disable_elpg+0xf4/0x348 [nvgpu]
[Sat Jan 23 15:29:44 2021] Modules linked in: fuse bnep zram overlay nf_log_ipv6 ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_multiport spidev xt_conntrack nf_conntrack iptable_filter userspace_alert nvgpu bluedroid_pm ip_tables x_tables

[Sat Jan 23 15:29:44 2021] CPU: 0 PID: 2548 Comm: irq/476-gk20a_s Tainted: G        W       4.9.140-tegra #1
[Sat Jan 23 15:29:44 2021] Hardware name: Jetson-AGX (DT)
[Sat Jan 23 15:29:44 2021] task: ffffffc7da096200 task.stack: ffffffc7c155c000
[Sat Jan 23 15:29:44 2021] PC is at nvgpu_pmu_disable_elpg+0xf4/0x348 [nvgpu]
[Sat Jan 23 15:29:44 2021] LR is at nvgpu_pmu_disable_elpg+0xf4/0x348 [nvgpu]
[Sat Jan 23 15:29:44 2021] pc : [<ffffff8000fdcd1c>] lr : [<ffffff8000fdcd1c>] pstate: 20c00045
[Sat Jan 23 15:29:44 2021] sp : ffffffc7c155fbe0
[Sat Jan 23 15:29:44 2021] x29: ffffffc7c155fbf0 x28: 0000000000000000 
[Sat Jan 23 15:29:44 2021] x27: 0000000000000001 x26: 0000000000000000 
[Sat Jan 23 15:29:44 2021] x25: ffffff800105b470 x24: ffffffc7c28d26a8 
[Sat Jan 23 15:29:44 2021] x23: ffffffc7c28d2d28 x22: 0000000000000000 
[Sat Jan 23 15:29:44 2021] x21: ffffffc7c28d8000 x20: ffffff8001062b38 
[Sat Jan 23 15:29:44 2021] x19: ffffffc7c28d0000 x18: 0000000000000003 
[Sat Jan 23 15:29:44 2021] x17: 0000007f8812e258 x16: 00000000001ca875 
[Sat Jan 23 15:29:44 2021] x15: ffffffffffffffff x14: 5f756d705f757067 
[Sat Jan 23 15:29:44 2021] x13: 766e20205d4e5257 x12: 5b20203437323a67 
[Sat Jan 23 15:29:44 2021] x11: 706c655f656c6261 x10: 7369645f756d705f 
[Sat Jan 23 15:29:44 2021] x9 : 757067766e202020 x8 : ffffffc7ffc1a6d4 
[Sat Jan 23 15:29:44 2021] x7 : 0000000000000000 x6 : 00000000133a5967 
[Sat Jan 23 15:29:44 2021] x5 : 0000000000000000 x4 : 0000000000000000 
[Sat Jan 23 15:29:44 2021] x3 : ffffffffffffffff x2 : 00000047f642b000 
[Sat Jan 23 15:29:44 2021] x1 : ffffffc7da096200 x0 : 000000000000008a 

[Sat Jan 23 15:29:44 2021] ---[ end trace a5f50b22b422d710 ]---
[Sat Jan 23 15:25:07 2021] Call trace:
[Sat Jan 23 15:25:07 2021] [<ffffff8000fdcd1c>] nvgpu_pmu_disable_elpg+0xf4/0x348 [nvgpu]
[Sat Jan 23 15:25:07 2021] [<ffffff8000fdd064>] nvgpu_pmu_pg_global_enable+0xf4/0x108 [nvgpu]
[Sat Jan 23 15:25:07 2021] [<ffffff8000f9a4b8>] nvgpu_pg_elpg_disable+0xb0/0xc8 [nvgpu]
[Sat Jan 23 15:25:07 2021] [<ffffff8000f9736c>] mc_gp10b_isr_stall+0xac/0x218 [nvgpu]
[Sat Jan 23 15:25:07 2021] [<ffffff8000fa9728>] nvgpu_intr_thread_stall+0x50/0x1d8 [nvgpu]
[Sat Jan 23 15:25:07 2021] [<ffffff8000fb9940>] nvgpu_fecs_trace_init_debugfs+0x30f8/0x3198 [nvgpu]
[Sat Jan 23 15:25:07 2021] [<ffffff8008123980>] irq_thread_fn+0x30/0x80
[Sat Jan 23 15:25:07 2021] [<ffffff8008123cbc>] irq_thread+0x11c/0x1a8
[Sat Jan 23 15:25:07 2021] [<ffffff80080dbe64>] kthread+0xec/0xf0
[Sat Jan 23 15:25:07 2021] [<ffffff80080838a0>] ret_from_fork+0x10/0x30
[Sat Jan 23 15:25:07 2021] nvgpu: 17000000.gv11b             nvgpu_pmu_enable_elpg:208  [WRN]  nvgpu_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2

We have not seen this error recently. I suspect it may be a hardware issue.

Could you share the details of your software version, whether you are using a custom board, and how to trigger this issue?

Is there any pattern or method that reliably reproduces this? If not, you may want to RMA this module.
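
If it helps when looking for a pattern, here is a minimal sketch (not an NVIDIA tool) that tallies kernel WARNING sites in saved `dmesg -T` output, so the logs of the problematic unit can be compared against a healthy one. The regular expression is an assumption based only on the WARNING line format shown in the log above, and `count_warnings` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed format, taken from the WARNING line in the log above:
# [<timestamp>] WARNING: CPU: <n> PID: <n> at <file>:<line> <func>+0x<off>/0x<size> [<module>]
WARNING_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+WARNING: CPU: \d+ PID: \d+ at \S+:\d+ "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def count_warnings(log_text):
    """Return a Counter mapping function name -> number of WARNING lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            counts[match.group("func")] += 1
    return counts


if __name__ == "__main__":
    # Sample line copied from the log in this thread.
    sample = (
        "[Sat Jan 23 15:29:44 2021] WARNING: CPU: 0 PID: 2548 at "
        "/dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/"
        "common/pmu/pmu_pg.c:275 nvgpu_pmu_disable_elpg+0xf4/0x348 [nvgpu]"
    )
    for func, n in count_warnings(sample).items():
        print(func, n)
```

Running the same function over logs saved from each unit would show whether the warning appears only on the one module, which would support the hardware theory.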

I have always had problems with that specific unit, although at first they seemed to be limited to the network. I have not really seen any pattern I could describe, and I do not know what triggers it. Today it rebooted on its own… that was the first time I saw that. It also once failed to auto-login, which matters because our software needs the login in order to start (we use X11).

Since the other units work as expected with the same software version on all of them, I would imagine that it’s not that relevant?

Yes, it does not sound relevant. Please RMA the problematic module first.


Hi Wayne, I have removed the board from our system and it is ready for shipping. How do I obtain an RMA from NVIDIA? I purchased the boards directly from your store, and I don’t see where to request one at the moment. If you have a direct link to the RMA form or equivalent, that would be great.

Thank you.

Ah, I found this form here:

https://store.nvidia.com/DRHM/store#bottomForm

Hopefully that’s the right one. :-)

Hello,

You can check it here.