GPU driver crash

Hi,
I’m using the R21.5 release and have encountered this GPU crash more than once. The system becomes more or less unresponsive and needs to be rebooted:

[ 1672.628648] gk20a gk20a.0: gk20a_pmu_enable_elpg: gk20a_pmu_enable_elpg(): possible elpg refcnt mismatch. elpg refcnt=2
[ 1672.628659] ------------[ cut here ]------------
[ 1672.628672] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3156 gk20a_pmu_enable_elpg+0x1d8/0x2b8()
[ 1672.628677] Modules linked in: nvhost_vi
[ 1672.628692] CPU: 1 PID: 7458 Comm: kworker/1:1 Not tainted 3.10.40-hhone-1.3.0 #24
[ 1672.628702] Workqueue: events pmu_setup_hw
[ 1672.628724] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 1672.628735] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c0062714>] (warn_slowpath_common+0x5c/0x74)
[ 1672.628746] [<c0062714>] (warn_slowpath_common+0x5c/0x74) from [<c00627e0>] (warn_slowpath_null+0x24/0x2c)
[ 1672.628756] [<c00627e0>] (warn_slowpath_null+0x24/0x2c) from [<c03fe3b8>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8)
[ 1672.628770] [<c03fe3b8>] (gk20a_pmu_enable_elpg+0x1d8/0x2b8) from [<c03fe920>] (pmu_setup_hw+0x488/0x890)
[ 1672.628780] [<c03fe920>] (pmu_setup_hw+0x488/0x890) from [<c0081c88>] (process_one_work+0x13c/0x444)
[ 1672.628791] [<c0081c88>] (process_one_work+0x13c/0x444) from [<c00829c0>] (worker_thread+0x140/0x3dc)
[ 1672.628805] [<c00829c0>] (worker_thread+0x140/0x3dc) from [<c0088ee0>] (kthread+0xd4/0xd8)
[ 1672.628816] [<c0088ee0>] (kthread+0xd4/0xd8) from [<c000ed98>] (ret_from_fork+0x14/0x20)
[ 1672.628825] ---[ end trace 40af022902fb81c6 ]---
[ 1672.687667] gk20a gk20a.0: gk20a_pmu_disable_elpg: gk20a_pmu_disable_elpg(): possible elpg refcnt mismatch. elpg refcnt=1
[ 1672.687680] ------------[ cut here ]------------
[ 1672.687700] WARNING: at drivers/gpu/nvgpu/gk20a/pmu_gk20a.c:3192 gk20a_pmu_disable_elpg+0x80/0x2e0()
[ 1672.687708] Modules linked in: nvhost_vi
[ 1672.687730] CPU: 1 PID: 12624 Comm: gstglcontext Tainted: G        W    3.10.40-hhone-1.3.0 #24
[ 1672.687752] [<c0016370>] (unwind_backtrace+0x0/0x13c) from [<c0012c0c>] (show_stack+0x18/0x1c)
[ 1672.687765] [<c0012c0c>] (show_stack+0x18/0x1c) from [<c0062714>] (warn_slowpath_common+0x5c/0x74)
[ 1672.687773] [<c0062714>] (warn_slowpath_common+0x5c/0x74) from [<c00627e0>] (warn_slowpath_null+0x24/0x2c)
[ 1672.687782] [<c00627e0>] (warn_slowpath_null+0x24/0x2c) from [<c03feda8>] (gk20a_pmu_disable_elpg+0x80/0x2e0)
[ 1672.687795] [<c03feda8>] (gk20a_pmu_disable_elpg+0x80/0x2e0) from [<c03e45c0>] (gk20a_alloc_obj_ctx+0x8c8/0xbd4)
[ 1672.687805] [<c03e45c0>] (gk20a_alloc_obj_ctx+0x8c8/0xbd4) from [<c03d3e68>] (gk20a_channel_ioctl+0x6a8/0x10cc)
[ 1672.687819] [<c03d3e68>] (gk20a_channel_ioctl+0x6a8/0x10cc) from [<c01588b8>] (do_vfs_ioctl+0x3f8/0x5b8)
[ 1672.687829] [<c01588b8>] (do_vfs_ioctl+0x3f8/0x5b8) from [<c0158ad0>] (SyS_ioctl+0x58/0x168)
[ 1672.687839] [<c0158ad0>] (SyS_ioctl+0x58/0x168) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)
[ 1672.687845] ---[ end trace 40af022902fb81c7 ]---

Can you help me understand where the problem is?

Thank you, best regards.

Ivan

Does this show all “ok”?

sha1sum -c /etc/nv_tegra_release

When does this occur? For example, has the system been running something for a while, does it happen as soon as boot completes, or is there a particular program that triggers it?

Could you try this workaround?

echo 0 | sudo tee /sys/devices/platform/host1x/gk20a.0/elpg_enable
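
A minimal sketch for applying the setting and reading it back, assuming the sysfs node above exists on your board. The helper name `set_elpg` and its optional path argument are made up for illustration. A sysfs write like this does not persist across reboots, so it has to be repeated after each boot.

```shell
# Hypothetical helper (not part of L4T): write a value to the ELPG control
# node and print what the node reports afterwards.
# $1 = value to write (0 disables, 1 enables)
# $2 = optional node path; defaults to the path quoted in this thread
set_elpg() {
    value="$1"
    node="${2:-/sys/devices/platform/host1x/gk20a.0/elpg_enable}"

    if [ ! -e "$node" ]; then
        echo "elpg node not found: $node" >&2
        return 1
    fi

    if [ -w "$node" ]; then
        # Already privileged (or node is writable): write directly.
        printf '%s\n' "$value" > "$node"
    else
        # The write itself must run as root, so pipe into tee under sudo.
        printf '%s\n' "$value" | sudo tee "$node" > /dev/null
    fi

    # Read the node back so the caller can confirm the new state.
    cat "$node"
}
```

Calling `set_elpg 0` should print `0` if the write took effect.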

Hi linuxdev and WayneWWW,

Yes, the sha1sum check is OK. The issue seems to be triggered by an application that uses the GStreamer C API with omxh264enc and omxh264dec, but I’m not sure, because it happens only occasionally and I don’t have steps to reproduce it.
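
Without reproduction steps, one way to narrow it down is to check the kernel log after each run. The sketch below assumes the warning text and the `Comm:` field look like the trace at the top of this thread. The function names are hypothetical, and `list_trace_comms` lists the process name from every backtrace header in the input, not only the ELPG ones.

```shell
# Hypothetical helpers: both read kernel log text from stdin.

# Print how many ELPG refcount-mismatch messages the log contains.
count_elpg_warnings() {
    # grep -c exits non-zero when the count is 0; that is not an error here.
    grep -c 'possible elpg refcnt mismatch' || true
}

# Print the process name (Comm) from each backtrace header line.
list_trace_comms() {
    sed -n 's/.*Comm: \([^ ]*\) .*/\1/p'
}
```

For example, `dmesg | count_elpg_warnings` and `dmesg | list_trace_comms | sort | uniq -c` give a quick summary of how often the warning fired and which processes were involved.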

Thank you WayneWWW, I will try this. What is the impact of disabling ELPG?

Thank you, best regards.

Ivan

Hi IvanGolob,

Have you tested the workaround? Did you see any abnormal behavior or crashes?

Thanks

Hi kayccc,

After I set “elpg_enable” to 0, I have not seen the GPU crash anymore.

Thank you, best regards.

Ivan

Hi kayccc, WayneWWW, or IvanGolob,

I have been seeing a GPU crash similar to the one above. I also tried the patch here (https://devtalk.nvidia.com/default/topic/1025936/jetson-tk1/tk1-demo-board-hang/post/5224436/#5224436), but it had no effect. Disabling ELPG seems to work.

But I need to know the implications of disabling ELPG. I looked through the documentation but could not find what exactly ELPG is for.

With it disabled the crash no longer occurs, but what are the consequences?