TX2 module suddenly crashes with a black screen after running for many days

We are using the TX2 module to develop a controller for autonomous driving of construction vehicles, and we designed our own carrier board for it.

After running for many days on our custom carrier board, the TX2 module suddenly crashes and the screen goes black. The attached file is the log we captured. Please help us find the cause of the failure. Thank you.

dmesg20210326.log (118.0 KB)

hello aaron.ge,

so, this is a long-run stability issue.
may I know which JetPack release you’re working with?
also, may I know how many days it runs before you see the failure?
thanks

[361496.751438] nvgpu: 17000000.gp10b   __nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_vm_unmap+0xe0/0x188 [nvgpu] sync-unmap failed on 0x1e15000000
[361496.751732] nvgpu: 17000000.gp10b   __nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_vm_unmap+0x118/0x188 [nvgpu] 
[361496.810832] ------------[ cut here ]------------
[361496.810842] kernel BUG at /dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/common/mm/vm.c:779!
[361496.810861] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[361496.810871] Modules linked in: bnep fuse zram overlay bcmdhd cfg80211 binfmt_misc nvgpu bluedroid_pm ip_tables x_tables
[361496.810926] CPU: 4 PID: 6024 Comm: Xorg Not tainted 4.9.140-tegra #2
[361496.810934] Hardware name: quill (DT)
[361496.810944] task: ffffffc1cf8bb800 task.stack: ffffffc1cd988000
[361496.811222] PC is at nvgpu_vm_get_buffers+0x130/0x148 [nvgpu]
[361496.811499] LR is at nvgpu_vm_get_buffers+0xd0/0x148 [nvgpu]
[361496.811508] pc : [<ffffff8000fcd410>] lr : [<ffffff8000fcd3b0>] pstate: 20400145
[361496.811516] sp : ffffffc1cd98bb50
[361496.811523] x29: ffffffc1cd98bb50 x28: ffffffc1e3038000 
[361496.811536] x27: ffffffc13f19c600 x26: 0000000000000001 
[361496.811549] x25: 0000000000000000 x24: ffffffc1e74a7458 
[361496.811562] x23: ffffffc126c7e400 x22: ffffffc1cd98bbe4 
[361496.811574] x21: ffffffc1cd98bbe8 x20: ffffffc1e74a7400 
[361496.811586] x19: 0000000000000062 x18: 0000000000002077 
[361496.811597] x17: 0000000000000002 x16: 0000000000000000 
[361496.811608] x15: 000038411f22812e x14: 0006e05fb8ef32f8 
[361496.811620] x13: 00000000605d7bb9 x12: 0000000000000017 
[361496.811631] x11: 0000000036935384 x10: 0000000000058417 
[361496.811642] x9 : 0000000000000000 x8 : ffffffc126c7e800 
[361496.811654] x7 : 00000000024000c0 x6 : ffffffc126c7e400 
[361496.811665] x5 : 0000000000000318 x4 : ffffff8001083cb0 
[361496.811676] x3 : 0000000000000062 x2 : 0000000000000000 
[361496.811687] x1 : 0000001e1d000000 x0 : 0000000000000063 
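
For reference, the bracketed numbers in the excerpt above are kernel timestamps in seconds since boot, so the first nvgpu error at 361496.75 s corresponds to roughly 4.2 days of uptime. The following is a minimal Python sketch, not an NVIDIA tool, for pulling the first error line out of a saved dmesg log and converting its timestamp to days. The helper names and the list of marker strings are assumptions chosen for illustration.

```python
import re

# Matches the "[seconds.micros]" prefix the kernel puts on each dmesg line.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Marker strings are an assumption, picked from the excerpt in this thread.
MARKERS = ("[ERR]", "kernel BUG", "Internal error")


def parse_line(line):
    """Return (seconds_since_boot, message) or None if the line has no timestamp."""
    m = TS_RE.match(line)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)


def first_error(lines, markers=MARKERS):
    """Return the first (timestamp, message) whose message contains a marker."""
    for line in lines:
        parsed = parse_line(line)
        if parsed and any(mk in parsed[1] for mk in markers):
            return parsed
    return None


def uptime_days(seconds):
    """Convert seconds since boot to days."""
    return seconds / 86400.0


# Two lines copied from the log excerpt in this thread.
sample = [
    "[361496.751438] nvgpu: 17000000.gp10b   __nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ nvgpu_vm_unmap+0xe0/0x188 [nvgpu] sync-unmap failed on 0x1e15000000",
    "[361496.810842] kernel BUG at /dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/common/mm/vm.c:779!",
]

ts, msg = first_error(sample)
print(f"first error at {ts:.3f} s since boot (~{uptime_days(ts):.2f} days)")
```

To run it on a full log, replace `sample` with the lines of the saved file, for example `open("dmesg20210326.log").read().splitlines()`.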

Hi JerryChang:
We use JetPack 4.3. The failure appears after about 6 days of running. Thank you.

hello aaron.ge,

could you please also summarize which background services are running?
are you able to narrow down the issue? thanks

Dear JerryChang,
The problem posted by Mr. aaron arose in our lab, so let me answer your questions. When the TX2 module crashed, it was running an image processing algorithm on a virtual scene of an unmanned mine.
The algorithm is not very complicated, but the module had been running for about one week and its temperature was high.
We guess that the high temperature caused the crash, but we are not sure.
I hope this is helpful. Thanks.

hello yanzhg,

could you please enable the tegrastats utility to monitor the frequencies, temperatures, power, etc.?
you should also refer to the Thermal Specifications for the supported power states; there are software and hardware throttling mechanisms for different trip points.
thanks
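
As a rough illustration of the monitoring suggested above, here is a minimal Python sketch that extracts the temperature fields from one line of tegrastats output and flags any zone above a limit. The sample line is made up for illustration and only mimics the `NAME@value C` tokens that tegrastats prints; the 80.0 limit is a hypothetical value, not an actual trip point from the Thermal Specifications.

```python
import re

# Matches temperature tokens such as "GPU@38.5C" or "Tboard@36C".
# Frequency tokens like "4%@1600" are skipped because "%" is not a word character.
TEMP_RE = re.compile(r"(\w+)@(-?\d+(?:\.\d+)?)C\b")

# Hypothetical warning limit in degrees Celsius; replace it with the trip
# point that applies to your module and power mode.
WARN_LIMIT_C = 80.0


def parse_temperatures(line):
    """Return a dict mapping thermal zone name to temperature in Celsius."""
    return {name: float(value) for name, value in TEMP_RE.findall(line)}


def over_limit(temps, limit=WARN_LIMIT_C):
    """Return the sorted names of zones whose temperature exceeds the limit."""
    return sorted(name for name, value in temps.items() if value > limit)


# Illustrative line only, not taken from the reporter's system.
sample = (
    "RAM 2345/7860MB (lfb 1x4MB) CPU [10%@345,off,off,5%@345,8%@345,9%@345] "
    "EMC_FREQ 4%@1600 GR3D_FREQ 0%@114 PLL@40.5C MCPU@40.5C Tboard@36C "
    "GPU@38.5C BCPU@40.5C thermal@39.8C Tdiode@37.25C"
)

temps = parse_temperatures(sample)
print(temps)
print("zones over limit:", over_limit(temps))
```

In practice the lines would come from a tegrastats log collected over the whole multi-day run, for example one written with `tegrastats --interval 1000 --logfile <path>` if your release supports those options, so that the temperatures just before the crash can be inspected afterwards.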

hello JerryChang,
Thanks for your suggestions. I will try them.
Thanks!