Gk20a and Jetson Nano crash

borelli.g92 · September 15, 2020, 7:39pm

I have been testing the following configuration:
A02 model + new SD installation + test3-app
With a modification in the pipeline of test3:

#ifdef PLATFORM_TEGRA
  gst_bin_add_many (GST_BIN (pipeline), queue1, pgie, queue2, tiler, queue3,
      nvvidconv, queue4, transform, sink, NULL);  
  if (!gst_element_link_many (streammux, queue1, pgie, queue2, tiler, queue3,
        nvvidconv, queue4, transform, sink, NULL)) {
    g_printerr ("Elements could not be linked. Exiting.\n");
    return -1;
  }
#else

I removed nvosd from the pipeline.
As a result, the board ran noticeably longer than usual before rebooting.
Almost 28 hours!!
Here follows the log:

NvRmMemHanldeAllocAttr() or relevant. 
[240320.269192] ------------[ cut here ]------------
[240320.274959] WARNING: CPU: 1 PID: 2140 at /dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/gk20a/gk20a.c:64 __gk20a_warn_on_no_regs+0x34/0x50 [nvgpu]
[240320.297296] ---[ end trace 6ca8f5afd7c1b41c ]---
[240320.321214] nvgpu: 57000000.gpu           __nvgpu_check_gpu_state:56   [ERR]  GPU has disappeared from bus!!
[240320.331202] nvgpu: 57000000.gpu           __nvgpu_check_gpu_state:57   [ERR]  Rebooting system!!
[240320.404744] reboot: Restarting system
[0000.159] [L4T TegraBoot] (version 00.00.2018.01-l4t-80a468da)
[0000.165] Processing in cold boot mode Bootloader 2
[0000.169] A02 Bootrom Patch rev = 1023
[0000.173] Power-up reason: software reset
[0000.177] No Battery Present
[0000.179] pmic max77620 reset reason
[0000.183] pmic max77620 NVERC : 0x0
[0000.186] RamCode = 0
[0000.188] Platform has DDR4 type RAM
[0000.192] max77620 disabling SD1 Remote Sense
[0000.196] Setting DDR voltage to 1125mv
[0000.200] Serial Number of Pmic Max77663: 0x1235e9
[0000.208] Entering ramdump check
[0000.211] Get RamDumpCarveOut = 0x0
[0000.214] RamDumpCarveOut=0x0,  RamDumperFlag=0xe59ff3f8
[0000.219] Last reboot was clean, booting normally!
[0000.224] Sdram initialization is successful

I believe the problem is related to overheating of a component that is not monitored by the system’s temperature sensors.
I have found the following post on the forum: Drive PX2 rebooting at high CPU load
The problem in the previous post was related to the fan NOT spinning, thus the GPU was reaching high temperatures.
In my case the fan is spinning very well and, as you have seen from my previous post, I have also tried a high-power 230 V fan. Same result.
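
To check what the monitored sensors actually report in the minutes before a reboot, something like the following could be run in a loop next to the app. This is only a minimal sketch: it assumes the standard Linux sysfs layout (`thermal_zone<N>/type` and `thermal_zone<N>/temp`, with `temp` in millidegrees Celsius) under `/sys/class/thermal`, and the function names are my own, not part of any NVIDIA tool. It cannot see a component that has no sensor, which is exactly my concern.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Convert the text of a sysfs "temp" file (millidegrees Celsius,
 * e.g. "45500\n") to degrees Celsius.
 * Returns 0 on success, -1 if the text does not start with a number. */
int parse_millidegrees(const char *text, double *celsius)
{
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (end == text)
        return -1;
    *celsius = (double)value / 1000.0;
    return 0;
}

/* Read one thermal zone below `base` (assumed: "/sys/class/thermal").
 * Fills `type` with the zone name and `celsius` with its temperature.
 * Returns 0 on success, -1 if the zone is missing or unreadable. */
int read_thermal_zone(const char *base, int index,
                      char *type, size_t type_len, double *celsius)
{
    char path[256], buf[64];
    FILE *f;

    assert(type_len > 0);

    snprintf(path, sizeof path, "%s/thermal_zone%d/type", base, index);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(type, (int)type_len, f)) { fclose(f); return -1; }
    fclose(f);
    type[strcspn(type, "\n")] = '\0';   /* strip trailing newline */

    snprintf(path, sizeof path, "%s/thermal_zone%d/temp", base, index);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) { fclose(f); return -1; }
    fclose(f);
    return parse_millidegrees(buf, celsius);
}

/* Print "<zone name> <temperature>" for every readable zone with an
 * index below 32 (an arbitrary upper bound) and return how many were
 * printed. fflush() only empties the stdio buffer; a log file that must
 * survive a hard reset would also need fsync(). */
int log_thermal_zones(const char *base, FILE *out)
{
    int count = 0;
    for (int i = 0; i < 32; i++) {
        char type[64];
        double c;
        if (read_thermal_zone(base, i, type, sizeof type, &c) != 0)
            continue;
        fprintf(out, "%s %.1f\n", type, c);
        count++;
    }
    fflush(out);
    return count;
}
```

Calling `log_thermal_zones("/sys/class/thermal", stdout)` once per second and redirecting the output to a file would show whether any monitored zone climbs before the "GPU has disappeared from bus" message.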

In any case, today I was looking more closely at the log that you can see above. I would like to focus on the following lines:

[0000.173] Power-up reason: software reset
[0000.177] No Battery Present
[0000.179] pmic max77620 reset reason

The MAX77620 is a power management IC. Could the reboot be linked to a component overheating?
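
To compare these fields across several reboots, a small filter over a saved serial console log could help. The sketch below is only an illustration under my own assumptions: the log has been captured to a file, the two key strings are copied from the lines quoted above, and the function names are hypothetical. It does not interpret the values, it only extracts them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If `line` contains `key`, copy the rest of the line (without the line
 * ending) into `out`, truncating if needed. Returns 1 if found, else 0. */
int extract_field(const char *line, const char *key,
                  char *out, size_t out_len)
{
    assert(line && key && out);

    const char *p = strstr(line, key);
    if (!p || out_len == 0)
        return 0;
    p += strlen(key);

    size_t n = strcspn(p, "\r\n");      /* length up to the line ending */
    if (n >= out_len)
        n = out_len - 1;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}

/* Scan a captured boot log and print every "Power-up reason" and
 * "NVERC" value found. Returns the number of values printed. */
int print_reset_fields(FILE *in, FILE *out)
{
    static const char *keys[] = { "Power-up reason: ", "NVERC : " };
    char line[512], value[128];
    int found = 0;

    while (fgets(line, sizeof line, in)) {
        for (size_t k = 0; k < sizeof keys / sizeof keys[0]; k++) {
            if (extract_field(line, keys[k], value, sizeof value)) {
                fprintf(out, "%s%s\n", keys[k], value);
                found++;
            }
        }
    }
    return found;
}
```

If every reboot shows the same power-up reason and NVERC value, that would at least suggest the reset is always triggered the same way (here, by the nvgpu driver requesting a reboot) rather than by the PMIC itself.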

Thanks again!!