Gk20a and Jetson Nano crash

Hi @WayneWWW,

I am now starting another testing session with the A02 module and I will let you know.

Here is the output of lspci -vv on the B01 module:

00:02.0 PCI bridge: NVIDIA Corporation Device 0faf (rev a1) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 84
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 00001000-00001fff
	Memory behind bridge: 13000000-130fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: <access denied>
	Kernel driver in use: pcieport

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 407
	Region 0: I/O ports at 1000 [size=256]
	Region 2: Memory at 13004000 (64-bit, non-prefetchable) [size=4K]
	Region 4: Memory at 13000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8168

Thank you!!

Hi @WayneWWW,

Following up on my previous message, here is the latest log from today’s test.
As you suggested, I ran the application on the A02 module with a fresh SD card installation.
The module rebooted after 9 hours.
With the A02 there is no PCIe error appearing continuously.

Please see the serial log below:

Ubuntu 18.04.4 LTS jnano-desktop ttyS0



jnano-desktop login: [  224.674391] nvmap_alloc_handle: PID 6898: deepstream-test: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant. 

[  276.701356] nvmap_alloc_handle: PID 7485: deepstream-test: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant. 

[  316.528783] EXT4-fs (mmcblk0p1): error count since last fsck: 1

[  316.534716] EXT4-fs (mmcblk0p1): initial error at time 1517155098: htree_dirblock_to_tree:991: inode 285072: block 1059628

[  316.545825] EXT4-fs (mmcblk0p1): last error at time 1517155098: htree_dirblock_to_tree:991: inode 285072: block 1059628

[31998.894624] ------------[ cut here ]------------

[31998.899472] WARNING: CPU: 1 PID: 7579 at /dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/gk20a/gk20a.c:64 __gk20a_warn_on_no_regs+0x34/0x50 [nvgpu]

[31998.914557] ---[ end trace 6647317957369feb ]---

[31998.920633] nvgpu: 57000000.gpu           __nvgpu_check_gpu_state:56   [ERR]  GPU has disappeared from bus!!

[31998.930534] nvgpu: 57000000.gpu           __nvgpu_check_gpu_state:57   [ERR]  Rebooting system!!

[31998.981863] reboot: Restarting system

[0000.159] [L4T TegraBoot] (version 00.00.2018.01-l4t-80a468da)

[0000.165] Processing in cold boot mode Bootloader 2

[0000.170] A02 Bootrom Patch rev = 1023

[0000.173] Power-up reason: software reset

[0000.177] No Battery Present

[0000.179] pmic max77620 reset reason

[0000.183] pmic max77620 NVERC : 0x0

At this point the situation is the following:

A02 + SSD USB3.0 + test3-app = reboot with GPU disappeared
B01 + SSD USB3.0 + test3-app = reboot with GPU disappeared and PCIe error
A02 + new SD + test3-app = reboot with GPU disappeared
B01 + new SD + test3-app = reboot with GPU disappeared and PCIe error

Do you have any suggestions?

Thanks again for the support!!!

Hi borelli.g92,

I will use an A02 board to run this test for 1 day today and see if I can reproduce this issue.

As for B01, I need to wait a few days before I can find a B01 module.

I also want to know: what is the spec of the power supply you are using?

Dear @WayneWWW,

Thank you very much!

I am sharing the source file that I am using as test3.
As mentioned, you will see that I have disabled the standard sink and replaced it with a fakesink.
deepstream_test3_app.c (17.1 KB) dstest3_pgie_config.txt (3.5 KB)

I also thought that the problem might be caused by overheating.
I have always had a fan running on the heatsink, and the A0 temperature never exceeds 45 °C.
However, to be sure, I have also tried running the system with a 230 V fan; please see the pictures below.

Unfortunately, nothing changed. Thus, I do not think that heating is the problem.

Regarding the power supply, I have tried running the system with two different power supplies:

As you can see, the power available in both cases is definitely higher than needed.
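
In case it helps, tegrastats can also be left running in the background to keep an eye on the clocks and temperatures during the test (the --interval value is in milliseconds; exact options may vary by L4T release):

  # sample once per second and keep a copy of the output for later inspection
  sudo tegrastats --interval 1000 | tee tegrastats.log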

Thank you very much again for your support!!

Hi,

Actually, I am not working on the deepstream application, so I am not very sure what the difference is between your case and the original sample.

Thus, I also want to know whether you hit this issue even with the original deepstream sample from our side.

We need to align on the use case we are running here.

Hi @WayneWWW,

The only difference in the source file is at lines 382 to 389:

  /* Finally render the osd output */
#ifdef PLATFORM_TEGRA
  //transform = gst_element_factory_make ("nvegltransform", "nvegl-transform");
  transform = gst_element_factory_make ("queue", "queue");
#endif
  //sink = gst_element_factory_make ("nveglglessink", "nvvideo-renderer");
//  sink = gst_element_factory_make ("nvoverlaysink", "nvvideo-renderer");
  sink = gst_element_factory_make ("fakesink", "nvvideo-renderer");

Regarding the config file, the difference is that I am using the models optimized for Jetson Nano by NVIDIA.

Both modifications were done according to what @DaneLLL suggests here:

Those modifications are necessary; otherwise the Jetson Nano cannot keep up with the RTSP video stream and skips a lot of frames.

Thanks again!

P.S. I am using only 1 RTSP video source at 640*480 resolution and 15 FPS.

Dear @WayneWWW,

I have been testing the following configuration:
A02 model + new SD installation + test3-app
With a modification to the test3 pipeline:

#ifdef PLATFORM_TEGRA
  gst_bin_add_many (GST_BIN (pipeline), queue1, pgie, queue2, tiler, queue3,
      nvvidconv, queue4, transform, sink, NULL);  
  if (!gst_element_link_many (streammux, queue1, pgie, queue2, tiler, queue3,
        nvvidconv, queue4, transform, sink, NULL)) {
    g_printerr ("Elements could not be linked. Exiting.\n");
    return -1;
  }
#else 

I removed nvosd from the pipeline.
The result is that the system ran considerably longer before the usual reboot.
Almost 28 hours!!
Here is the log:

NvRmMemHanldeAllocAttr() or relevant. 
[240320.269192] ------------[ cut here ]------------
[240320.274959] WARNING: CPU: 1 PID: 2140 at /dvs/git/dirty/git-master_linux/kernel/nvgpu/drivers/gpu/nvgpu/gk20a/gk20a.c:64 __gk20a_warn_on_no_regs+0x34/0x50 [nvgpu]
[240320.297296] ---[ end trace 6ca8f5afd7c1b41c ]---
[240320.321214] nvgpu: 57000000.gpu           __nvgpu_check_gpu_state:56   [ERR]  GPU has disappeared from bus!!
[240320.331202] nvgpu: 57000000.gpu           __nvgpu_check_gpu_state:57   [ERR]  Rebooting system!!
[240320.404744] reboot: Restarting system
[0000.159] [L4T TegraBoot] (version 00.00.2018.01-l4t-80a468da)
[0000.165] Processing in cold boot mode Bootloader 2
[0000.169] A02 Bootrom Patch rev = 1023
[0000.173] Power-up reason: software reset
[0000.177] No Battery Present
[0000.179] pmic max77620 reset reason
[0000.183] pmic max77620 NVERC : 0x0
[0000.186] RamCode = 0
[0000.188] Platform has DDR4 type RAM
[0000.192] max77620 disabling SD1 Remote Sense
[0000.196] Setting DDR voltage to 1125mv
[0000.200] Serial Number of Pmic Max77663: 0x1235e9
[0000.208] Entering ramdump check
[0000.211] Get RamDumpCarveOut = 0x0
[0000.214] RamDumpCarveOut=0x0,  RamDumperFlag=0xe59ff3f8
[0000.219] Last reboot was clean, booting normally!
[0000.224] Sdram initialization is successful 

I believe that the problem is related to some sort of overheating of a component that is not monitored by the system’s temperature sensors.
I have found the following post on the forum: Drive PX2 rebooting at high CPU load
The problem in that post was that the fan was NOT spinning, so the GPU was reaching high temperatures.
In my case the fan is spinning very well and, as you have seen from my previous post, I have also tried a 230 V high-power fan. Same result.

In any case, today I looked more closely at the log you can see above. I would like to focus on the following lines:

[0000.173] Power-up reason: software reset
[0000.177] No Battery Present
[0000.179] pmic max77620 reset reason

MAX77620 is a power management IC. Might the reboot be linked to overheating of a component?

Thanks again!!

Hi,

I don’t think you should refer to posts about the PX2, because it uses a different SoC from the Nano.

Actually, we ran our A02 device with test3-app for 2 days: on the first day we used the original sample from JetPack and could not reproduce. Then we modified it based on your patch and ran for another 19 hours, and still could not reproduce your issue.

Did you configure your board with nvpmodel and jetson_clocks?

Hi @WayneWWW,

I have checked the mode with: sudo nvpmodel -q --verbose
This is the result:

NVPM WARN: fan mode is not set!
NV Power Mode: MAXN
0 

I have also checked jetson_clocks with: sudo jetson_clocks --show
This is the result:

SOC family:tegra210  Machine:NVIDIA Jetson Nano Developer Kit
Online CPUs: 0-3
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=1132800 IdleStates: WFI=1 c7=1 
cpu1: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=921600 IdleStates: WFI=1 c7=1 
cpu2: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=1036800 IdleStates: WFI=1 c7=1 
cpu3: Online=1 Governor=schedutil MinFreq=102000 MaxFreq=1479000 CurrentFreq=825600 IdleStates: WFI=1 c7=1 
GPU MinFreq=76800000 MaxFreq=921600000 CurrentFreq=76800000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=0
Fan: speed=0
NV Power Mode: MAXN

Do you see any differences with respect to your power configuration?

Thanks again!!

Hi,

I meant running jetson_clocks to push the clocks to their limit.
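
For example, something along these lines should lock the CPU, GPU and EMC clocks at their maximum (the nvpmodel mode number may vary by JetPack release):

  sudo nvpmodel -m 0         # select the MAXN power mode (mode 0 on Nano)
  sudo jetson_clocks         # lock the clocks at the maximum for that mode
  sudo jetson_clocks --show  # verify the MinFreq/MaxFreq values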

Hi @WayneWWW,

I do not really understand why raising the GPU clock should solve this problem, in theory.
However, as explained here: https://www.jetsonhacks.com/2019/04/10/jetson-nano-use-more-power/
the jetson_clocks utility influences the behaviour of the Dynamic Voltage and Frequency Scaling (DVFS).

Maybe after many hours of dynamic scaling of the GPU’s voltage that mechanism fails and some kind of brown-out reset intervenes. I am just supposing here.
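
As a side note, the devfreq governor and current GPU frequency can also be inspected directly from sysfs (the exact node name may differ between L4T releases):

  # print governor and current frequency of every devfreq device;
  # the GPU node is typically named after its address, e.g. 57000000.gpu
  for d in /sys/class/devfreq/*; do
      echo "$d: governor=$(cat $d/governor) cur_freq=$(cat $d/cur_freq) Hz"
  done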
Now I am running the system with frequency-override on:

SOC family:tegra210  Machine:NVIDIA Jetson Nano Developer Kit
Online CPUs: 0-3
CPU Cluster Switching: Disabled
cpu0: Online=1 Governor=schedutil MinFreq=1479000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0 
cpu1: Online=1 Governor=schedutil MinFreq=1479000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0 
cpu2: Online=1 Governor=schedutil MinFreq=1479000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0 
cpu3: Online=1 Governor=schedutil MinFreq=1479000 MaxFreq=1479000 CurrentFreq=1479000 IdleStates: WFI=0 c7=0 
GPU MinFreq=921600000 MaxFreq=921600000 CurrentFreq=921600000
EMC MinFreq=204000000 MaxFreq=1600000000 CurrentFreq=1600000000 FreqOverride=1
Fan: speed=0
NV Power Mode: MAXN

I will let you know how it goes.

Thanks!!!

Hi,

Frankly speaking, I would like you to RMA both of your devices and try to get a new one.
Since we cannot reproduce this issue with the same application, the reason must be something else.

But before you do that, I have still filed a ticket internally for our team to check this issue.
Maybe we should be able to reproduce it too and were just lucky enough to escape it.

How long is the average time to reproduce this issue? Is it >10 hours?

Also, since we are out of ideas, it is worth checking whether this is related to the release revision.

For example, the earliest release that supports the A02 Nano is r32.1, so maybe you can run the application on that release and see if you still see the issue.

If it does not appear there, then try rel-32.2/rel-32.3.1…

Hi @WayneWWW,

What is the procedure for RMA?
Could you kindly point it out?

Here is a summary of the tests that I have done:

Date Seconds Hours Model Boot Power Supply PCIe error What's new
22/08/20 75069 21 A02 SD ATX, 5V, 15A No
23/08/20 95304 26 A02 SD ATX, 5V, 15A No
24/08/20 33548 9 A02 SD ATX, 5V, 15A No
26/08/20 5533 2 A02 USB3.0 SSD ATX, 5V, 15A No I have booted from USB3.0 SSD
01/09/20 22163 6 A02 USB3.0 SSD 5V, 4A No I changed power supply
02/09/20 37080 10 A02 USB3.0 SSD 5V, 4A No
06/09/20 76130 21 B01 USB3.0 SSD 5V, 4A Yes I bought a new board: B01 model
07/09/20 25316 7 B01 USB3.0 SSD 5V, 4A Yes Logged temperature 38-45°C
08/09/20 5705 2 B01 USB3.0 SSD 5V, 4A Yes
09/09/20 44245 12 B01 SD 5V, 4A Yes Virgin SD installation
10/09/20 108365 30 B01 SD 5V, 4A Few No inference running, NO CRASH
11/09/20 31998 9 A02 SD 5V, 4A No Virgin SD installation
12/09/20 19152 5 A02 SD 5V, 4A No Big Fan 230V
14/09/20 96681 27 A02 SD 5V, 4A No No OSD in deepstream pipeline
15/09/20 Running Running A02 SD 5V, 4A No Only PGIE in deepstream pipeline, after 14 hours I run jetson_clocks

You can see the hours for every test that I have run.

Now I am waiting to see what happens in this last test.
I will keep you updated.

Thanks

Many thanks for this summary.

So only the B01 board has the PCIe error, right?
Does every entry on the list hit the GPU error?

Hi @WayneWWW,

I took some time to get back to you because I wanted to be relatively sure that the problem was solved.
I have tried running the following configurations:

Date Seconds Hours Model Boot Power Supply PCIe error What's new
17/09/20 82655 22 A02 SD 5V, 4A No JetsonClocks activated, NO CRASH
21/09/20 110970 31 A02 USB3.0 SSD 5V, 4A No JetsonClocks activated, NO CRASH

I have recorded no crashes with jetson_clocks activated!
Apparently that solved the problem.
The temperature of the A0 sensor is also lower (35 °C on average now).

Also, in the previous table, NO CRASH means that I had no problem. Unfortunately, before I discovered that jetson_clocks solves the problem, that only happened when NO inference was running.

Do you think that the dynamic management of the GPU voltage due to Dynamic Voltage and Frequency Scaling (DVFS) might have caused such a malfunction after long usage?

Thanks!!


Hi @WayneWWW,

We encountered the same problem. The PCIe error:

[  343.992555] nvmap_alloc_handle: PID 8709: deepstream-app: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant.
[183917.087822] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)
[183917.098132] pcieport 0000:00:02.0:   device [10de:0faf] error status/mask=00000001/00002000
[183917.106600] pcieport 0000:00:02.0:    [ 0] Receiver Error         (First)
[281785.725034] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)
[281785.735398] pcieport 0000:00:02.0:   device [10de:0faf] error status/mask=00000001/00002000
[281785.743889] pcieport 0000:00:02.0:    [ 0] Receiver Error         (First)
[282799.166960] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[282799.174265] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G         C      4.9.140-l4t-r32.4.2 #1
[282799.182597] Hardware name: NVIDIA Jetson Nano Developer Kit (DT)
[282799.188675] Call trace:
[282799.191206] [<ffffff800808c490>] dump_backtrace+0x0/0x1b0
[282799.196680] [<ffffff800808c664>] show_stack+0x24/0x30
[282799.201807] [<ffffff800843cd40>] dump_stack+0x98/0xc0
[282799.206935] [<ffffff80081b7dcc>] panic+0x12c/0x290
[282799.211801] [<ffffff8008178d50>] watchdog_nmi_enable+0x0/0x60
[282799.217620] [<ffffff8008177e84>] watchdog_timer_fn+0x8c/0x2a0
[282799.223439] [<ffffff800813b038>] __hrtimer_run_queues+0x120/0x388
[282799.229603] [<ffffff800813b9c8>] hrtimer_interrupt+0xa8/0x1d8
[282799.235423] [<ffffff8008bbc888>] tegra210_timer_isr+0x38/0x48
[282799.241243] [<ffffff80081233d8>] __handle_irq_event_percpu+0x60/0x280
[282799.247755] [<ffffff8008123638>] handle_irq_event_percpu+0x40/0x98
[282799.254007] [<ffffff80081236e0>] handle_irq_event+0x50/0x80
[282799.259653] [<ffffff800812768c>] handle_fasteoi_irq+0xc4/0x1a0
[282799.265558] [<ffffff800812240c>] generic_handle_irq+0x34/0x50
[282799.271376] [<ffffff8008122adc>] __handle_domain_irq+0x6c/0xc0
[282799.277280] [<ffffff8008080db4>] gic_handle_irq+0x54/0xa8
[282799.282752] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[282799.287704] [<ffffff8008b62bf0>] cpuidle_enter_state+0xb8/0x380
[282799.293696] [<ffffff8008b62f2c>] cpuidle_enter+0x34/0x48
[282799.299082] [<ffffff8008113034>] call_cpuidle+0x44/0x68
[282799.304380] [<ffffff8008113374>] cpu_startup_entry+0x18c/0x210
[282799.310286] [<ffffff8008092a24>] secondary_start_kernel+0x13c/0x160
[282799.316624] [<00000000841511a4>] 0x841511a4
[282799.320884] SMP: stopping secondary CPUs
[282800.383788] SMP: failed to stop secondary CPUs 0,3
[282800.388658] Kernel Offset: disabled
[282800.392223] Memory Limit: none
[282800.404857] Rebooting in 1 seconds..
[282801.408567] SMP: stopping secondary CPUs
[282802.469903] SMP: failed to stop secondary CPUs 0,3

I’m not always getting the kernel panic. When the device keeps running and I restart the deepstream application, I see the following error:

[500510.539999] nvmap_alloc_handle: PID 31476: deepstream-app: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant. 
[500512.260472] nvgpu: 57000000.gpu        gk20a_gr_handle_fecs_error:5281 [ERR]  fecs watchdog triggered for channel 507, cannot ctxsw anymore !!
[500512.273460] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:129  [ERR]  gr_fecs_os_r : 0
[500512.282369] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:131  [ERR]  gr_fecs_cpuctl_r : 0x40
[500512.291864] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:133  [ERR]  gr_fecs_idlestate_r : 0x1
[500512.301491] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:135  [ERR]  gr_fecs_mailbox0_r : 0x1
[500512.310943] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:137  [ERR]  gr_fecs_mailbox1_r : 0x0
[500512.320428] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:139  [ERR]  gr_fecs_irqstat_r : 0x0
[500512.329806] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:141  [ERR]  gr_fecs_irqmode_r : 0x4
[500512.339228] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:143  [ERR]  gr_fecs_irqmask_r : 0x8704
[500512.348900] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:145  [ERR]  gr_fecs_irqdest_r : 0x0
[500512.358284] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:147  [ERR]  gr_fecs_debug1_r : 0x40
[500512.367812] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:149  [ERR]  gr_fecs_debuginfo_r : 0x0
[500512.377414] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:151  [ERR]  gr_fecs_ctxsw_status_1_r : 0xb04
[500512.387700] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(0) : 0x4
[500512.397862] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(1) : 0x0
[500512.407998] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(2) : 0x50009
[500512.418467] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(3) : 0x4000
[500512.428875] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
[500512.439463] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(5) : 0x0
[500512.449597] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(6) : 0x0
[500512.459707] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(7) : 0x0
[500512.469823] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(8) : 0x0
[500512.479988] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(9) : 0x0
[500512.490101] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(10) : 0x0
[500512.500305] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(11) : 0x3
[500512.510565] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(12) : 0x0
[500512.520772] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(13) : 0x0
[500512.531006] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(14) : 0x0
[500512.541219] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(15) : 0x0
[500512.551508] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:159  [ERR]  gr_fecs_engctl_r : 0x0
[500512.560770] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:161  [ERR]  gr_fecs_curctx_r : 0x0
[500512.570019] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:163  [ERR]  gr_fecs_nxtctx_r : 0x0
[500512.579262] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:169  [ERR]  FECS_FALCON_REG_IMB : 0xbadfbadf
[500512.589375] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:175  [ERR]  FECS_FALCON_REG_DMB : 0xbadfbadf
[500512.599480] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:181  [ERR]  FECS_FALCON_REG_CSW : 0xbadfbadf
[500512.609590] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:187  [ERR]  FECS_FALCON_REG_CTX : 0xbadfbadf
[500512.619708] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:193  [ERR]  FECS_FALCON_REG_EXCI : 0xbadfbadf
[500512.629905] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500512.639958] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500512.650032] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500512.660061] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500512.670123] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500512.680146] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500512.690177] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500512.700214] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf

Restarted application again:

[500515.004001] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 507
[500515.014522] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 506
[500515.024836] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 505
[500515.035135] nvgpu: 57000000.gpu   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 504
[500515.045440] nvgpu: 57000000.gpu     gk20a_fifo_handle_sched_error:2531 [ERR]  fifo sched ctxsw timeout error: engine=0, tsg=4, ms=3100
[500515.057902] ---- mlocks ----

[500515.062513] ---- syncpts ----
[500515.065610] id 1 (disp0_a) min 1 max 1 refs 1 (previous client : )
[500515.071915] id 2 (disp0_b) min 1 max 1 refs 1 (previous client : )
[500515.078223] id 3 (disp0_c) min 1 max 1 refs 1 (previous client : )
[500515.084530] id 7 (54340000.vic_0) min 38558471 max 38558471 refs 1 (previous client : 54340000.vic_0)
[500515.093880] id 8 (gm20b_507) min 263544 max 263550 refs 1 (previous client : gm20b_507)
[500515.102010] id 9 (gm20b_506) min 284660 max 284662 refs 1 (previous client : gm20b_506)
[500515.110146] id 11 (gm20b_505) min 190842 max 190842 refs 1 (previous client : gm20b_505)
[500515.118358] id 12 (gm20b_504) min 30917106 max 30917106 refs 1 (previous client : gm20b_504)
[500515.126917] id 13 (gm20b_503) min 576264 max 576264 refs 1 (previous client : gm20b_503)
[500515.135127] id 26 (vblank0) min 30024005 max -2 refs 1 (previous client : )

[500515.143947] ---- channels ----
[500515.147127] 
                channel 0 - 54340000.vic

[500515.153958] 0-54340000.vic (0): 
[500515.157132] active class 01, offset 0000, val 20000000
[500515.162378] DMAPUT 00000bb8, DMAGET 00000bb8, DMACTL 00000000
[500515.168236] CBREAD 20000000, CBSTAT 00010000
[500515.172638] The CDMA sync queue is empty.

[500515.178348] 
                channel 1 - 544c0000.nvenc

[500515.185343] 1-544c0000.nvenc (0): 
[500515.188683] inactive

[500515.192549] 
                ---- host general irq ----

[500515.199545] sync_hintmask_ext = 0xc0000000
[500515.203756] sync_hintmask = 0x80000000
[500515.207617] sync_intc0mask = 0x00000001
[500515.211564] sync_intmask = 0x00000011
[500515.215338] 
                ---- host syncpt irq mask ----

[500515.222683] syncpt_thresh_int_mask(0) = 0x00050001
[500515.227600] syncpt_thresh_int_mask(1) = 0x00000000
[500515.233072] syncpt_thresh_int_mask(2) = 0x00000000
[500515.238095] syncpt_thresh_int_mask(3) = 0x00000000
[500515.243053] syncpt_thresh_int_mask(4) = 0x00000000
[500515.248023] syncpt_thresh_int_mask(5) = 0x00000000
[500515.252982] syncpt_thresh_int_mask(6) = 0x00000000
[500515.257909] syncpt_thresh_int_mask(7) = 0x00000000
[500515.262856] syncpt_thresh_int_mask(8) = 0x00000000
[500515.267851] syncpt_thresh_int_mask(9) = 0x00000000
[500515.272788] syncpt_thresh_int_mask(10) = 0x00000000
[500515.277799] syncpt_thresh_int_mask(11) = 0x00000000
[500515.282826] 
                ---- host syncpt irq status ----

[500515.290369] syncpt_thresh_cpu0_int_status(0) = 0x00000000
[500515.295909] syncpt_thresh_cpu0_int_status(1) = 0x00000000
[500515.301446] syncpt_thresh_cpu0_int_status(2) = 0x00000000
[500515.306987] syncpt_thresh_cpu0_int_status(3) = 0x00000000
[500515.312516] syncpt_thresh_cpu0_int_status(4) = 0x00000000
[500515.318071] syncpt_thresh_cpu0_int_status(5) = 0x00000000
[500515.323601] 
                ---- host syncpt thresh ----

[500515.330818] syncpt_int_thresh_thresh_0(0) = 1
[500515.335732] syncpt_int_thresh_thresh_0(8) = 263546
[500515.340684] syncpt_int_thresh_thresh_0(9) = 284662
[500515.345738] gm20b pbdma 0: 
[500515.348505] id: 4 (tsg), next_id: 4 (tsg) chan status: invalid
[500515.354514] PBDMA_PUT: 0000001f0004a308 PBDMA_GET: 0000001f0004a308 GP_PUT: 00000d96 GP_GET: 00000d96 FETCH: 00000d96 HEADER: 60400000
                HDR: 00000000 SHADOW0: 0004a2f0 SHADOW1: 0000181f

[500515.372571] gm20b eng 0: 
[500515.375152] id: 4 (tsg), next_id: 4 (tsg), ctx status: save 
[500515.380949] busy 

[500515.383060] gm20b eng 1: 
[500515.385622] id: 5 (tsg), next_id: 5 (tsg), ctx status: valid 


[500515.395281] 503-gm20b, pid 31476, refs 2: 
[500515.399352] channel status:  in use idle not busy
[500515.404202] RAMFC : TOP: 8000001f00280078 PUT: 0000001f00280078 GET: 0000001f00280078 FETCH: 0000001f00280078
                HEADER: 60400000 COUNT: 80000000
                SYNCPOINT 00000000 00000d01 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.428064] 504-gm20b, pid 31476, refs 2: 
[500515.432339] channel status:  in use idle not busy
[500515.437168] RAMFC : TOP: 8000001f00240018 PUT: 0000001f00240018 GET: 0000001f00240018 FETCH: 0000001f00240018
                HEADER: 60400000 COUNT: 80000000
                SYNCPOINT 00000000 00000c01 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.461002] 505-gm20b, pid 31476, refs 2: 
[500515.465025] channel status:  in use idle not busy
[500515.469839] RAMFC : TOP: 8000001f00200018 PUT: 0000001f00200018 GET: 0000001f00200018 FETCH: 0000001f00200018
                HEADER: 60400000 COUNT: 80000000
                SYNCPOINT 00000000 00000b01 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.493811] 506-gm20b, pid 31476, refs 4: 
[500515.497842] channel status:  in use pending busy
[500515.502582] RAMFC : TOP: 8000001f00140018 PUT: 0000001f00140018 GET: 0000001f00140018 FETCH: 0000001f00140018
                HEADER: 60400000 COUNT: 80000000
                SYNCPOINT 00000000 00000901 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.526437] 507-gm20b, pid 31476, refs 8: 
[500515.530645] channel status:  in use pending busy
[500515.536146] RAMFC : TOP: 8000001f0004a308 PUT: 0000001f0004a308 GET: 0000001f0004a308 FETCH: 0000001f0004a308
                HEADER: 60400000 COUNT: 80000000
                SYNCPOINT 00000000 00000801 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.560156] 508-gm20b, pid 4158, refs 2: 
[500515.564123] channel status:  in use idle not busy
[500515.568962] RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                HEADER: 60400000 COUNT: 00000000
                SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.592813] 509-gm20b, pid 4158, refs 2: 
[500515.596858] channel status:  in use idle not busy
[500515.601705] RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                HEADER: 60400000 COUNT: 00000000
                SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.625558] 510-gm20b, pid 4158, refs 2: 
[500515.629510] channel status:  in use idle not busy
[500515.634474] RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                HEADER: 60400000 COUNT: 00000000
                SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.658345] 511-gm20b, pid 4158, refs 2: 
[500515.662292] channel status:  in use idle not busy
[500515.667119] RAMFC : TOP: 0000000000000000 PUT: 0000000000000000 GET: 0000000000000000 FETCH: 0000000000000000
                HEADER: 60400000 COUNT: 00000000
                SYNCPOINT 00000000 00000000 SEMAPHORE 00000000 00000000 00000000 00000000

[500515.691213] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1721 [ERR]  fake mmu fault on engine 0, engine subid 1 (hub), client 11 (mspdec), addr 0x6e6b147000, type 2 (pte), access_type 0x00000000,inst_ptr 0xac8f4000
[500515.711201] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:129  [ERR]  gr_fecs_os_r : 0
[500515.720000] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:131  [ERR]  gr_fecs_cpuctl_r : 0x40
[500515.729320] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:133  [ERR]  gr_fecs_idlestate_r : 0x1
[500515.738828] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:135  [ERR]  gr_fecs_mailbox0_r : 0x1
[500515.748311] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:137  [ERR]  gr_fecs_mailbox1_r : 0x0
[500515.757745] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:139  [ERR]  gr_fecs_irqstat_r : 0x0
[500515.767169] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:141  [ERR]  gr_fecs_irqmode_r : 0x4
[500515.776516] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:143  [ERR]  gr_fecs_irqmask_r : 0x8704
[500515.786090] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:145  [ERR]  gr_fecs_irqdest_r : 0x0
[500515.795532] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:147  [ERR]  gr_fecs_debug1_r : 0x40
[500515.805000] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:149  [ERR]  gr_fecs_debuginfo_r : 0x0
[500515.814500] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:151  [ERR]  gr_fecs_ctxsw_status_1_r : 0xb04
[500515.824598] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(0) : 0x4
[500515.834703] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(1) : 0x0
[500515.844798] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(2) : 0x50009
[500515.855246] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(3) : 0x4000
[500515.865606] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
[500515.876137] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(5) : 0x0
[500515.886230] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(6) : 0x0
[500515.896328] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(7) : 0x0
[500515.906426] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(8) : 0x0
[500515.916530] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(9) : 0x0
[500515.926622] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(10) : 0x0
[500515.936823] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(11) : 0x3
[500515.947016] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(12) : 0x0
[500515.957204] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(13) : 0x0
[500515.967381] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(14) : 0x0
[500515.977568] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(15) : 0x0
[500515.987752] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:159  [ERR]  gr_fecs_engctl_r : 0x0
[500515.996983] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:161  [ERR]  gr_fecs_curctx_r : 0x0
[500516.006216] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:163  [ERR]  gr_fecs_nxtctx_r : 0x0
[500516.015528] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:169  [ERR]  FECS_FALCON_REG_IMB : 0xbadfbadf
[500516.025635] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:175  [ERR]  FECS_FALCON_REG_DMB : 0xbadfbadf
[500516.035738] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:181  [ERR]  FECS_FALCON_REG_CSW : 0xbadfbadf
[500516.045832] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:187  [ERR]  FECS_FALCON_REG_CTX : 0xbadfbadf
[500516.055932] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:193  [ERR]  FECS_FALCON_REG_EXCI : 0xbadfbadf
[500516.066112] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500516.076124] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500516.086135] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500516.096140] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500516.106149] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500516.116158] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500516.126169] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[500516.136176] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[500516.146181] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1726 [ERR]  gr_status_r : 0x81
[500516.156404] nvgpu: 57000000.gpu                    fifo_error_isr:2605 [ERR]  channel reset initiated from fifo_error_isr; intr=0x00000100
[500532.880044]

Will you guys please look into the issue? Have you found any problems in the frequency scaling governor?

Could you confirm whether using jetson_clocks also resolves the error for you?

The device has been running for 16 hours and I ran jetson_clocks at boot. As we can see in the previous logs, the error occurred after 51 hours (183917 seconds), so I can confirm on Monday.
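
For reference, running jetson_clocks at boot can be done with a small systemd unit along these lines (a minimal sketch; the unit name and binary path are only examples and may differ on your image):

sudo tee /etc/systemd/system/jetson-clocks.service > /dev/null <<'EOF'
[Unit]
Description=Lock Jetson clocks to maximum at boot
[Service]
Type=oneshot
ExecStart=/usr/bin/jetson_clocks
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable jetson-clocks.service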

Are there any side effects on temperature? At the moment I don’t see an issue, but the ambient temperature is only 11 °C; in the summer the ambient temperature can be a lot higher.

Have you guys looked into the DVFS governor already?

Hi,

Are you using a fan to cool down the Nano?
While running the tests I was logging the temperatures with a modified version of this Python script: https://github.com/tsutof/jetson-thermal-monitor
That might be helpful to understand whether the temperature is reaching high levels.
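
Even without that script, the on-module sensors can be logged with a simple loop over the standard thermal sysfs nodes, for example:

  # log every thermal zone (A0, CPU, GPU, ...) once per minute; temp is reported in millidegrees
  while true; do
      for z in /sys/devices/virtual/thermal/thermal_zone*; do
          echo "$(date +%s) $(cat $z/type) $(($(cat $z/temp) / 1000)) C"
      done
      sleep 60
  done >> thermal.log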

However, I have never encountered your errors.