Jetson Xavier AGX nvgpu_timeout_expired

Hi,

I have JetPack 4.4.1 on this one and 4.3 on two others; same issue on all of them.
I have to flash it to go ahead with your pure-BSP suggestion.

We have 4 Xaviers with the same issue. We have another one with an older JetPack (R32 (release), REVISION: 2.1), running all the same applications and the same LTE module and drivers, that has worked for several months without any issue.

I removed the ‘quiet’ option; I think I don’t need to reboot to make it take effect?

When it gets rebooted, the OS boots normally.
We even tried stopping our application to see what happens; even without running any algorithm, we still had the same issue.

Hi,

I have to flash it to go ahead with your pure-BSP suggestion.

Could you also tell us what else is there besides the LTE driver?

We have 4 Xaviers with the same issue. We have another one with an older JetPack (R32 (release), REVISION: 2.1), running all the same applications and the same LTE module and drivers, that has worked for several months without any issue.

Do you mean the rel-32.2.1 is working fine?

I removed the ‘quiet’ option; I think I don’t need to reboot to make it take effect?

You need to reboot to make it take effect.
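
For reference, on L4T the ‘quiet’ option normally sits in the APPEND line of /boot/extlinux/extlinux.conf. A minimal sketch of the edit, assuming the stock boot setup (the remaining arguments shown here are illustrative, not your exact command line):

    # /boot/extlinux/extlinux.conf -- only the APPEND line is shown, other arguments are illustrative
    # before:
    APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait console=ttyTCU0,115200n8
    # after ('quiet' removed; reboot afterwards for the extra console output to appear):
    APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait console=ttyTCU0,115200n8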

When it gets rebooted, the OS boots normally.
We even tried stopping our application to see what happens; even without running any algorithm, we still had the same issue.

So this issue is still triggered by the application, right? Can you describe what else is running if there is no algorithm?

The only drivers I add/modify are QMI_WWAN and Option, to get LTE working. Nothing else.
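
For context, both are standard in-tree USB drivers; a sketch of how they are typically built and loaded, assuming the stock kernel sources (config fragment and commands are for illustration only):

    # kernel config fragment (built as loadable modules)
    CONFIG_USB_NET_QMI_WWAN=m
    CONFIG_USB_SERIAL_OPTION=m

    # loading them at runtime
    sudo modprobe qmi_wwan
    sudo modprobe option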

Yes, that one works fine.

Whether we run the python3 application (which uses the GPU) or not, we still see this issue.
No other applications are running except NetworkManager, which takes care of the internet connection.
The rest is basically what Ubuntu has when it boots.

But didn’t you tell us you already disabled the driver in previous comments? Then it sounds like the BSP is pure now.

Whether we run the python3 application (which uses the GPU) or not, we still see this issue.
No other applications are running except NetworkManager, which takes care of the internet connection.

Why do you need to highlight NetworkManager here? Do you mean the LTE is still running?

No, I haven’t disabled the LTE yet.
I will flash the system tomorrow and leave it running and see what happens.
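
(For anyone following along: reflashing a stock BSP is done from the Linux_for_Tegra directory on the host PC, with the board in recovery mode. For an AGX Xavier devkit the usual command is along these lines; the board config name assumes the standard devkit, adjust for a custom carrier board:)

    # run from Linux_for_Tegra on the host, Xavier connected in recovery mode
    sudo ./flash.sh jetson-xavier mmcblk0p1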

OK. Then I guess we need to figure it out with the steps below.

First, clarify whether this issue is triggered by the extra LTE modules/drivers. If your application can still run well without LTE (maybe using Ethernet to transfer data), then the application itself is not the cause.

Second, put the LTE modules back and run only some network activity to see if you can hit this issue again.
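
As a concrete way to run those two steps without reflashing first, something along these lines could work (the iperf3 server address is a placeholder; any sustained network activity would do):

    # step 1: take the LTE drivers out of the picture and stress the Ethernet path
    sudo modprobe -r option qmi_wwan
    iperf3 -c <iperf3-server> -t 14400        # roughly 4 hours of sustained traffic

    # step 2: put the LTE modules back and repeat only the network activity
    sudo modprobe -a qmi_wwan option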

Hi, we encountered a similar issue and do not have any LTE modifications on a JP4.4 kernel:

Jan 28 17:07:24 client-14417 kernel: [  553.596676] INFO: rcu_preempt self-detected stall on CPU
Jan 28 17:07:24 client-14417 kernel: [  553.596891]     0-...: (5209 ticks this GP) idle=1dd/140000000000002/0 softirq=63308/63308 fqs=2550
Jan 28 17:07:24 client-14417 kernel: [  553.597073]      (t=5251 jiffies g=38367 c=38366 q=1146)
Jan 28 17:07:24 client-14417 kernel: [  553.597180] Task dump for CPU 0:
Jan 28 17:07:24 client-14417 kernel: [  553.597188] ksoftirqd/0     R  running task        0     3      2 0x00000002
Jan 28 17:07:24 client-14417 kernel: [  553.597199] Call trace:
Jan 28 17:07:24 client-14417 kernel: [  553.597212] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
Jan 28 17:07:24 client-14417 kernel: [  553.597219] [<ffffff800808c37c>] show_stack+0x24/0x30
Jan 28 17:07:24 client-14417 kernel: [  553.597243] [<ffffff80080ecf70>] sched_show_task+0xf8/0x148
Jan 28 17:07:24 client-14417 kernel: [  553.597250] [<ffffff80080efc70>] dump_cpu_task+0x48/0x58
Jan 28 17:07:24 client-14417 kernel: [  553.597257] [<ffffff80081c1acc>] rcu_dump_cpu_stacks+0xb8/0xec
Jan 28 17:07:24 client-14417 kernel: [  553.597264] [<ffffff8008132450>] rcu_check_callbacks+0x728/0xa48
Jan 28 17:07:24 client-14417 kernel: [  553.597269] [<ffffff8008138cac>] update_process_times+0x34/0x60
Jan 28 17:07:24 client-14417 kernel: [  553.597290] [<ffffff800814a218>] tick_sched_handle.isra.5+0x38/0x70
Jan 28 17:07:24 client-14417 kernel: [  553.597295] [<ffffff800814a29c>] tick_sched_timer+0x4c/0x90
Jan 28 17:07:24 client-14417 kernel: [  553.597299] [<ffffff80081399e0>] __hrtimer_run_queues+0xd8/0x360
Jan 28 17:07:24 client-14417 kernel: [  553.597303] [<ffffff800813a330>] hrtimer_interrupt+0xa8/0x1e0
Jan 28 17:07:24 client-14417 kernel: [  553.597310] [<ffffff8008bff910>] arch_timer_handler_phys+0x38/0x58
Jan 28 17:07:24 client-14417 kernel: [  553.597316] [<ffffff8008126f10>] handle_percpu_devid_irq+0x90/0x2b0
Jan 28 17:07:24 client-14417 kernel: [  553.597320] [<ffffff80081214f4>] generic_handle_irq+0x34/0x50
Jan 28 17:07:24 client-14417 kernel: [  553.597325] [<ffffff8008121bd8>] __handle_domain_irq+0x68/0xc0
Jan 28 17:07:24 client-14417 kernel: [  553.597329] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
Jan 28 17:07:24 client-14417 kernel: [  553.597332] [<ffffff8008082c28>] el1_irq+0xe8/0x194
Jan 28 17:07:24 client-14417 kernel: [  553.597350] [<ffffff8008dfaf30>] ip_local_deliver_finish+0x80/0x278
Jan 28 17:07:24 client-14417 kernel: [  553.597355] [<ffffff8008dfb67c>] ip_local_deliver+0x54/0xf0                                                        
Jan 28 17:07:24 client-14417 kernel: [  553.597359] [<ffffff8008dfb300>] ip_rcv_finish+0x1d8/0x3a0                                                         
Jan 28 17:07:24 client-14417 kernel: [  553.597364] [<ffffff8008dfb988>] ip_rcv+0x270/0x3a8                                                                
Jan 28 17:07:24 client-14417 kernel: [  553.597370] [<ffffff8008da96a0>] __netif_receive_skb_core+0x3b8/0xad8                                              
Jan 28 17:07:24 client-14417 kernel: [  553.597375] [<ffffff8008daca90>] __netif_receive_skb+0x28/0x78                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597379] [<ffffff8008dacb0c>] netif_receive_skb_internal+0x2c/0xb0                                              
Jan 28 17:07:24 client-14417 kernel: [  553.597383] [<ffffff8008dad734>] napi_gro_receive+0x15c/0x188                                                      
Jan 28 17:07:24 client-14417 kernel: [  553.597390] [<ffffff800894dd90>] eqos_napi_poll_rx+0x358/0x430                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597394] [<ffffff8008daed64>] net_rx_action+0xf4/0x358                                                          
Jan 28 17:07:24 client-14417 kernel: [  553.597399] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0                                                          
Jan 28 17:07:24 client-14417 kernel: [  553.597405] [<ffffff80080bb218>] irq_exit+0xd0/0x118                                                               
Jan 28 17:07:24 client-14417 kernel: [  553.597409] [<ffffff8008121bdc>] __handle_domain_irq+0x6c/0xc0                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597413] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0                                                          
Jan 28 17:07:24 client-14417 kernel: [  553.597417] [<ffffff8008082c28>] el1_irq+0xe8/0x194                                                                
Jan 28 17:07:24 client-14417 kernel: [  553.597422] [<ffffff80080baf3c>] run_ksoftirqd+0x4c/0x58                                                           
Jan 28 17:07:24 client-14417 kernel: [  553.597441] [<ffffff80080e07c8>] smpboot_thread_fn+0x160/0x248                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597445] [<ffffff80080dbe64>] kthread+0xec/0xf0                                                                 
Jan 28 17:07:24 client-14417 kernel: [  553.597450] [<ffffff80080838a0>] ret_from_fork+0x10/0x30                                                           
Jan 28 17:07:24 client-14417 kernel: [  553.600693] INFO: rcu_sched detected stalls on CPUs/tasks:                                                         
Jan 28 17:07:24 client-14417 kernel: [  553.600855]     0-...: (5206 ticks this GP) idle=1dd/140000000000002/0 softirq=63308/63308 fqs=2540                
Jan 28 17:07:24 client-14417 kernel: [  553.601004]     (detected by 1, t=5252 jiffies, g=1361, c=1360, q=24)                                              
Jan 28 17:07:24 client-14417 kernel: [  553.601124] Task dump for CPU 0:                                                                                   
Jan 28 17:07:24 client-14417 kernel: [  553.601129] ksoftirqd/0     R  running task        0     3      2 0x00000002                                       
Jan 28 17:07:24 client-14417 kernel: [  553.601139] Call trace:                                                                                            
Jan 28 17:07:24 client-14417 kernel: [  553.601154] [<ffffff80080863bc>] __switch_to+0x9c/0xc0                                                             
Jan 28 17:07:24 client-14417 kernel: [  553.601161] [<ffffff80082333c0>] kmem_cache_free+0x298/0x2e8                                                       
Jan 28 17:07:24 client-14417 kernel: [  553.601168] [<ffffff8008db9680>] dst_destroy+0x148/0x168                                                           
Jan 28 17:07:24 client-14417 kernel: [  553.601173] [<ffffff8008130308>] note_gp_changes+0x80/0xc0                                                         
Jan 28 17:07:24 client-14417 kernel: [  553.601178] [<ffffff80081304f8>] rcu_process_callbacks+0xb8/0x688                                                  
Jan 28 17:07:24 client-14417 kernel: [  553.601182] [<00ffffffffffffff>] 0xffffffffffffff

@fransklaver, could you please provide more details about any devices you have connected to your system and any specific kernel modules you added, or is it mainly the original JetPack you flashed?
Then at least we can find a similarity between our systems.
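
If it helps, the output of a few standard commands would make the two systems easy to compare (nothing Jetson-specific assumed here):

    lsusb                 # attached USB devices (cameras, hubs, modems)
    lsmod | sort          # loaded kernel modules, easy to diff against a stock board
    uname -r              # exact kernel version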

Thanks.

We have some cameras and their supporting modules attached to the Xaviers. The thing is that we seem to have a couple of devices on which this stall happens fairly reliably, while others, with exactly the same software and hardware, do not seem to show this behavior.

I would suggest filing a new topic. Apparently your use case is different from the original one.

Also, please share details about your issue, including:

  1. The JetPack version you are using (see the command sketch after this list for one way to check).
  2. The use case you are running.
  3. Is it a custom board or an NVIDIA devkit?
  4. If it is a custom board, is it possible to reproduce the error on a devkit?
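
For item 1, one quick way to read the exact L4T release string from a running board (the release file is standard on L4T installs):

    head -n 1 /etc/nv_tegra_release    # prints a line like the "R32 (release), REVISION: ..." string quoted earlier
    uname -r                           # kernel version, for completeness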

We have seen a similar pattern. We have 8 Xaviers in total, running exactly the same JetPack, kernel, and application; three of them show this random reboot, but five of them have been working without any issue even after several weeks of running.