Jetson Xavier AGX nvgpu_timeout_expired

Hi,

I have JetPack 4.4.1 on this one and 4.3 on two others; same issue on all of them.
I have to flash it to go ahead with your pure-BSP suggestion.

We have 4 Xaviers with the same issue. We have another one with an older JetPack (R32 (release), REVISION: 2.1), running all the same applications and the same LTE module and drivers, that has worked for several months without any issue.

I removed the ‘quiet’ option; I think I don’t need to reboot to make it take effect?

When it gets rebooted, the OS boots normally.
We even tried stopping our application to see what happens; even without running any algorithm, we still had the same issue.

Hi,

I have to flash it to go ahead with your pure-BSP suggestion.

Could you also tell us what else is there besides the LTE driver?

We have 4 Xaviers with the same issue. We have another one with an older JetPack (R32 (release), REVISION: 2.1), running all the same applications and the same LTE module and drivers, that has worked for several months without any issue.

Do you mean the rel-32.2.1 is working fine?

I removed the ‘quiet’ option; I think I don’t need to reboot to make it take effect?

You need to reboot to make it take effect.
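
For reference, on L4T the ‘quiet’ option normally sits in the APPEND line of /boot/extlinux/extlinux.conf. A minimal sketch of the edit, assuming the stock boot setup (the remaining arguments shown here are illustrative, not your exact command line):

    # /boot/extlinux/extlinux.conf -- only the APPEND line is shown, other arguments are illustrative
    # before:
    APPEND ${cbootargs} quiet root=/dev/mmcblk0p1 rw rootwait console=ttyTCU0,115200n8
    # after ('quiet' removed; reboot afterwards for the extra console output to appear):
    APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait console=ttyTCU0,115200n8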

When it gets rebooted, the OS boots normally.
We even tried stopping our application to see what happens; even without running any algorithm, we still had the same issue.

So this issue is still triggered by the application, right? Can you describe what else is running if there is no algorithm?

The only drivers I add/modify are QMI_WWAN and Option, to get LTE working. Nothing else.
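
For context, both are standard in-tree USB drivers; a sketch of how they are typically built and loaded, assuming the stock kernel sources (config fragment and commands are for illustration only):

    # kernel config fragment (built as loadable modules)
    CONFIG_USB_NET_QMI_WWAN=m
    CONFIG_USB_SERIAL_OPTION=m

    # loading them at runtime
    sudo modprobe qmi_wwan
    sudo modprobe option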

Yes, that one works fine.

Whether we run the python3 application (which uses the GPU) or not, we still see this issue.
No other applications are running except NetworkManager, which takes care of the internet connection.
The rest is basically what Ubuntu has when it boots.

But didn’t you tell us you already disabled the driver in previous comments? Then it sounds like the BSP is pure now.

Whether we run the python3 application (which uses the GPU) or not, we still see this issue.
No other applications are running except NetworkManager, which takes care of the internet connection.

Why do you need to highlight NetworkManager here? Do you mean the LTE is still running?

No, I haven’t disabled the LTE yet.
I will flash the system tomorrow and leave it running and see what happens.
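
(For anyone following along: reflashing a stock BSP is done from the Linux_for_Tegra directory on the host PC, with the board in recovery mode. For an AGX Xavier devkit the usual command is along these lines; the board config name assumes the standard devkit, adjust for a custom carrier board:)

    # run from Linux_for_Tegra on the host, Xavier connected in recovery mode
    sudo ./flash.sh jetson-xavier mmcblk0p1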

OK. Then I guess we need to figure it out with the steps below.

First, clarify whether this issue is triggered by the extra LTE modules/drivers. If your application can still run well without LTE (maybe using Ethernet to transfer data), then the application itself is not the cause.

Second, put the LTE modules back and run only some network activity to see if you can hit this issue again.
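
As a concrete way to run those two steps without reflashing first, something along these lines could work (the iperf3 server address is a placeholder; any sustained network activity would do):

    # step 1: take the LTE drivers out of the picture and stress the Ethernet path
    sudo modprobe -r option qmi_wwan
    iperf3 -c <iperf3-server> -t 14400        # roughly 4 hours of sustained traffic

    # step 2: put the LTE modules back and repeat only the network activity
    sudo modprobe -a qmi_wwan option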

Hi, we encountered a similar issue and do not have any LTE modifications on a JP4.4 kernel:

Jan 28 17:07:24 client-14417 kernel: [  553.596676] INFO: rcu_preempt self-detected stall on CPU
Jan 28 17:07:24 client-14417 kernel: [  553.596891]     0-...: (5209 ticks this GP) idle=1dd/140000000000002/0 softirq=63308/63308 fqs=2550
Jan 28 17:07:24 client-14417 kernel: [  553.597073]      (t=5251 jiffies g=38367 c=38366 q=1146)
Jan 28 17:07:24 client-14417 kernel: [  553.597180] Task dump for CPU 0:
Jan 28 17:07:24 client-14417 kernel: [  553.597188] ksoftirqd/0     R  running task        0     3      2 0x00000002
Jan 28 17:07:24 client-14417 kernel: [  553.597199] Call trace:
Jan 28 17:07:24 client-14417 kernel: [  553.597212] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
Jan 28 17:07:24 client-14417 kernel: [  553.597219] [<ffffff800808c37c>] show_stack+0x24/0x30
Jan 28 17:07:24 client-14417 kernel: [  553.597243] [<ffffff80080ecf70>] sched_show_task+0xf8/0x148
Jan 28 17:07:24 client-14417 kernel: [  553.597250] [<ffffff80080efc70>] dump_cpu_task+0x48/0x58
Jan 28 17:07:24 client-14417 kernel: [  553.597257] [<ffffff80081c1acc>] rcu_dump_cpu_stacks+0xb8/0xec
Jan 28 17:07:24 client-14417 kernel: [  553.597264] [<ffffff8008132450>] rcu_check_callbacks+0x728/0xa48
Jan 28 17:07:24 client-14417 kernel: [  553.597269] [<ffffff8008138cac>] update_process_times+0x34/0x60
Jan 28 17:07:24 client-14417 kernel: [  553.597290] [<ffffff800814a218>] tick_sched_handle.isra.5+0x38/0x70
Jan 28 17:07:24 client-14417 kernel: [  553.597295] [<ffffff800814a29c>] tick_sched_timer+0x4c/0x90
Jan 28 17:07:24 client-14417 kernel: [  553.597299] [<ffffff80081399e0>] __hrtimer_run_queues+0xd8/0x360
Jan 28 17:07:24 client-14417 kernel: [  553.597303] [<ffffff800813a330>] hrtimer_interrupt+0xa8/0x1e0
Jan 28 17:07:24 client-14417 kernel: [  553.597310] [<ffffff8008bff910>] arch_timer_handler_phys+0x38/0x58
Jan 28 17:07:24 client-14417 kernel: [  553.597316] [<ffffff8008126f10>] handle_percpu_devid_irq+0x90/0x2b0
Jan 28 17:07:24 client-14417 kernel: [  553.597320] [<ffffff80081214f4>] generic_handle_irq+0x34/0x50
Jan 28 17:07:24 client-14417 kernel: [  553.597325] [<ffffff8008121bd8>] __handle_domain_irq+0x68/0xc0
Jan 28 17:07:24 client-14417 kernel: [  553.597329] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
Jan 28 17:07:24 client-14417 kernel: [  553.597332] [<ffffff8008082c28>] el1_irq+0xe8/0x194
Jan 28 17:07:24 client-14417 kernel: [  553.597350] [<ffffff8008dfaf30>] ip_local_deliver_finish+0x80/0x278
Jan 28 17:07:24 client-14417 kernel: [  553.597355] [<ffffff8008dfb67c>] ip_local_deliver+0x54/0xf0                                                        
Jan 28 17:07:24 client-14417 kernel: [  553.597359] [<ffffff8008dfb300>] ip_rcv_finish+0x1d8/0x3a0                                                         
Jan 28 17:07:24 client-14417 kernel: [  553.597364] [<ffffff8008dfb988>] ip_rcv+0x270/0x3a8                                                                
Jan 28 17:07:24 client-14417 kernel: [  553.597370] [<ffffff8008da96a0>] __netif_receive_skb_core+0x3b8/0xad8                                              
Jan 28 17:07:24 client-14417 kernel: [  553.597375] [<ffffff8008daca90>] __netif_receive_skb+0x28/0x78                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597379] [<ffffff8008dacb0c>] netif_receive_skb_internal+0x2c/0xb0                                              
Jan 28 17:07:24 client-14417 kernel: [  553.597383] [<ffffff8008dad734>] napi_gro_receive+0x15c/0x188                                                      
Jan 28 17:07:24 client-14417 kernel: [  553.597390] [<ffffff800894dd90>] eqos_napi_poll_rx+0x358/0x430                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597394] [<ffffff8008daed64>] net_rx_action+0xf4/0x358                                                          
Jan 28 17:07:24 client-14417 kernel: [  553.597399] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0                                                          
Jan 28 17:07:24 client-14417 kernel: [  553.597405] [<ffffff80080bb218>] irq_exit+0xd0/0x118                                                               
Jan 28 17:07:24 client-14417 kernel: [  553.597409] [<ffffff8008121bdc>] __handle_domain_irq+0x6c/0xc0                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597413] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0                                                          
Jan 28 17:07:24 client-14417 kernel: [  553.597417] [<ffffff8008082c28>] el1_irq+0xe8/0x194                                                                
Jan 28 17:07:24 client-14417 kernel: [  553.597422] [<ffffff80080baf3c>] run_ksoftirqd+0x4c/0x58                                                           
Jan 28 17:07:24 client-14417 kernel: [  553.597441] [<ffffff80080e07c8>] smpboot_thread_fn+0x160/0x248                                                     
Jan 28 17:07:24 client-14417 kernel: [  553.597445] [<ffffff80080dbe64>] kthread+0xec/0xf0                                                                 
Jan 28 17:07:24 client-14417 kernel: [  553.597450] [<ffffff80080838a0>] ret_from_fork+0x10/0x30                                                           
Jan 28 17:07:24 client-14417 kernel: [  553.600693] INFO: rcu_sched detected stalls on CPUs/tasks:                                                         
Jan 28 17:07:24 client-14417 kernel: [  553.600855]     0-...: (5206 ticks this GP) idle=1dd/140000000000002/0 softirq=63308/63308 fqs=2540                
Jan 28 17:07:24 client-14417 kernel: [  553.601004]     (detected by 1, t=5252 jiffies, g=1361, c=1360, q=24)                                              
Jan 28 17:07:24 client-14417 kernel: [  553.601124] Task dump for CPU 0:                                                                                   
Jan 28 17:07:24 client-14417 kernel: [  553.601129] ksoftirqd/0     R  running task        0     3      2 0x00000002                                       
Jan 28 17:07:24 client-14417 kernel: [  553.601139] Call trace:                                                                                            
Jan 28 17:07:24 client-14417 kernel: [  553.601154] [<ffffff80080863bc>] __switch_to+0x9c/0xc0                                                             
Jan 28 17:07:24 client-14417 kernel: [  553.601161] [<ffffff80082333c0>] kmem_cache_free+0x298/0x2e8                                                       
Jan 28 17:07:24 client-14417 kernel: [  553.601168] [<ffffff8008db9680>] dst_destroy+0x148/0x168                                                           
Jan 28 17:07:24 client-14417 kernel: [  553.601173] [<ffffff8008130308>] note_gp_changes+0x80/0xc0                                                         
Jan 28 17:07:24 client-14417 kernel: [  553.601178] [<ffffff80081304f8>] rcu_process_callbacks+0xb8/0x688                                                  
Jan 28 17:07:24 client-14417 kernel: [  553.601182] [<00ffffffffffffff>] 0xffffffffffffff

@fransklaver, could you please provide more details about any devices you have connected to your system and any specific kernel modules you added, or is it mainly the original JetPack you flashed?
Then at least we can find a similarity between our systems.
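
If it helps, the output of a few standard commands would make the two systems easy to compare (nothing Jetson-specific assumed here):

    lsusb                 # attached USB devices (cameras, hubs, modems)
    lsmod | sort          # loaded kernel modules, easy to diff against a stock board
    uname -r              # exact kernel version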

Thanks.

We have some cameras and their supporting modules attached to the Xaviers. The thing is that we seem to have a couple of devices on which this stall happens fairly reliably, while others, with exactly the same software and hardware, do not seem to show this behavior.

I would suggest filing a new topic. Apparently your use case is different from the original one.

Also, please share details about your issue, including:

  1. The JetPack version you are using (see the command sketch after this list for one way to check).
  2. The use case you are running.
  3. Is it a custom board or an NVIDIA devkit?
  4. If it is a custom board, is it possible to reproduce the error on a devkit?
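
For item 1, one quick way to read the exact L4T release string from a running board (the release file is standard on L4T installs):

    head -n 1 /etc/nv_tegra_release    # prints a line like the "R32 (release), REVISION: ..." string quoted earlier
    uname -r                           # kernel version, for completeness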

We have seen a similar pattern. We have 8 Xaviers in total, running exactly the same JetPack, kernel, and application; three of them show this random reboot, but five of them have been working without any issue even after several weeks of running.