CPU/SSH stalls while running a GStreamer pipeline in MAXN mode

When I run the GStreamer pipeline below for a while, the CPU or the SSH connection stalls, and sometimes the devkit reboots on its own.

The OS release is R32.4.4, and nvpmodel is set to MAXN with fan mode cool.

GStreamer pipeline:

gst-launch-1.0 nvcompositor name=comp \
        sink_0::xpos=0 sink_0::ypos=0 sink_0::width=640 sink_0::height=360 \
        sink_1::xpos=640 sink_1::ypos=0 sink_1::width=640 sink_1::height=360 \
        sink_2::xpos=0 sink_2::ypos=720 sink_2::width=320 sink_2::height=180 \
        ! "video/x-raw(memory:NVMM),width=2560,height=900" ! nv3dsink \
        videotestsrc pattern=ball ! video/x-raw,width=640,height=360 ! nvvidconv ! comp. \
        videotestsrc pattern=snow ! video/x-raw,width=640,height=360 ! nvvidconv ! comp. \
        videotestsrc ! video/x-raw,width=320,height=180 ! nvvidconv ! comp.

I found that the bluetooth hostwake interrupt count increased faster and faster, and CPU 0 usage rose to 100%:

grep blue /proc/interrupts ; sleep 10 ; grep blue /proc/interrupts 
 392:   33350325          0          0          0          0          0          0          0  tegra-gpio  192 Edge      bluetooth hostwake
 392:   34011500          0          0          0          0          0          0          0  tegra-gpio  192 Edge      bluetooth hostwake
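
For reference, a rough way to turn the two samples into a rate (a quick sketch, assuming the second field of the matching /proc/interrupts line is the CPU 0 counter, as in the output above):

# Sample the CPU 0 count of the bluetooth hostwake IRQ twice, 10 s apart,
# and print the average interrupt rate
a=$(grep blue /proc/interrupts | awk '{print $2}')
sleep 10
b=$(grep blue /proc/interrupts | awk '{print $2}')
echo "bluetooth hostwake: $(( (b - a) / 10 )) interrupts/s on CPU 0"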

There are some error/backtrace messages from the kernel:

[ 1011.854841] nvgpu: 17000000.gv11b    gk20a_fifo_handle_pbdma_intr_0:2722 [ERR]  semaphore acquire timeout!
[ 1011.855034] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 24 for ch 509

and …

[ 1634.735585] bpmp: mrq 22 took 2200000 us
[ 1648.016330] bpmp: mrq 22 took 3996000 us
[ 1746.936375] INFO: rcu_preempt self-detected stall on CPU
[ 1746.936540] 	0-...: (1 GPs behind) idle=feb/140000000000002/0 softirq=48488/48488 fqs=2149 
[ 1746.936692] 	 (t=5250 jiffies g=70901 c=70900 q=22672)
[ 1746.936807] Task dump for CPU 0:
[ 1746.936812] ksoftirqd/0     R  running task        0     3      2 0x00000002
[ 1746.936827] Call trace:
[ 1746.936847] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[ 1746.936855] [<ffffff800808c37c>] show_stack+0x24/0x30
[ 1746.936865] [<ffffff80080ecf70>] sched_show_task+0xf8/0x148
[ 1746.936871] [<ffffff80080efc70>] dump_cpu_task+0x48/0x58
[ 1746.936881] [<ffffff80081c1acc>] rcu_dump_cpu_stacks+0xb8/0xec
[ 1746.936892] [<ffffff8008132450>] rcu_check_callbacks+0x728/0xa48
[ 1746.936900] [<ffffff8008138cac>] update_process_times+0x34/0x60
[ 1746.936909] [<ffffff800814a218>] tick_sched_handle.isra.5+0x38/0x70
[ 1746.936914] [<ffffff800814a29c>] tick_sched_timer+0x4c/0x90
[ 1746.936920] [<ffffff80081399e0>] __hrtimer_run_queues+0xd8/0x360
[ 1746.936924] [<ffffff800813a330>] hrtimer_interrupt+0xa8/0x1e0
[ 1746.936936] [<ffffff8008bffe98>] arch_timer_handler_phys+0x38/0x58
[ 1746.936945] [<ffffff8008126f10>] handle_percpu_devid_irq+0x90/0x2b0
[ 1746.936951] [<ffffff80081214f4>] generic_handle_irq+0x34/0x50
[ 1746.936956] [<ffffff8008121bd8>] __handle_domain_irq+0x68/0xc0
[ 1746.936961] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[ 1746.936965] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[ 1746.936976] [<ffffff8008e00f04>] ip_local_out+0x34/0x68
[ 1746.936982] [<ffffff8008e01214>] ip_queue_xmit+0x124/0x398
[ 1746.936988] [<ffffff8008e1c314>] __tcp_transmit_skb+0x59c/0x980
[ 1746.936994] [<ffffff8008e1e180>] __tcp_send_ack.part.7+0xf0/0x140
[ 1746.936999] [<ffffff8008e1fb54>] tcp_send_ack+0x34/0x40
[ 1746.937006] [<ffffff8008e1075c>] __tcp_ack_snd_check+0x54/0xb0
[ 1746.937011] [<ffffff8008e1874c>] tcp_rcv_established+0x284/0x7b8
[ 1746.937016] [<ffffff8008e22330>] tcp_v4_do_rcv+0x108/0x248
[ 1746.937021] [<ffffff8008e24fbc>] tcp_v4_rcv+0xaac/0xc00
[ 1746.937027] [<ffffff8008dfb4b0>] ip_local_deliver_finish+0x80/0x278
[ 1746.937032] [<ffffff8008dfbbfc>] ip_local_deliver+0x54/0xf0
[ 1746.937037] [<ffffff8008dfb880>] ip_rcv_finish+0x1d8/0x3a0
[ 1746.937044] [<ffffff8008dfbf08>] ip_rcv+0x270/0x3a8
[ 1746.937055] [<ffffff8008da9c20>] __netif_receive_skb_core+0x3b8/0xad8
[ 1746.937060] [<ffffff8008dad010>] __netif_receive_skb+0x28/0x78
[ 1746.937065] [<ffffff8008dad08c>] netif_receive_skb_internal+0x2c/0xb0
[ 1746.937070] [<ffffff8008dadcb4>] napi_gro_receive+0x15c/0x188
[ 1746.937081] [<ffffff800894dd90>] eqos_napi_poll_rx+0x358/0x430
[ 1746.937086] [<ffffff8008daf2e4>] net_rx_action+0xf4/0x358
[ 1746.937091] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[ 1746.937100] [<ffffff80080bb218>] irq_exit+0xd0/0x118
[ 1746.937105] [<ffffff8008121bdc>] __handle_domain_irq+0x6c/0xc0
[ 1746.937110] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[ 1746.937114] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[ 1746.937124] [<ffffff80080e0734>] smpboot_thread_fn+0xcc/0x248
[ 1746.937129] [<ffffff80080dbe64>] kthread+0xec/0xf0
[ 1746.937134] [<ffffff80080838a0>] ret_from_fork+0x10/0x30
[ 1786.095398] INFO: rcu_preempt self-detected stall on CPU
[ 1786.095569] 	0-...: (3275 ticks this GP) idle=01d/140000000000002/0 softirq=48490/48490 fqs=2172 
[ 1786.095728] 	 (t=5250 jiffies g=70926 c=70925 q=20929)
[ 1786.095848] Task dump for CPU 0:
[ 1786.095855] ksoftirqd/0     R  running task        0     3      2 0x00000002
[ 1786.095878] Call trace:
[ 1786.095899] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[ 1786.095909] [<ffffff800808c37c>] show_stack+0x24/0x30
[ 1786.095921] [<ffffff80080ecf70>] sched_show_task+0xf8/0x148
[ 1786.095930] [<ffffff80080efc70>] dump_cpu_task+0x48/0x58
[ 1786.095942] [<ffffff80081c1acc>] rcu_dump_cpu_stacks+0xb8/0xec
[ 1786.095955] [<ffffff8008132450>] rcu_check_callbacks+0x728/0xa48
[ 1786.095965] [<ffffff8008138cac>] update_process_times+0x34/0x60
[ 1786.095976] [<ffffff800814a218>] tick_sched_handle.isra.5+0x38/0x70
[ 1786.095983] [<ffffff800814a29c>] tick_sched_timer+0x4c/0x90
[ 1786.095994] [<ffffff80081399e0>] __hrtimer_run_queues+0xd8/0x360
[ 1786.096001] [<ffffff800813a330>] hrtimer_interrupt+0xa8/0x1e0
[ 1786.096015] [<ffffff8008bffe98>] arch_timer_handler_phys+0x38/0x58
[ 1786.096026] [<ffffff8008126f10>] handle_percpu_devid_irq+0x90/0x2b0
[ 1786.096034] [<ffffff80081214f4>] generic_handle_irq+0x34/0x50
[ 1786.096040] [<ffffff8008121bd8>] __handle_domain_irq+0x68/0xc0
[ 1786.096047] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[ 1786.096054] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[ 1786.096066] [<ffffff8008d95758>] skb_release_all+0x30/0x40
[ 1786.096074] [<ffffff8008d958d8>] consume_skb+0x38/0x118
[ 1786.096086] [<ffffff8008e34778>] arp_process+0x160/0x708
[ 1786.096094] [<ffffff8008e34e70>] arp_rcv+0x118/0x1a8
[ 1786.096103] [<ffffff8008da9c20>] __netif_receive_skb_core+0x3b8/0xad8
[ 1786.096110] [<ffffff8008dad010>] __netif_receive_skb+0x28/0x78
[ 1786.096117] [<ffffff8008dad08c>] netif_receive_skb_internal+0x2c/0xb0
[ 1786.096124] [<ffffff8008dadcb4>] napi_gro_receive+0x15c/0x188
[ 1786.096137] [<ffffff800894dd90>] eqos_napi_poll_rx+0x358/0x430
[ 1786.096144] [<ffffff8008daf2e4>] net_rx_action+0xf4/0x358
[ 1786.096151] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[ 1786.096162] [<ffffff80080bb218>] irq_exit+0xd0/0x118
[ 1786.096169] [<ffffff8008121bdc>] __handle_domain_irq+0x6c/0xc0
[ 1786.096175] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[ 1786.096182] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[ 1786.096190] [<ffffff80080baf3c>] run_ksoftirqd+0x4c/0x58
[ 1786.096201] [<ffffff80080e07c8>] smpboot_thread_fn+0x160/0x248
[ 1786.096209] [<ffffff80080dbe64>] kthread+0xec/0xf0
[ 1786.096216] [<ffffff80080838a0>] ret_from_fork+0x10/0x30

Hi,
Do you hit the issue on the Xavier developer kit or on your custom board? Also, please check sudo tegrastats; it prints out thermal information. Maybe the device is overheating.
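
If the tegrastats output is hard to read, the thermal zones can also be checked directly through sysfs, for example (a sketch using the standard Linux thermal interface; zone names vary by board):

# Print each thermal zone name next to its temperature (in millidegrees C)
paste <(cat /sys/class/thermal/thermal_zone*/type) \
      <(cat /sys/class/thermal/thermal_zone*/temp)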

The issue occurs on the official devkit.
I touched the heat sink and it was very cool.

Please check whether these patches help with the bluetooth hostwake interrupt.

I will try the patch later.
For now I have a workaround: “rmmod bluedroid_pm”.
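
If the workaround needs to survive a reboot, a minimal sketch (assuming bluedroid_pm is built as a loadable module on the image, not built in):

# Unload the module now and keep it from loading at boot
sudo rmmod bluedroid_pm
echo "blacklist bluedroid_pm" | sudo tee /etc/modprobe.d/blacklist-bluedroid-pm.conf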


The patch might become the official solution, so we need your help to kindly validate it. Thanks.

I applied the patch and tested the nvcompositor pipeline again. It seems that the patch solved my issue.

While running the gst nvcompositor pipeline, CPU core 0 usage stays low and no longer rises to 100%.
The bluedroid_pm interrupt count no longer increases as fast as it did before.

However, it still increases while the pipeline is running; I don’t know whether that is the expected behaviour.

$ grep blue /proc/interrupts ; sleep 10; grep blue /proc/interrupts
392: 9301315 0 0 0 0 0 0 0 tegra-gpio 192 Level bluetooth hostwake
392: 9367621 0 0 0 0 0 0 0 tegra-gpio 192 Level bluetooth hostwake

If I stop the gst pipeline, the interrupt count does not increase (or increases only slowly).
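
To watch the counter continuously while starting and stopping the pipeline, something like this works for me (just a convenience over the grep/sleep commands above):

watch -n 1 'grep blue /proc/interrupts'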

Hi,

May I know which gst pipeline you are using now?

Also, is there any bluetooth/wifi card connected to your board? Is it an NV devkit or a custom board?

It’s a devkit running R32.4.4, and there is no bt/wifi card connected.
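
Even so, the bluedroid_pm module is still loaded on my image; a quick way to check which bluetooth-related modules are present (the grep pattern is only an illustration):

lsmod | grep -i -E 'blue|hci'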

The gst pipeline is:

gst-launch-1.0 nvcompositor name=comp \
        sink_0::xpos=0 sink_0::ypos=0 sink_0::width=640 sink_0::height=360 \
        sink_1::xpos=640 sink_1::ypos=0 sink_1::width=640 sink_1::height=360 \
        sink_2::xpos=0 sink_2::ypos=720 sink_2::width=320 sink_2::height=180 \
        ! "video/x-raw(memory:NVMM),width=2560,height=900" ! nv3dsink \
        videotestsrc pattern=ball ! video/x-raw,width=640,height=360 ! nvvidconv ! comp. \
        videotestsrc pattern=snow ! video/x-raw,width=640,height=360 ! nvvidconv ! comp. \
        videotestsrc ! video/x-raw,width=320,height=180 ! nvvidconv ! comp.

nvpmodel is:

$ sudo nvpmodel -q
NV Fan Mode:cool
NV Power Mode: MAXN
0
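
For reference, MAXN can also be selected explicitly with the mode ID printed above (0 is assumed to map to MAXN on this devkit):

sudo nvpmodel -m 0   # switch to the MAXN power mode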

My expectation is that there should be no interrupts at all after applying this patch.

How many devkits do you have? Is it possible to also try the gst pipeline on another Xavier?

I tried another Xavier devkit; the issue is present in both R32.3.1 and R32.4.4.
(The interrupt counter increases faster in R32.4.4, which sometimes causes the CPU/network connection to stall.)

In R32.5.1, the bluetooth hostwake counter is always 0.
But in R32.5.1, I have another issue when running the above pipeline: sometimes one of the CPU cores stays at 100% usage.

Can we have a clearer test setup here?
For example, we now have 2 devkits.

  1. Do both devices see the bluetooth hostwake interrupt on rel-32.5.1 without the patch?
  2. Do both devices see the bluetooth hostwake interrupt on rel-32.5.1 with the patch?

It sounds like this patch does not work for rel-32.4.4 or older releases, which is what I expect, since we always use the latest branch for debugging. There may be other patches on rel-32.5.1 that improve the bluetooth interrupt handling. For the bluetooth hostwake interrupt issue on older releases, I would suggest directly removing the bluedroid driver with rmmod as a workaround.

As for the CPU cores, may I know what you mean by “remain at 100% usage”? Have you already stopped the pipeline, yet the CPU cores still show 100% usage?
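
If possible, please also capture per-core utilization before and after stopping the pipeline, for example with mpstat (it comes from the sysstat package and may need to be installed first):

sudo apt-get install -y sysstat
mpstat -P ALL 1 5   # per-core CPU usage, 1-second interval, 5 samples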
