Serial UART Bluetooth hangs the entire kernel and OS on Tegra K1 BSP

Hello Everyone,

We are trying to bring up a UART-interfaced Bluetooth chip (WL1835) on the TK1, and we are facing the issue below while pairing HID devices. The issue is more common during Bluetooth file transfer.

Issue description:

When we try to pair any device from the TK1, or send files from the TK1 to other devices, the entire OS intermittently hangs. Even the magic SysRq key does not respond, although I have verified that magic SysRq is enabled by checking:

cat /proc/sys/kernel/sysrq
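
For reference, the value printed by that command is 0 (disabled), 1 (everything enabled), or a bitmask of allowed functions. Below is a minimal Python sketch of decoding it, following the bit meanings given in the kernel's sysrq documentation. The function name `decode_sysrq` is only illustrative and is not part of any tool.

```python
# Bit meanings as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return a list describing what a /proc/sys/kernel/sysrq value allows."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    # Any other value is treated as a bitmask of individual functions.
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


if __name__ == "__main__":
    for sample in (0, 1, 176):
        print(sample, decode_sysrq(sample))
```

A value of 1 means every function is allowed, so a non-responsive SysRq key in that case is not a configuration problem.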

After a few seconds or minutes, the crash messages below are dumped to the debug console.

130|shell@esomtk1:/ $ [  571.643265] ------------[ cut here ]------------
[  571.647943] WARNING: at /home/gopinath/projects/android/sourcecode/kernel/kernel/watchdog.c:270 watchdog_timer_fn+0x288/0x2c4()
[  571.660006] Watchdog detected hard LOCKUP on cpu 0
[  571.664614] Modules linked in: wl18xx(O) wlcore_sdio(O) wlcore(O) mac80211(O) cfg80211(O) compat(O) tegra_usb_oc_detect_module tegra_usb_otg_switching tegra_camera videobuf2_dma_contig ov4682_csi_B(O) )
[  571.685893] CPU: 2 PID: 18 Comm: migration/2 Tainted: G           O 3.10.40-g072e283-dirty #1
[  571.694434] [<c00166c0>] (unwind_backtrace+0x0/0x13c) from [<c0012f18>] (show_stack+0x18/0x1c)
[  571.703041] [<c0012f18>] (show_stack+0x18/0x1c) from [<c00657f4>] (warn_slowpath_common+0x5c/0x74)
[  571.711989] [<c00657f4>] (warn_slowpath_common+0x5c/0x74) from [<c0065844>] (warn_slowpath_fmt+0x38/0x48)
[  571.721541] [<c0065844>] (warn_slowpath_fmt+0x38/0x48) from [<c00d3694>] (watchdog_timer_fn+0x288/0x2c4)
[  571.731018] [<c00d3694>] (watchdog_timer_fn+0x288/0x2c4) from [<c00900dc>] (__run_hrtimer+0x90/0x2b4)
[  571.740226] [<c00900dc>] (__run_hrtimer+0x90/0x2b4) from [<c0090eb8>] (hrtimer_interrupt+0x11c/0x2a4)
[  571.749443] [<c0090eb8>] (hrtimer_interrupt+0x11c/0x2a4) from [<c065af1c>] (arch_timer_handler_phys+0x30/0x38)
[  571.759436] [<c065af1c>] (arch_timer_handler_phys+0x30/0x38) from [<c00d7618>] (handle_percpu_devid_irq+0x88/0x1a4)
[  571.769858] [<c00d7618>] (handle_percpu_devid_irq+0x88/0x1a4) from [<c00d3a6c>] (generic_handle_irq+0x30/0x40)
[  571.779844] [<c00d3a6c>] (generic_handle_irq+0x30/0x40) from [<c000f8c0>] (handle_IRQ+0x48/0x98)
[  571.788617] [<c000f8c0>] (handle_IRQ+0x48/0x98) from [<c000850c>] (gic_handle_irq+0x58/0x15c)
[  571.797128] [<c000850c>] (gic_handle_irq+0x58/0x15c) from [<c000eb80>] (__irq_svc+0x40/0x70)
[  571.805549] Exception stack(0xdd011e38 to 0xdd011e80)
[  571.810589] 1e20:                                                       00000000 00000021
[  571.818752] 1e40: 00000000 00000001 dc8fbe50 00000001 dc8fbe64 a0000153 00000000 dc8fbe50
[  571.826914] 1e60: c00d2d44 00000002 00000000 dd011e80 c00d2c8c c00d2df8 60000153 ffffffff
[  571.835080] [<c000eb80>] (__irq_svc+0x40/0x70) from [<c00d2df8>] (stop_machine_cpu_stop+0xb4/0x114)
[  571.844112] [<c00d2df8>] (stop_machine_cpu_stop+0xb4/0x114) from [<c00d2c8c>] (cpu_stopper_thread+0x84/0x13c)
[  571.854017] [<c00d2c8c>] (cpu_stopper_thread+0x84/0x13c) from [<c00951c0>] (smpboot_thread_fn+0x16c/0x27c)
[  571.863658] [<c00951c0>] (smpboot_thread_fn+0x16c/0x27c) from [<c008c8c0>] (kthread+0xe0/0xe4)
[  571.872256] [<c008c8c0>] (kthread+0xe0/0xe4) from [<c000f058>] (ret_from_fork+0x14/0x20)
[  571.880331] ---[ end trace 81e17e812bb605ad ]---
[  575.449911] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 2, t=21008 jiffies, g=25645, c=25644, q=16)
[  575.460985] Backtrace for cpu 2 (current):
[  575.465073] CPU: 2 PID: 18 Comm: migration/2 Tainted: G        W  O 3.10.40-g072e283-dirty #1
[  575.473588] [<c00166c0>] (unwind_backtrace+0x0/0x13c) from [<c0012f18>] (show_stack+0x18/0x1c)
[  575.482188] [<c0012f18>] (show_stack+0x18/0x1c) from [<c0015480>] (smp_send_all_cpu_backtrace+0x78/0xd4)
[  575.491658] [<c0015480>] (smp_send_all_cpu_backtrace+0x78/0xd4) from [<c00dd2e8>] (rcu_check_callbacks+0x700/0x884)
[  575.502085] [<c00dd2e8>] (rcu_check_callbacks+0x700/0x884) from [<c0076928>] (update_process_times+0x48/0x74)
[  575.511993] [<c0076928>] (update_process_times+0x48/0x74) from [<c00bdde4>] (tick_sched_handle.isra.13+0x58/0x64)
[  575.522240] [<c00bdde4>] (tick_sched_handle.isra.13+0x58/0x64) from [<c00bde44>] (tick_sched_timer+0x54/0x80)
[  575.532139] [<c00bde44>] (tick_sched_timer+0x54/0x80) from [<c00900dc>] (__run_hrtimer+0x90/0x2b4)
[  575.541083] [<c00900dc>] (__run_hrtimer+0x90/0x2b4) from [<c0090eb8>] (hrtimer_interrupt+0x11c/0x2a4)
[  575.550287] [<c0090eb8>] (hrtimer_interrupt+0x11c/0x2a4) from [<c065af1c>] (arch_timer_handler_phys+0x30/0x38)
[  575.560272] [<c065af1c>] (arch_timer_handler_phys+0x30/0x38) from [<c00d7618>] (handle_percpu_devid_irq+0x88/0x1a4)
[  575.570691] [<c00d7618>] (handle_percpu_devid_irq+0x88/0x1a4) from [<c00d3a6c>] (generic_handle_irq+0x30/0x40)
[  575.580677] [<c00d3a6c>] (generic_handle_irq+0x30/0x40) from [<c000f8c0>] (handle_IRQ+0x48/0x98)
[  575.589447] [<c000f8c0>] (handle_IRQ+0x48/0x98) from [<c000850c>] (gic_handle_irq+0x58/0x15c)
[  575.597957] [<c000850c>] (gic_handle_irq+0x58/0x15c) from [<c000eb80>] (__irq_svc+0x40/0x70)
[  575.606378] Exception stack(0xdd011e38 to 0xdd011e80)
[  575.611418] 1e20:                                                       00000000 00000021
[  575.619580] 1e40: 00000000 00000001 dc8fbe50 00000001 dc8fbe64 a0000153 00000000 dc8fbe50
[  575.627742] 1e60: c00d2d44 00000002 00000000 dd011e80 c00d2c8c c00d2df8 60000153 ffffffff
[  575.635906] [<c000eb80>] (__irq_svc+0x40/0x70) from [<c00d2df8>] (stop_machine_cpu_stop+0xb4/0x114)
[  575.644938] [<c00d2df8>] (stop_machine_cpu_stop+0xb4/0x114) from [<c00d2c8c>] (cpu_stopper_thread+0x84/0x13c)
[  575.654835] [<c00d2c8c>] (cpu_stopper_thread+0x84/0x13c) from [<c00951c0>] (smpboot_thread_fn+0x16c/0x27c)
[  575.664474] [<c00951c0>] (smpboot_thread_fn+0x16c/0x27c) from [<c008c8c0>] (kthread+0xe0/0xe4)
[  575.673071] [<c008c8c0>] (kthread+0xe0/0xe4) from [<c000f058>] (ret_from_fork+0x14/0x20)
[  575.681144] 
[  575.681144] sending IPI to all other CPUs:
[  575.686706] IPI backtrace for cpu 1
[  575.690190] 
[  575.691679] CPU: 1 PID: 13 Comm: migration/1 Tainted: G        W  O 3.10.40-g072e283-dirty #1
[  575.700189] task: dda8eac0 ti: ddaae000 task.ti: ddaae000
[  575.705580] PC is at stop_machine_cpu_stop+0xb4/0x114
[  575.710623] LR is at cpu_stopper_thread+0x84/0x13c
[  575.715404] pc : [<c00d2df8>]    lr : [<c00d2c8c>]    psr: 600f0013
[  575.715404] sp : ddaafe80  ip : 00000000  fp : 00000002
[  575.726860] r10: c00d2d44  r9 : dc8fbe50  r8 : 00000001
[  575.732074] r7 : a00f0013  r6 : dc8fbe64  r5 : 00000001  r4 : dc8fbe50
[  575.738587] r3 : 00000001  r2 : 00000000  r1 : 00000020  r0 : 00000000
[  575.745105] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[  575.752398] Control: 10c5387d  Table: 9b04006a  DAC: 00000015

This issue can be reproduced easily.


Below are our observations:

1. The issue occurs in both the Linux for Tegra K1 BSP and the Shield tablet Android source BSP for Jetson TK1.

2. The issue is in the UART serial driver.

3. We confirmed there are no issues with the Bluetooth chip and firmware by interfacing it with an i.MX6 processor.

4. We tested different baud rates (3000000, 500000, 230400, 115200), but the same issue happens.

NOTE:

We are using the R21.4 Jetson release package (kernel version 3.10.40).

Thanks.

The watchdog firing is not unexpected when the system locks up, but the backtraces show only CPU 1 and CPU 2. The message about the lockup on CPU 0 doesn’t actually show anything (it’s one of those messages that says “there’s a problem” without giving any detail about where or why CPU 0 locked up). You might need to give detailed information on how to reproduce the issue.
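
To make the gap concrete, here is a rough Python sketch that scans a saved console log for the CPUs reported as locked up or stalled and compares them with the CPUs that actually produced a backtrace. The regular expressions are written against the message formats in the log posted above; the function name `summarize` and the `SAMPLE` text are only for illustration.

```python
import re

LOCKUP_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
STALL_RE = re.compile(r"detected stalls on CPUs/tasks: \{([^}]*)\}")
TRACE_RE = re.compile(r"(?:IPI backtrace|Backtrace) for cpu (\d+)")


def summarize(log_text):
    """Report which CPUs were flagged and which of them have no backtrace."""
    locked = {int(cpu) for cpu in LOCKUP_RE.findall(log_text)}
    stalled = set()
    for group in STALL_RE.findall(log_text):
        # The stall list looks like "{ 0}" or "{ 0 3}"; keep the numeric tokens.
        stalled.update(int(tok) for tok in group.split() if tok.isdigit())
    traced = {int(cpu) for cpu in TRACE_RE.findall(log_text)}
    return {
        "locked_up": sorted(locked),
        "stalled": sorted(stalled),
        "backtraced": sorted(traced),
        "missing": sorted((locked | stalled) - traced),
    }


# Four lines taken from the log in this thread.
SAMPLE = """\
[  571.660006] Watchdog detected hard LOCKUP on cpu 0
[  575.449911] INFO: rcu_preempt detected stalls on CPUs/tasks: { 0} (detected by 2, t=21008 jiffies, g=25645, c=25644, q=16)
[  575.460985] Backtrace for cpu 2 (current):
[  575.686706] IPI backtrace for cpu 1
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

For the posted log, CPU 0 ends up in the "missing" list: it is the CPU flagged by both the watchdog and the RCU stall detector, and it is the one with no backtrace.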

Hi LinuxDev,

Today I came up with further information and steps to reproduce this issue. The issue seems to be in the UART driver. Below is the setup I have here to reproduce it.

  1. Two UART interfaces are connected to the TK1 processor.

  2. One is for the UART Bluetooth chip and the other is for testing and other miscellaneous purposes.

  3. To isolate this issue, I am now testing with the other UART (/dev/ttyTHS2) connected to the TK1 processor.

  4. I have checked this issue on two different software platforms (Android for Jetson TK1 BSP and Linux for Jetson TK1 BSP). Both are officially released by NVIDIA.
    a) On Android, I installed the Serial Port API sample application. Links to that application:
    https://play.google.com/store/apps/details?id=android_serialport_api.sample&hl=en
    http://apk-dl.com/serial-port-api-sample

    b) I connected that UART to a Linux PC through a USB-to-UART converter.
    c) On the PC, I opened minicom on that UART and tested with different baud rates (3000000, 500000, 230400).
    d) The same settings were used on the TK1 Android device with that application.
    e) When I send and receive data, the OS suddenly hangs on my Android TK1 device with the exception shown above. The only way to recover is to hard reset the device.

The same issue can be reproduced on the Linux platform too; there, minicom can be used instead of the Android application.
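
The send-and-receive step can also be scripted instead of driven by hand from minicom. The following is only a sketch under assumptions: it uses Python's standard `termios` module on Linux, 8N1 framing with no hardware flow control, and it expects the data to come back on the same port (a loopback jumper or an echoing peer), none of which is stated in the setup above. The helper names are illustrative.

```python
import os
import termios


def baud_constant(baud):
    """Look up the termios speed constant, e.g. 3000000 -> termios.B3000000."""
    return getattr(termios, "B%d" % baud)


def configure_raw(fd, baud):
    """Put an open tty into raw 8N1 mode with a 2 second read timeout."""
    speed = baud_constant(baud)
    # attrs = [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs = termios.tcgetattr(fd)
    attrs[0] = 0
    attrs[1] = 0
    attrs[2] = termios.CS8 | termios.CREAD | termios.CLOCAL
    attrs[3] = 0
    attrs[4] = speed
    attrs[5] = speed
    attrs[6][termios.VMIN] = 0    # do not wait for a minimum byte count
    attrs[6][termios.VTIME] = 20  # give up on a read after 2.0 seconds
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def exchange(tx_fd, rx_fd, payload, chunk=256):
    """Write payload in chunks and read the same number of bytes back."""
    received = bytearray()
    for start in range(0, len(payload), chunk):
        block = payload[start:start + chunk]
        sent = 0
        while sent < len(block):
            sent += os.write(tx_fd, block[sent:])
        want = len(received) + len(block)
        while len(received) < want:
            data = os.read(rx_fd, want - len(received))
            if not data:
                raise OSError("read timed out or port closed")
            received += data
    return bytes(received)


def stress_port(path, baud, rounds=1000):
    """Return the first round with corrupted data, or None if all rounds match."""
    payload = bytes(range(256)) * 4
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    try:
        configure_raw(fd, baud)
        for i in range(rounds):
            if exchange(fd, fd, payload) != payload:
                return i
        return None
    finally:
        os.close(fd)
```

With a loopback in place, a call such as `stress_port("/dev/ttyTHS2", 3000000)` would keep the port busy in both directions, which is the condition under which the hang was observed.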

This issue is more common when the baud rate is high (here 3000000), but it occurs at all baud rates.
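
To put the baud rates in perspective, here is a small Python sketch of the byte rate and per-byte time at each rate tested. It assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte); the thread does not state the framing, so treat that as an assumption.

```python
def bytes_per_second(baud, bits_per_frame=10):
    """Payload bytes per second, assuming 10 wire bits per byte (8N1)."""
    return baud / bits_per_frame


def byte_time_us(baud, bits_per_frame=10):
    """Time on the wire for one byte, in microseconds."""
    return bits_per_frame * 1_000_000 / baud


if __name__ == "__main__":
    for baud in (115200, 230400, 500000, 3000000):
        print(baud, bytes_per_second(baud), round(byte_time_us(baud), 2))
```

Under that assumption, 3000000 baud means a new byte roughly every 3.3 microseconds, compared with about 87 microseconds at 115200, so the driver has far less time to service the port at the higher rate.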

Is there any limitation in the UART driver, or some known issue?

There was a patch for the serial UART a while back for the TX1 on R24.1, but not for the TK1. I’m not really sure that patch would even matter for the TK1, but I’m thinking this is the same issue (despite the TX1 being 64-bit and the TK1 32-bit, there is a lot in common among all Tegra products going back a very long way). This earlier patch was for devices using the nVidia lower latency interfaces (those named “/dev/ttyTHS*”). There is a serial port test program that was used as an aid to validate serial ports; it is a good way to exercise serial ports and is located here:
https://github.com/cbrake/linux-serial-test

The thread on that topic is long, and for the TX1, but may be of use. See:
https://devtalk.nvidia.com/default/topic/946770/jetson-tx1/serial-communication-issue-quot-got-overrun-errors-quot-/1

The particular patch is at this URL:
https://bitbucket.org/tealdrones/nvidia_tx1_patches

See this comment for the URL at the end of the thread:
https://devtalk.nvidia.com/default/topic/946770/jetson-tx1/serial-communication-issue-quot-got-overrun-errors-quot-/post/4943845/#4943845

Despite the different kernels I suspect this patch can be applied (line numbers might differ, so it might require some tweaks). Can you try that patch and see if it changes the issue?

Thanks. I will check and test those patches on the TK1 and update here in a day or two.

In the meantime, I have another question related to this UART issue. While debugging, I found that the issue is related to the DMA-enabled cases in the UART driver. The patch links you provided also deal with the DMA-enabled cases, so I also want to try the UART driver with DMA disabled.

So can you tell me, is there a way to disable DMA in this UART serial driver?

I’m not positive, but I think a serial UART will use DMA if its device name has the form “ttyTHS*”, whereas the ones without “THS” in the name will not use DMA. The THS devices were intended to have improved performance, whereas regular non-THS tty devices were not.
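
As a quick way to see which ports fall on which side of that naming rule, here is a small Python sketch that splits tty device paths by name. It only restates the rule of thumb above and does not inspect the driver or confirm whether DMA is actually in use; the function name is illustrative.

```python
import glob
import os


def split_by_driver_naming(paths):
    """Partition tty paths into (ttyTHS* names, all other names)."""
    ths, other = [], []
    for path in paths:
        name = os.path.basename(path)
        (ths if name.startswith("ttyTHS") else other).append(path)
    return sorted(ths), sorted(other)


if __name__ == "__main__":
    found = glob.glob("/dev/ttyTHS*") + glob.glob("/dev/ttyS*")
    ths, other = split_by_driver_naming(found)
    print("THS-named ports:", ths)
    print("other serial ports:", other)
```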

Hi,

The patch below may help resolve this issue.

https://patchwork.kernel.org/patch/4042131/

The Android kernel may require additional patches. You can refer to this link: http://www.bitsandqubits.com/2017/08/bluetooth-hcildisc-fix-deadlock.html