Reboot on Orin NX

JP5.1.1
A kernel NULL pointer dereference causes a reboot on Orin NX.

The NULL pointer dereference happens in the kernel serial (UART) driver and leads to a reboot.

[2024-03-16 07:27:20] [32840.435637] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[2024-03-16 07:27:20] [32840.442440] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[2024-03-16 07:27:20] [32840.449244] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[2024-03-16 07:27:20] [32840.462156] serial-tegra 3100000.serial: RxData PIO to tty layer failed
[2024-03-16 07:27:21] [32840.999609] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000001
[2024-03-16 07:27:21] [32841.008650] Mem abort info:
[2024-03-16 07:27:21] [32841.011520]   ESR = 0x96000044
[2024-03-16 07:27:21] [32841.014656]   EC = 0x25: DABT (current EL), IL = 32 bits
[2024-03-16 07:27:21] [32841.020104]   SET = 0, FnV = 0
[2024-03-16 07:27:21] [32841.023241]   EA = 0, S1PTW = 0
[2024-03-16 07:27:21] [32841.026463] Data abort info:
[2024-03-16 07:27:21] [32841.029410]   ISV = 0, ISS = 0x00000044
[2024-03-16 07:27:21] [32841.033343]   CM = 0, WnR = 1
[2024-03-16 07:27:21] [32841.036388] user pgtable: 4k pages, 48-bit VAs, pgdp=000000013a1c2000
[2024-03-16 07:27:21] [32841.042994] [0000000000000001] pgd=0000000000000000, p4d=0000000000000000
[2024-03-16 07:27:21] [32841.049979] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[2024-03-16 07:27:21] [32841.055712] Modules linked in: input_leds fuse nvidia_modeset(OE) r8168(OE) xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ia
[2024-03-16 07:27:21] [32841.055783]  spi_tegra114 nvidia(OE) binfmt_misc nvmap ip_tables x_tables [last unloaded: mtd]
[2024-03-16 07:27:21] [32841.155723] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  OE     5.10.104-tegra #1
[2024-03-16 07:27:21] [32841.163915] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS r35.3.1-5e812e4-dirty 11/05/2023
[2024-03-16 07:27:21] [32841.176027] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[2024-03-16 07:27:21] [32841.182183] pc : tegra_uart_rx_error_handle_timer+0x5c/0x80
[2024-03-16 07:27:21] [32841.187882] lr : tegra_uart_rx_error_handle_timer+0x28/0x80
[2024-03-16 07:27:21] [32841.193587] sp : ffff800010003da0
[2024-03-16 07:27:21] [32841.196980] x29: ffff800010003da0 x28: ffff6b7b68381cc0 
[2024-03-16 07:27:21] [32841.202421] x27: 00000001007c21d0 x26: ffff800010003e70 
[2024-03-16 07:27:21] [32841.207863] x25: ffff6b78064da350 x24: ffffb6376a456000 
[2024-03-16 07:27:21] [32841.213309] x23: ffffb6376a4626c0 x22: 00000001007c21d0 
[2024-03-16 07:27:21] [32841.218750] x21: 0000000000000101 x20: ffff6b78064da0b0 
[2024-03-16 07:27:21] [32841.224191] x19: ffff6b78064d2350 x18: 0000000000000000 
[2024-03-16 07:27:21] [32841.229625] x17: 0000000000000000 x16: 0000000000000000 
[2024-03-16 07:27:21] [32841.235070] x15: 0000fffffdedca08 x14: 000000000000001e 
[2024-03-16 07:27:21] [32841.240514] x13: 000000000000009a x12: 0000000000000024 
[2024-03-16 07:27:21] [32841.245953] x11: 0000000000000000 x10: 0000000000000001 
[2024-03-16 07:27:21] [32841.251384] x9 : 0400000000000000 x8 : ffff6b7b68381ce8 
[2024-03-16 07:27:21] [32841.256817] x7 : ffff6b7b68381d30 x6 : 00000001fffe0000 
[2024-03-16 07:27:21] [32841.262255] x5 : 0000000000000011 x4 : 0000000000000000 
[2024-03-16 07:27:21] [32841.267687] x3 : 0000000000000001 x2 : 00000000fff6e034 
[2024-03-16 07:27:21] [32841.273127] x1 : 0000000000000000 x0 : 0000000000000001 
[2024-03-16 07:27:21] [32841.278568] Call trace:
[2024-03-16 07:27:21] [32841.281065]  tegra_uart_rx_error_handle_timer+0x5c/0x80
[2024-03-16 07:27:21] [32841.286407]  call_timer_fn+0x3c/0x200
[2024-03-16 07:27:21] [32841.290150]  run_timer_softirq+0x464/0x5d0
[2024-03-16 07:27:21] [32841.294338]  __do_softirq+0x140/0x3e8
[2024-03-16 07:27:21] [32841.298086]  irq_exit+0xc0/0xe0
[2024-03-16 07:27:21] [32841.301288]  __handle_domain_irq+0x74/0xd0
[2024-03-16 07:27:21] [32841.305484]  gic_handle_irq+0x68/0x134
[2024-03-16 07:27:21] [32841.309320]  el1_irq+0xd0/0x180
[2024-03-16 07:27:21] [32841.312532]  cpuidle_enter_state+0xb8/0x410
[2024-03-16 07:27:21] [32841.316806]  cpuidle_enter+0x40/0x60
[2024-03-16 07:27:21] [32841.320465]  call_cpuidle+0x44/0x80
[2024-03-16 07:27:21] [32841.324037]  do_idle+0x208/0x270
[2024-03-16 07:27:21] [32841.327343]  cpu_startup_entry+0x30/0x70
[2024-03-16 07:27:21] [32841.331360]  rest_init+0xdc/0xe8
[2024-03-16 07:27:21] [32841.334664]  arch_call_rest_init+0x18/0x20
[2024-03-16 07:27:21] [32841.338854]  start_kernel+0x514/0x54c
[2024-03-16 07:27:21] [32841.342606] Code: 39786404 f97eba60 9ac42063 8b030000 (b9000002) 
[2024-03-16 07:27:21] [32841.348855] ---[ end trace 0c8830142e224f53 ]---
[2024-03-16 07:27:21] [32841.359155] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[2024-03-16 07:27:21] [32841.366204] SMP: stopping secondary CPUs
[2024-03-16 07:27:21] [32841.370218] Kernel Offset: 0x363758730000 from 0xffff800010000000
[2024-03-16 07:27:21] [32841.376450] PHYS_OFFSET: 0xffff948900000000
[2024-03-16 07:27:21] [32841.380726] CPU features: 0x0040006,4a80aa38
[2024-03-16 07:27:21] [32841.385098] Memory Limit: none
[2024-03-16 07:27:21] [32841.393779] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
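
As a reading aid for the oops above, here is a minimal Python sketch that splits the logged `ESR` value into its fields and decodes the faulting instruction word shown in parentheses on the `Code:` line. The helper names `decode_esr` and `decode_str_w_imm` are hypothetical and exist only for this illustration; the bit positions follow the AArch64 ESR_ELx layout and the 32-bit `STR` (immediate, unsigned offset) encoding.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields printed in the oops."""
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class: bits [31:26]
        "IL": (esr >> 25) & 0x1,       # 1 = trapped instruction is 32 bits wide
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,       # data aborts only: 1 = write, 0 = read
        "DFSC": iss & 0x3F,            # data fault status code: bits [5:0]
    }


def decode_str_w_imm(insn: int) -> str:
    """Decode a 32-bit STR (immediate, unsigned offset) instruction word.

    Register number 31 (sp / wzr) is not special-cased in this sketch.
    """
    if insn & 0xFFC00000 != 0xB9000000:
        raise ValueError("not a 32-bit STR (unsigned offset) encoding")
    imm12 = (insn >> 10) & 0xFFF       # offset, scaled by the 4-byte access size
    rn = (insn >> 5) & 0x1F            # base register
    rt = insn & 0x1F                   # register being stored
    return f"str w{rt}, [x{rn}, #{imm12 * 4}]"


if __name__ == "__main__":
    print({name: hex(value) for name, value in decode_esr(0x96000044).items()})
    print(decode_str_w_imm(0xB9000002))
```

For the logged values this gives EC = 0x25 (data abort at the current EL), WnR = 1 (a write) and DFSC = 0x04 (translation fault, level 0), and the instruction decodes to `str w2, [x0, #0]`. With `x0 : 0000000000000001` in the register dump, that matches the reported fault address of 0x1.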

minicom_log20240316.txt (365.4 KB)

Hi sorry.shao,

Are you using the devkit or custom board?
Have you also verified with the latest JP5.1.3(L4T R35.5.0)?

Do you run any application for UART before hitting this kernel panic?

It is a custom board.
I tried 5.1.2.

Our application sends 8-bit data.

Could you verify with the latest JP5.1.3(R35.5.0)?

and also share your UART application for us to reproduce it locally.

The JP5.1.3 kernel driver has not been adapted to our own custom board.

Do you have the devkit to verify with R35.5.0?

Please share the detailed steps how to run UART application.
(i.e. please provide the UART application and also the commands you used to set them up)

The device is at the customer's factory, so we cannot use a devkit there.

The devkit is used for debug purpose only.

Please also share your UART application and the usage for further check.

We do not have a devkit on hand either, and we need RS-232 and four Ethernet ports.

It seems we don’t have these designs on the devkit by default.

Do you mean that the issue is specific to your custom carrier board?

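Is there a register we can read to check the UART status? We replaced the UART driver with the one from 5.1.2. Although the NULL pointer dereference no longer occurs, errors are still printed continuously, dozens of times per second.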

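One more question: RTS and CTS are not connected on our device's serial port. Could that cause the receiver FIFO to fill up so that it can no longer receive data?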
Screenshot from 2024-04-17 15-40-30

If your UART application does not enable HW flow control, whether RTS/CTS are connected basically does not affect your data transfer.
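
One way to confirm whether hardware flow control is actually enabled on a port is to read its termios settings and test the `CRTSCTS` flag. The sketch below assumes a Linux system; the function names are hypothetical, and the device path is whatever node your UART appears as (for example `/dev/ttyTHS0`, which is an assumption and not stated in this thread).

```python
import os
import sys
import termios


def hw_flow_control_enabled(cflag: int) -> bool:
    """Return True if the RTS/CTS (CRTSCTS) bit is set in a termios c_cflag."""
    return bool(cflag & termios.CRTSCTS)


def port_uses_rtscts(path: str) -> bool:
    """Open a tty device node and report whether CRTSCTS is currently set."""
    # O_NONBLOCK avoids waiting for carrier detect while opening the port.
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        cflag = termios.tcgetattr(fd)[2]   # index 2 of the attribute list is c_cflag
    finally:
        os.close(fd)
    return hw_flow_control_enabled(cflag)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: check_rtscts.py <tty device path>")
    else:
        print("CRTSCTS enabled:", port_uses_rtscts(sys.argv[1]))
```

If this reports that `CRTSCTS` is not set, the unconnected RTS/CTS lines should not matter for the transfer, in line with the answer above.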

Understood, so it is unrelated to that. Thank you.

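Our testing found the following.

We ran an RS-232 test on AGX Orin and Orin NX with the cutecom serial tool. The steps were:

1. The cutecom serial tool only sends data and does not receive data.

2. The other end of the RS-232 link is connected to a PC, which sends data continuously: hexadecimal data such as A0 FF 02 07, 100 times in 1S.

With this test, the platform keeps reporting kernel errors, or it reboots because of the NULL pointer dereference in the serial driver.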

What does "1S" refer to?

How are you sending the data?
If you configure the UART with the stty command and send data with echo, do you also hit the kernel panic?
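
For reference, a rough Python equivalent of "configure with stty, then send with echo" is sketched below, using only the standard `termios` module. The 115200 baud rate, the raw 8N1 settings, and the function names are assumptions made for illustration; the thread does not state the port settings in use.

```python
import os
import termios


def raw_8n1(attrs: list, baud: int = termios.B115200) -> list:
    """Return a copy of a termios attribute list set to raw 8N1, no flow control."""
    iflag, oflag, cflag, lflag, _ispeed, _ospeed, cc = attrs
    # Disable input translation and software flow control.
    iflag &= ~(termios.IGNBRK | termios.BRKINT | termios.PARMRK | termios.ISTRIP
               | termios.INLCR | termios.IGNCR | termios.ICRNL
               | termios.IXON | termios.IXOFF)
    # Disable output post-processing, echo, canonical mode and signal characters.
    oflag &= ~termios.OPOST
    lflag &= ~(termios.ECHO | termios.ECHONL | termios.ICANON
               | termios.ISIG | termios.IEXTEN)
    # 8 data bits, no parity, 1 stop bit, no RTS/CTS.
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.CSTOPB | termios.CRTSCTS)
    cflag |= termios.CS8 | termios.CREAD | termios.CLOCAL
    return [iflag, oflag, cflag, lflag, baud, baud, list(cc)]


def send_once(path: str, payload: bytes) -> int:
    """Configure the port at `path` and write `payload` once; return bytes written."""
    # O_NONBLOCK avoids waiting for carrier detect; fine for a short payload.
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        termios.tcsetattr(fd, termios.TCSANOW, raw_8n1(termios.tcgetattr(fd)))
        return os.write(fd, payload)
    finally:
        os.close(fd)
```

A call such as `send_once("/dev/ttyTHS0", bytes.fromhex("A0FF0207"))` would send one frame; the device path here is an assumption and should be replaced with the node used on your board.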

  1. 1S means 1 second: hexadecimal data is sent 100 times per second.
  2. The other end sends data continuously using a software tool on the PC.

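The test steps are as follows.

On AGX Orin, Orin NX, and Xavier NX, we ran an RS-232 test with the cutecom serial tool:
1.	Connect one end of the RS-232 link to a PC and the other end to the AGX Orin or Orin NX device.
2.	The PC sends data continuously: hexadecimal data such as A0 FF 02 07, 100 times per second.
3.	On the AGX Orin or Orin NX device, set cutecom to send only. In practice it sends no data and does not receive data either, as shown in the picture.
After three to four hours of this test, the kernel reports errors, or the device reboots because of the NULL pointer dereference in the serial driver.

The purpose of this test is to create a situation where the hardware UART RX on the Orin board always has incoming data but software never reads it. After a few hours of testing, the device reboots or the kernel reports errors.

The devkit has no RS-232 port, so we cannot try this on the devkit. If there is a way to do it, we can try.
Note: a USB RS-232 adapter does not show this problem.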
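
The PC side of the test steps above can be approximated with a small paced sender. This is a sketch only: the reporter used a PC software tool that is not named in the thread, and `send_frames` is a hypothetical helper that writes the example frame A0 FF 02 07 to an already opened and configured file descriptor at roughly 100 frames per second.

```python
import os
import time

FRAME = bytes.fromhex("A0 FF 02 07")   # example frame quoted in the test steps


def send_frames(fd: int, frame: bytes = FRAME,
                rate_hz: float = 100.0, count: int = 100) -> int:
    """Write `frame` to `fd` `count` times at about `rate_hz` frames per second.

    Returns the total number of bytes written.
    """
    period = 1.0 / rate_hz
    next_deadline = time.monotonic()
    sent = 0
    for _ in range(count):
        sent += os.write(fd, frame)
        # Schedule against absolute deadlines so timing errors do not accumulate.
        next_deadline += period
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return sent
```

On the device under test, the matching condition is simply a serial port that is held open while nothing ever calls `read` on it, which is what the send-only cutecom setting is described as doing.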

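Could you explain what you mean by sending data when in practice there is no data?
Which picture are you referring to?

Have you tried shorting TX to RX and continuously sending data from TX to RX? Can the issue be reproduced that way?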

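We tried this on the devkit and the problem occurs there as well. The cutecom settings on the devkit are shown in the screenshot below.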
Screenshot from 2024-04-30 10-13-03

The hardware keeps receiving data, but software never reads it, and that is when this problem occurs.

To sum up, this problem occurs both on the official devkit and on our custom board. After many tests, we now know that it is caused by the tty buffer filling up. The limit is defined as:
#define TTYB_DEFAULT_MEM_LIMIT (640 * 1024UL)
However, the problem still occurs when we increase the tty buffer to 10 times that size. Can you provide a way to clear the tty buffer when the problem occurs?
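
From user space, the standard ways to get rid of unread serial input are to discard it with `tcflush(fd, TCIFLUSH)` or to keep reading it so it never accumulates. The Python sketch below shows both, with hypothetical helper names. Whether either one releases the kernel-side buffer memory counted against `TTYB_DEFAULT_MEM_LIMIT` in this driver is not confirmed in this thread, so treat it as something to try and not as a known fix.

```python
import os
import termios


def discard_unread_input(fd: int) -> None:
    """Ask the tty layer to drop input that was received but not yet read."""
    termios.tcflush(fd, termios.TCIFLUSH)


def drain_available(fd: int, chunk: int = 4096) -> int:
    """Read and throw away whatever is readable right now on a non-blocking fd.

    Returns the number of bytes discarded.
    """
    total = 0
    while True:
        try:
            data = os.read(fd, chunk)
        except BlockingIOError:
            break                      # nothing more to read at the moment
        if not data:
            break                      # end of file / hang-up
        total += len(data)
    return total
```

Calling `drain_available` periodically on a port opened with `O_NONBLOCK` keeps the receive path consumed even when the application does not care about the data, which addresses the condition described above where RX data arrives but is never read.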

We also compared the official Linux 5.10.104 kernel source code with the JetPack 5.1.2 kernel source code and found a difference in how this exception is handled. The official Linux kernel discards the exception without further processing, whereas your driver handles it as follows:
disable interrupt → start error timer → re-enable interrupt. These operations are costly for the system and cause it to stall. Could you judge whether this handling is reasonable? We think this exception handling can be dropped.