Orin tegra_uart_rx_buffer_push Kernel panic - not syncing: Oops: Fatal exception in interrupt on jetpack5.1.2

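Hello, when this issue occurs the device reboots automatically. Details are as follows:

  1. This is a custom carrier board. The panic occurs while driving a GPS device (ZED-F9K) at 460800 baud through the two serial ports ttyTHS0 and ttyTHS1.
  2. The same code does not crash on JP5.0.2; after upgrading to JP5.1.2 it crashes intermittently.
  3. The time to failure is not fixed: as little as half an hour, or as long as four to five hours. The device reboots automatically after the crash.
  4. The stack trace at the time of the crash is as follows: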
[ 2684.892093] serial-tegra 3110000.serial: RxData DMA copy to tty layer failed
[ 2685.410577] serial-tegra 3110000.serial: RxData PIO to tty layer failed
[ 2685.410866] tegra-gpcdma 2600000.gpcdma: slave id already in use
[ 2685.411077] serial-tegra 3110000.serial: Not able to get desc for Rx
[ 2685.411938] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004
[ 2685.412190] Mem abort info:
[ 2685.412277]   ESR = 0x96000004
[ 2685.412373]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 2685.412528]   SET = 0, FnV = 0
[ 2685.412624]   EA = 0, S1PTW = 0
[ 2685.412711] Data abort info:
[ 2685.412796]   ISV = 0, ISS = 0x00000004
[ 2685.412907]   CM = 0, WnR = 0
[ 2685.413002] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001287eb000
[ 2685.413183] [0000000000000004] pgd=0000000000000000, p4d=0000000000000000
[ 2685.413371] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 2685.413530] Modules linked in: nvidia_modeset(OE) fuse(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) xt_addrtype(E) iptable_filter(E) br_netfilter(E) lzo_rle(E) lzo_compress(E) zram(E) overlay(E) ramoops(E) reed_solomon(E) realtek(E) snd_soc_tegra186_asrc(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra210_iqc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_dmic(E) aes_ce_blk(E) crypto_simd(E) snd_soc_tegra210_adx(E) cryptd(E) aes_ce_cipher(E) mttcan(E) ghash_ce(E) input_leds(E) cdc_acm(E) snd_soc_tegra210_amx(E) sha2_ce(E) snd_soc_tegra210_admaif(E) can_dev(E) snd_hda_codec_hdmi(E) snd_soc_tegra210_sfc(E) snd_soc_tegra210_i2s(E) snd_soc_tegra210_mixer(E) sha256_arm64(E) snd_soc_tegra210_adsp(E) snd_soc_tegra_pcm(E) ucsi_ccg(E) snd_hda_tegra(E) can_raw(E) sha1_ce(E) snd_soc_tegra_machine_driver(E) snd_hda_codec(E)
[ 2685.413676]  snd_soc_tegra_utils(E) typec_ucsi(E) can(E) r8168(E) snd_soc_simple_card_utils(E) snd_soc_spdif_tx(E) nvadsp(E) typec(E) snd_soc_tegra210_ahub(E) snd_hda_core(E) tegra_bpmp_thermal(E) snd_soc_rt5640(E) nct1008(E) i2c_nvvrs11(E) userspace_alert(E) tegra210_adma(E) snd_soc_rl6231(E) spi_tegra114(E) nvidia(OE) loop(E) ina3221(E) pwm_fan(E) binfmt_misc(E) nvgpu(E) nvmap(E) ip_tables(E) x_tables(E) [last unloaded: mtd]
[ 2685.518781] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OE     5.10.120-tegra #7
[ 2685.526912] Hardware name: Unknown Jetson AGX Orin Developer Kit/Jetson AGX Orin Developer Kit, BIOS 202210.3-52cefd4-dirty 08/31/2023
[ 2685.539252] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[ 2685.545305] pc : tegra_uart_rx_buffer_push+0x38/0x190
[ 2685.550364] lr : tegra_uart_rx_buffer_push+0x30/0x190
[ 2685.555613] sp : ffff800010003d80
[ 2685.559025] x29: ffff800010003d80 x28: 0000000000000002 
[ 2685.564536] x27: 0000000000000005 x26: ffff65efc39ca1b0 
[ 2685.570050] x25: ffffffffffffffca x24: 0000000000000001 
[ 2685.575561] x23: 0000000000000005 x22: ffff65efc39ca1b0 
[ 2685.581073] x21: ffff65f015f42400 x20: 0000000000000ff0 
[ 2685.586587] x19: ffff65efc6e66880 x18: 0000000000000000 
[ 2685.592099] x17: 0000000000000000 x16: 0000000000000000 
[ 2685.597523] x15: 0000ffffac9367f8 x14: 0000000000000000 
[ 2685.603036] x13: 0000000000000000 x12: 0000000000000000 
[ 2685.608374] x11: 0000000000000040 x10: ffffac2b5c637b20 
[ 2685.613885] x9 : ffffac2b5c637b18 x8 : ffff65efc04b6018 
[ 2685.619224] x7 : ffff000000000000 x6 : 0000000000000001 
[ 2685.624560] x5 : 0000000000000001 x4 : ffff65f6ecf50140 
[ 2685.629985] x3 : 0000000000000000 x2 : ffffac2b5a8a0d70 
[ 2685.635325] x1 : 0000000000000000 x0 : ffff65f015f42400 
[ 2685.640662] Call trace:
[ 2685.643114]  tegra_uart_rx_buffer_push+0x38/0x190
[ 2685.647664]  tegra_uart_terminate_rx_dma+0x84/0xe0
[ 2685.652475]  tegra_uart_isr+0x41c/0x4a0
[ 2685.656420]  __handle_irq_event_percpu+0x68/0x2a0
[ 2685.660964]  handle_irq_event_percpu+0x40/0xa0
[ 2685.665251]  handle_irq_event+0x50/0xf0
[ 2685.669013]  handle_fasteoi_irq+0xc0/0x170
[ 2685.673214]  generic_handle_irq+0x40/0x60
[ 2685.677239]  __handle_domain_irq+0x70/0xd0
[ 2685.681266]  gic_handle_irq+0x68/0x134
[ 2685.685025]  el1_irq+0xd0/0x180
[ 2685.688181]  cpuidle_enter_state+0xb8/0x410
[ 2685.692201]  cpuidle_enter+0x40/0x60
[ 2685.695703]  call_cpuidle+0x44/0x80
[ 2685.699375]  do_idle+0x208/0x270
[ 2685.702613]  cpu_startup_entry+0x2c/0x70
[ 2685.706384]  rest_init+0xdc/0xe8
[ 2685.709620]  arch_call_rest_init+0x18/0x20
[ 2685.713813]  start_kernel+0x500/0x538
[ 2685.717318] Code: aa1603e0 97ff1f89 f9413a61 aa0003f5 (b9400420) 
[ 2685.723541] ---[ end trace d5681ae00e07a9ef ]---
[ 2685.739818] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 2685.740012] SMP: stopping secondary CPUs
[ 2685.740130] Kernel Offset: 0x2c2b4a6e0000 from 0xffff800010000000
[ 2685.745225] PHYS_OFFSET: 0xffff9a1140000000
[ 2685.749340] CPU features: 0x08040006,4a80aa38
[ 2685.753623] Memory Limit: none
[ 2685.768294] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---

Any suggestions are welcome. Thanks.
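As a cross-check on the "Mem abort info" lines in the oops above, the ESR value can be split into its fields by hand. The following is a minimal sketch; the function name `decode_esr` is made up for illustration, and the bit positions follow the Arm ESR_ELx layout for data aborts.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields the oops prints."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,       # 1 means a 32-bit instruction trapped
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,       # 0 = fault on read, 1 = fault on write
        "dfsc": iss & 0x3F,            # data fault status code
    }


fields = decode_esr(0x96000004)        # value taken from the log above
print({name: hex(value) for name, value in fields.items()})
```

For 0x96000004 this gives EC 0x25 (data abort taken from the current EL), WnR 0 (a read), and DFSC 0x04 (translation fault, level 0), which agrees with the `pgd=0000000000000000` line: the faulting access was a load from an unmapped address near zero.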

Hi lihl,

Please share the full dmesg when the issue occurs for further check.

Would you hit the issue if you don’t use /dev/ttyTHS0 and /dev/ttyTHS1?

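  1. When /dev/ttyTHS0 and /dev/ttyTHS1 are not used, the crash-and-reboot problem does not occur.
  2. Attached is the dmesg log from boot. After boot, running the application that uses ttyTHS0 and ttyTHS1 prints nothing to dmesg. After the application has run for a while (half an hour to several hours, not fixed), the 'Kernel panic' log from the first post is printed and the device reboots. The attached dmesg plus the 'Kernel panic' log in the first post are all of the log output produced while the issue occurs.
    dmesg-log.zip (22.9 KB)

Thanks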

Could you share a block diagram of your connections?
Do you use HW flow control? (i.e., do you connect RTS/CTS?)
Have you tried another baud rate, such as 115200, with the GPS device to see whether the kernel panic still occurs?

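1. Connection: ttyTHS0 and ttyTHS1 on the Orin are each connected to one of the two serial ports of the GPS module (ZED-F9K). Neither port has RTS or CTS wired in hardware; only the TX and RX pins are connected.
2. Hardware flow control: RTS and CTS are not used in hardware, so only software flow control is possible. The error trace says "Fatal exception in interrupt" and the stack ends in serial-tegra-nohw.c; I could not find any patches for that .c file.
3. Other baud rates: I just tested at 38400 baud, reading only the GPS position data sent over ttyTHS1, and it still crashes.
4. Strangely, JP5.0.2 does not crash; the crash appeared only after upgrading to JP5.1.2.

Thanks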
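For reference, the wiring described above (TX/RX only, no RTS/CTS, 460800 baud) corresponds to a port setup roughly like the sketch below. The helper names `raw_no_hw_flow` and `open_gps_port` are hypothetical, and disabling XON/XOFF is an assumption made because the GPS stream is treated as binary here; the thread does not show how the actual application configures the port.

```python
import os
import termios


def raw_no_hw_flow(attrs: list, baud: int = termios.B460800) -> list:
    """Return tcgetattr()-style attrs set for raw 8N1 without RTS/CTS."""
    iflag, oflag, cflag, lflag, _ispeed, _ospeed, cc = attrs
    # Assumption: binary stream, so no XON/XOFF and no input translation.
    iflag &= ~(termios.IXON | termios.IXOFF | termios.IXANY |
               termios.ICRNL | termios.INLCR | termios.IGNCR |
               termios.ISTRIP | termios.BRKINT | termios.PARMRK)
    oflag &= ~termios.OPOST
    lflag &= ~(termios.ICANON | termios.ECHO | termios.ECHOE |
               termios.ISIG | termios.IEXTEN)
    # 8 data bits, no parity, 1 stop bit, hardware flow control off.
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.CSTOPB | termios.CRTSCTS)
    cflag |= termios.CS8 | termios.CREAD | termios.CLOCAL
    return [iflag, oflag, cflag, lflag, baud, baud, list(cc)]


def open_gps_port(path: str = "/dev/ttyTHS1") -> int:
    """Open the UART and apply the settings above; returns the file descriptor."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    termios.tcsetattr(fd, termios.TCSANOW, raw_no_hw_flow(termios.tcgetattr(fd)))
    return fd
```

With flow control off in both directions, nothing tells the GPS module to pause, so any backlog has to be absorbed by the driver and the tty buffer.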

We don’t have this GPS module, so we cannot reproduce the issue locally.
Can you reproduce the same behavior on a devkit with the GPS module?
Or have you tried JP5.1 (R35.2.1) or JP5.1.1 (R35.3.1)?

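We don't have an Orin devkit to check whether the issue reproduces there. The company feels that, having worked our way up from Xavier, we are already familiar enough with the platform, so it will not buy an Orin devkit.
I have not tried JP5.1 or JP5.1.1; the device was upgraded directly from JP5.0.2 to JP5.1.2.

The information you would get by reproducing the issue locally would probably be the same as the 'Call trace' in the first post.
From this line of output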

[ 2685.411938] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000004

I can see that a null pointer caused the crash, and the 'Call trace' contains

[ 2685.643114]  tegra_uart_rx_buffer_push+0x38/0x190

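Could you help check whether the specific line of code can be determined from the offset here?

I have been going through this stack trace but have made no progress yet. If convenient, could you please look into it and narrow down the range of code responsible for this 'NULL pointer' crash?

Thanks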
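One way to turn a frame such as `tegra_uart_rx_buffer_push+0x38/0x190` into a source line is to hand it to `scripts/faddr2line` from the kernel tree, or to gdb's `list *(symbol+offset)`. The sketch below only parses the frame and builds those command lines. It assumes a `vmlinux` built with debug info from exactly the same source and config as the running kernel, and that the driver is built into `vmlinux`; the helper names are made up for illustration.

```python
import re

# symbol+0xOFFSET/0xSIZE, as printed in "pc :" and "Call trace:" lines
FRAME_RE = re.compile(
    r"(?P<sym>[A-Za-z_][\w.]*)\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_frame(line: str):
    """Return (symbol, offset, function_size), or None if no frame is found."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return match["sym"], int(match["off"], 16), int(match["size"], 16)


def lookup_commands(line: str, vmlinux: str = "vmlinux") -> list:
    """Build shell commands that map the frame to a file and line number."""
    sym, off, size = parse_frame(line)
    return [
        f"scripts/faddr2line {vmlinux} {sym}+{off:#x}/{size:#x}",
        f"gdb -batch -ex 'list *({sym}+{off:#x})' {vmlinux}",
    ]


for command in lookup_commands("[ 2685.643114]  tegra_uart_rx_buffer_push+0x38/0x190"):
    print(command)
```

The offset 0x38 is the distance in bytes from the start of the function, and 0x190 is the function's total size, so the fault sits early in `tegra_uart_rx_buffer_push()`.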

I’m checking the call stack of this kernel panic internally to see whether we have encountered a related issue. I will update you once there is any new information.

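Hello,
Is there any news on this "tegra_uart_rx_buffer_push+0x38/0x190" null pointer issue? So far I cannot find which pointer is null, and I cannot even determine which line of code triggers the null pointer dereference. If I want to use the "0x38/0x190" offset to narrow down the investigation, are there any methods, materials, or suggestions? Is disassembly required?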
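Full disassembly is not strictly needed to get a first hint: the `Code:` line of the oops already contains the faulting instruction word in parentheses. The sketch below decodes only the A64 "LDR (immediate, unsigned offset)" form, which is enough for two of the words in this trace; the function names are made up for illustration.

```python
import re


def faulting_word(code_line: str) -> int:
    """Extract the parenthesized (faulting) instruction word from a 'Code:' line."""
    return int(re.search(r"\(([0-9a-f]{8})\)", code_line).group(1), 16)


def decode_ldr_uimm(word: int):
    """Decode a 32/64-bit A64 'LDR Rt, [Rn, #imm]' word; None for anything else."""
    if (word & 0xBFC00000) != 0xB9400000:
        return None
    is64 = bool(word & (1 << 30))          # size field: 64-bit vs 32-bit load
    imm12 = (word >> 10) & 0xFFF           # offset, scaled by the access size
    rn = (word >> 5) & 0x1F                # base register
    rt = word & 0x1F                       # destination register
    base = "sp" if rn == 31 else f"x{rn}"
    dest = ("x" if is64 else "w") + ("zr" if rt == 31 else str(rt))
    return f"ldr {dest}, [{base}, #{imm12 * (8 if is64 else 4)}]"


line = "Code: aa1603e0 97ff1f89 f9413a61 aa0003f5 (b9400420)"
print(decode_ldr_uimm(0xF9413A61))            # loads the pointer
print(decode_ldr_uimm(faulting_word(line)))   # dereferences it at offset 4
```

Read together with the register dump (`x1 : 0000000000000000`), this says a pointer was loaded from `x19 + 624` into `x1`, came back NULL, and the next load from `x1 + 4` faulted at virtual address 4. A 32-bit field at offset 4 of a NULL structure pointer would be consistent with `async_tx_ack()` touching the `flags` member of a NULL DMA descriptor, which in mainline headers follows a 32-bit `cookie`. That mapping is an inference from the numbers, not something confirmed against the JP5.1.2 source; the earlier "Not able to get desc for Rx" message points the same way.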

Could you help check whether the following commit is included in your kernel source?
serial: tegra: fix uart error handler

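Thanks for the reply. I compared against "Diff - 08130f89dcf99cc883c779a125de27ee80d9400d^! - linux-5.10 - Gitiles"
and found that the existing code already includes the changes from this commit.
The JP5.1.2 source I downloaded already contains these changes.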

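Additional observation: changing the (640 * 1024UL) value of "define TTYB_DEFAULT_MEM_LIMIT" in tty_buffer.c directly affects how long it takes for the "NULL pointer" to occur. When I changed it to (64 * 1024UL), the null pointer appeared after about 5 minutes; with the original (640 * 1024UL), it takes about half an hour.
"TTYB_DEFAULT_MEM_LIMIT" is the buffer-full threshold, and this commit changes how RX is handled when the buffer is full. It looks like, seemingly, apparently... the "buffer full" change introduced the "NULL pointer", because JP5.0.2 does not have this crash.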
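As a rough sanity check on those numbers, the time needed to fill the tty buffer limit at a given baud rate can be estimated as below. This is only back-of-the-envelope arithmetic under loud assumptions: 8N1 framing (10 bits on the wire per byte), a line that is saturated continuously, and nothing draining the buffer. Real GPS traffic is probably burstier, and `seconds_to_fill` is a made-up helper name.

```python
def seconds_to_fill(limit_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Seconds for a saturated UART to deliver limit_bytes with no reader draining."""
    bytes_per_second = baud / bits_per_byte   # 8N1: start + 8 data + stop bits
    return limit_bytes / bytes_per_second


for limit in (640 * 1024, 64 * 1024):
    for baud in (460800, 38400):
        print(f"{limit // 1024:4d} KiB at {baud:6d} baud: "
              f"{seconds_to_fill(limit, baud):7.1f} s")
```

Under these assumptions the default limit fills in roughly 14 seconds at 460800 baud and the reduced limit in under 2 seconds, far less than the observed minutes to hours. That gap might mean the limit is only reached when the reader occasionally falls behind, which would fit the intermittent nature of the crash, though the thread does not confirm it.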

Is this still an issue that needs support? Are there any results you can share? Thanks

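This null pointer issue has been resolved. The temporary workaround is as follows:
Following the hints in the Call trace, in "serial-tegra.c" I deleted one line, "async_tx_ack(tup->rx_dma_desc);", from each of tegra_uart_rx_buffer_push() and tegra_uart_terminate_rx_dma().
With this change the null pointer crash no longer occurs. Thanks for the replies.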
