The Orin NX will experience a link down issue after running for one to two days, and the kernel will crash

Customer custom carrier board, Orin NX, JP 5.1.1

After running for one to two days, the Orin NX hits a link down condition and the kernel crashes.

Background:

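The setup is a B500 16G (Orin NX module) running JP 5.1.1. The B500 device connects through a switch to multiple cameras and one radar. The topology is shown in the figure below: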

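  1. eth0 uses the RTL8111KI-CG on the LCFC development carrier board; eth1 uses the RTL8111HN on the Orin NX module.
  2. After running for 1 to 2 days, a link down occurs. After that the network port is unusable and the kernel crashes with a call trace. Both eth0 and eth1 are affected. The issue also reproduces in the lab, tested at room temperature.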
Sep 21 04:17:46 localhost kernel: [130358.500320] ------------[ cut here ]------------
Sep 21 04:17:46 localhost kernel: [130358.505183] NETDEV WATCHDOG: eth1 (r8168): transmit queue 0 timed out
Sep 21 04:17:46 localhost kernel: [130358.505219] WARNING: CPU: 5 PID: 1824419 at net/sched/sch_generic.c:467 dev_watchdog+0x394/0x3a0
Sep 21 04:17:46 localhost kernel: [130358.514347] Modules linked in: fuse nvidia_modeset(OE) r8168(OE) xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter lzo_rle lzo_compress zram overlay ramoops reed_solomon loop nvgpu snd_soc_tegra186_asrc snd_soc_tegra210_ope snd_soc_tegra210_admaif snd_soc_tegra186_dspk snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra_pcm snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_sfc snd_soc_tegra210_adsp aes_ce_blk crypto_simd snd_soc_tegra_machine_driver cryptd aes_ce_cipher ghash_ce sha2_ce snd_soc_tegra_utils snd_hda_codec_hdmi nvadsp sha256_arm64 sha1_ce snd_hda_tegra snd_hda_codec snd_soc_simple_card_utils snd_soc_spdif_tx pwm_fan snd_soc_tegra210_ahub tegra_bpmp_thermal userspace_alert ina3221 tegra210_adma snd_hda_core nvidia(OE) nv_ox03c10_a
Sep 21 04:17:46 localhost kernel: [130358.514434]  spi_tegra114 binfmt_misc nvmap ip_tables x_tables [last unloaded: mtd]
Sep 21 04:17:46 localhost kernel: [130358.514449] CPU: 5 PID: 1824419 Comm: CPS-LV Tainted: G           OE     5.10.104-tegra #1
Sep 21 04:17:46 localhost kernel: [130358.514450] Hardware name: Unknown NVIDIA Orin NX Developer Kit/NVIDIA Orin NX Developer Kit, BIOS r35.3.1-5e812e4-dirty 11/05/2023
Sep 21 04:17:46 localhost kernel: [130358.514453] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
Sep 21 04:17:46 localhost kernel: [130358.514456] pc : dev_watchdog+0x394/0x3a0
Sep 21 04:17:46 localhost kernel: [130358.514457] lr : dev_watchdog+0x394/0x3a0
Sep 21 04:17:46 localhost kernel: [130358.514459] sp : ffff80001002bd50
Sep 21 04:17:46 localhost kernel: [130358.514460] x29: ffff80001002bd50 x28: 0000000000000004 
Sep 21 04:17:46 localhost kernel: [130358.514462] x27: 0000000000000004 x26: 0000000000000140 
Sep 21 04:17:46 localhost kernel: [130358.514465] x25: ffff7d8b3c3ce440 x24: 00000000ffffffff 
Sep 21 04:17:46 localhost kernel: [130358.514467] x23: ffff7d8b39a603dc x22: ffffdc5d8f706000 
Sep 21 04:17:46 localhost kernel: [130358.514469] x21: ffff7d8b39a60000 x20: ffff7d8b39a60480 
Sep 21 04:17:46 localhost kernel: [130358.514471] x19: 0000000000000000 x18: 0000000000000000 
Sep 21 04:17:46 localhost kernel: [130358.514473] x17: 0000000000000000 x16: ffffdc5d8da03210 
Sep 21 04:17:46 localhost kernel: [130358.514476] x15: ffff7d8b01123f70 x14: ffffffffffffffff 
Sep 21 04:17:46 localhost kernel: [130358.514478] x13: ffffdc5d8fa29de8 x12: ffffdc5d8fa29a38 
Sep 21 04:17:46 localhost kernel: [130358.514480] x11: 0000000000000200 x10: 0000000000000000 
Sep 21 04:17:46 localhost kernel: [130358.514483] x9 : 00000000fffffffe x8 : 2030206575657571 
Sep 21 04:17:46 localhost kernel: [130358.514485] x7 : 2074696d736e6172 x6 : c0000000ffffefff 
Sep 21 04:17:46 localhost kernel: [130358.514487] x5 : ffff7d8e6841e958 x4 : ffffdc5d8f7279a8 
Sep 21 04:17:46 localhost kernel: [130358.514490] x3 : 0000000000000001 x2 : ffff7d8e6841e960 
Sep 21 04:17:46 localhost kernel: [130358.514492] x1 : 0000000000000000 x0 : 0000000000000000 
Sep 21 04:17:46 localhost kernel: [130358.514495] Call trace:
Sep 21 04:17:46 localhost kernel: [130358.514497]  dev_watchdog+0x394/0x3a0
Sep 21 04:17:46 localhost kernel: [130358.514502]  call_timer_fn+0x3c/0x200
Sep 21 04:17:46 localhost kernel: [130358.514504]  run_timer_softirq+0x464/0x5d0
Sep 21 04:17:46 localhost kernel: [130358.514507]  __do_softirq+0x140/0x3e8
Sep 21 04:17:46 localhost kernel: [130358.514511]  irq_exit+0xc0/0xe0
Sep 21 04:17:46 localhost kernel: [130358.514516]  __handle_domain_irq+0x74/0xd0
Sep 21 04:17:46 localhost kernel: [130358.514517]  gic_handle_irq+0x68/0x134
Sep 21 04:17:46 localhost kernel: [130358.514519]  el0_irq_naked+0x4c/0x54
Sep 21 04:17:46 localhost kernel: [130358.514521] ---[ end trace 006a08105f60bcf4 ]---
Sep 21 04:17:46 localhost kernel: [130358.519411] r8168 0008:01:00.0 eth1: Transmit timeout reset Device!
Sep 21 04:17:46 localhost kernel: [130358.544354] r8168 0008:01:00.0 eth1: Device reseting!
Sep 21 04:17:46 localhost systemd-networkd[593]: eth1: Lost carrier
Sep 21 04:17:51 localhost systemd-networkd[593]: eth1: Gained carrier
Sep 21 04:17:51 localhost NetworkManager[604]: <info>  [1758399471.6540] device (eth1): carrier: link connected
Sep 21 04:17:51 localhost kernel: [130363.573254] r8168: eth1: link up
Sep 21 04:17:51 localhost systemd-timesyncd[318]: Network configuration changed, trying to establish connection.
Sep 21 04:18:31 localhost systemd-resolved[805]: Using degraded feature set (UDP) for DNS server 8.8.8.8.
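For anyone triaging a long syslog like the one above, the relevant events can be pulled out mechanically. The following is a minimal Python sketch, assuming the log lines keep the exact format shown in this post; the function and class names are hypothetical and not part of any NVIDIA or Realtek tool.

```python
import re
from dataclasses import dataclass

# Matches the kernel watchdog line, e.g.
# "[130358.505183] NETDEV WATCHDOG: eth1 (r8168): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
)
# Matches systemd-networkd carrier transitions, e.g. "eth1: Lost carrier"
CARRIER_RE = re.compile(
    r"systemd-networkd\[\d+\]: (?P<iface>\S+): (?P<event>Lost|Gained) carrier"
)


@dataclass
class WatchdogEvent:
    uptime_s: float  # kernel timestamp in seconds since boot
    iface: str
    driver: str
    queue: int


def find_watchdog_events(lines):
    """Return one WatchdogEvent per NETDEV WATCHDOG line."""
    events = []
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            events.append(
                WatchdogEvent(float(m["uptime"]), m["iface"], m["driver"], int(m["queue"]))
            )
    return events


def find_carrier_events(lines):
    """Return (iface, "Lost" | "Gained") tuples in log order."""
    return [(m["iface"], m["event"]) for line in lines if (m := CARRIER_RE.search(line))]


# Three lines copied from the log in this post.
SAMPLE_LINES = [
    "Sep 21 04:17:46 localhost kernel: [130358.505183] NETDEV WATCHDOG: eth1 (r8168): transmit queue 0 timed out",
    "Sep 21 04:17:46 localhost systemd-networkd[593]: eth1: Lost carrier",
    "Sep 21 04:17:51 localhost systemd-networkd[593]: eth1: Gained carrier",
]

if __name__ == "__main__":
    for ev in find_watchdog_events(SAMPLE_LINES):
        print(f"{ev.iface} ({ev.driver}) queue {ev.queue} timed out after {ev.uptime_s / 3600:.1f} h uptime")
    print(find_carrier_events(SAMPLE_LINES))
```

The kernel timestamp of 130358 s corresponds to roughly 36 hours of uptime, which is consistent with the reported one to two days before the failure.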

syslog+kernlog.zip (5.6 MB)

The latest release includes an upgraded r8168 driver. Please upgrade the BSP.

We have already updated to the latest r8168 driver.
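One way to confirm which driver is actually bound to an interface, and which module version is loaded, is to read sysfs. This is a minimal Python sketch: it assumes the standard `/sys/class/net/<iface>/device/driver` symlink, and it assumes the r8168 module exposes a `/sys/module/r8168/version` file (only present when the module declares a version). The helper names are hypothetical.

```python
import os
from pathlib import Path


def bound_driver(iface, sysfs_root="/sys"):
    """Return the name of the driver bound to iface, or None if unknown."""
    link = Path(sysfs_root) / "class" / "net" / iface / "device" / "driver"
    try:
        # The symlink target ends with the driver name, e.g. ".../drivers/r8168".
        return os.path.basename(os.readlink(link))
    except OSError:
        return None


def module_version(module, sysfs_root="/sys"):
    """Return the version string a loaded module reports, or None if absent."""
    path = Path(sysfs_root) / "module" / module / "version"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for iface in ("eth0", "eth1"):
        driver = bound_driver(iface)
        version = module_version(driver) if driver else None
        print(f"{iface}: driver={driver} version={version}")
```

Posting this output together with the BSP version makes it easier to tell whether "latest driver" refers to the same build that the newer releases ship.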

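Reply from the RTL NIC vendor: The vendor's AE has reviewed the log. The preliminary view is that the system did not respond to the NIC's TX, so after some time the NIC performs a reset. After the reset the upper layer still does not respond, so the network cannot recover.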

The NIC vendor suspects the issue is caused by the SoC not responding for a long time.
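To watch for the timeout and reset cycle described here without waiting for a full failure, the per-queue `tx_timeout` counters and the interface's `carrier_changes` counter in sysfs can be polled. The sketch below is a hedged Python example, assuming those standard sysfs attributes are present on this kernel; the function names are hypothetical.

```python
from pathlib import Path


def read_counter(path):
    """Read an integer sysfs attribute; return None if missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def link_health(iface, sysfs_root="/sys"):
    """Collect carrier_changes and per-TX-queue tx_timeout counts for iface."""
    base = Path(sysfs_root) / "class" / "net" / iface
    tx_timeouts = {}
    for queue_dir in sorted(base.glob("queues/tx-*")):
        tx_timeouts[queue_dir.name] = read_counter(queue_dir / "tx_timeout")
    return {
        "carrier_changes": read_counter(base / "carrier_changes"),
        "tx_timeout": tx_timeouts,
    }


if __name__ == "__main__":
    for iface in ("eth0", "eth1"):
        print(iface, link_health(iface))
```

Logging these values periodically shows whether the counters climb gradually or jump only at the moment of failure.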

Please check whether you can reproduce the issue on an NVIDIA devkit with JP 5.1.5.

Ethernet Disconnects with “r8168: eth0: link up” Message in kern.log During Deep Stream RTSP Streaming - Jetson & Embedded Systems / Jetson Orin Nano - NVIDIA Developer Forums

I looked through the forum and found similar issues reported there. Which versions currently include a fix for this problem? Thanks.

The patch on that link is to upgrade the r8168 driver.

If your driver is already the latest, then those patches should already be included.

What I mean is: apart from the fix in JP 6.0, has this been fixed in JP 5.1.5 or JP 5.1.4?

Only rel-35.6.1 has the patch for the latest r8168 driver.

So currently only JP 5.1.5 has this patch? Has it been fixed in any other versions, for example JP 6.2?

Both JP 5.1.5 and JP 6.2 include these patches. The other, older releases do not.
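To check where a given board stands relative to rel-35.6.1, the L4T release string can be parsed from `/etc/nv_tegra_release`. This is a minimal Python sketch: it assumes the first line of that file has the usual `# R35 (release), REVISION: 3.1, ...` form, and it only judges the R35 branch, since the thread gives no minimum revision for other branches. The names `parse_l4t_release`, `has_r8168_patch_r35`, and `PATCHED_R35` are hypothetical.

```python
import re

# First line of /etc/nv_tegra_release is assumed to look like:
#   "# R35 (release), REVISION: 3.1, ..."
RELEASE_RE = re.compile(
    r"R(?P<major>\d+) \(release\), REVISION: (?P<minor>\d+)\.(?P<patch>\d+)"
)

# From this thread: only rel-35.6.1 carries the patch on the R35 branch.
PATCHED_R35 = (35, 6, 1)


def parse_l4t_release(text):
    """Return (major, minor, patch) parsed from the release line, or None."""
    m = RELEASE_RE.search(text)
    if not m:
        return None
    return (int(m["major"]), int(m["minor"]), int(m["patch"]))


def has_r8168_patch_r35(release):
    """True/False for R35 releases; None when the branch is not R35 or unknown."""
    if release is None or release[0] != 35:
        return None
    return release >= PATCHED_R35


if __name__ == "__main__":
    try:
        with open("/etc/nv_tegra_release") as f:
            release = parse_l4t_release(f.readline())
    except OSError:
        release = None
    print("L4T release:", release, "patched (R35 only):", has_r8168_patch_r35(release))
```

The board in this report shows r35.3.1 in its hardware name line, so on that basis it predates the patched release.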
