Custom device fails to restart after sudo reboot -f

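When I run sudo reboot -f to soft-reboot my custom Orin-based device, the Orin refuses to restart and prints the following errors in a loop on the serial console: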

[ 9779.688642] nvgpu: 17000000.ga10b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[ 9779.688677] nvgpu: 17000000.ga10b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[ 9779.688695] nvgpu: 17000000.ga10b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch
[ 9779.689551] CPU:0, Error: cbb-fabric@0x13a00000, irq=33
[ 9779.689558] **************************************
[ 9779.689559] CPU:0, Error:cbb-fabric, Errmon:2
[ 9779.689565] Error Code : TIMEOUT_ERR
[ 9779.689567] Overflow : Multiple TIMEOUT_ERR
[ 9779.689574]
[ 9779.689575] Error Code : TIMEOUT_ERR
[ 9779.689576] MASTER_ID : CCPLEX
[ 9779.689577] Address : 0x2310b0c
[ 9779.689578] Cache : 0x1 – Bufferable
[ 9779.689580] Protection : 0x2 – Unprivileged, Non-Secure, Data Access
[ 9779.689582] Access_Type : Read
[ 9779.689582] Access_ID : 0x10
[ 9779.689584] Fabric : cbb-fabric
[ 9779.689584] Slave_Id : 0x35
[ 9779.689585] Burst_length : 0x0
[ 9779.689586] Burst_type : 0x1
[ 9779.689587] Beat_size : 0x2
[ 9779.689588] VQC : 0x0
[ 9779.689588] GRPSEC : 0x7e
[ 9779.689589] FALCONSEC : 0x0
[ 9779.689591] **************************************
[ 9779.689602] ------------[ cut here ]------------
[ 9779.689604] WARNING: CPU: 0 PID: 76 at drivers/soc/tegra/cbb/tegra234-cbb.c:577 tegra234_cbb_isr+0x120/0x160
[ 9779.689614] Modules linked in: fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter mttcan can_dev lzo_rle lzo_compress can_raw can zram overlay ramoops reed_solomon snd_soc_tegra210_ope snd_soc_tegra186_dspk snd_soc_tegra186_asrc snd_soc_tegra186_arad snd_soc_tegra210_iqc snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra210_sfc snd_soc_tegra_pcm loop aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 sha1_ce snd_soc_spdif_tx snd_soc_tegra_machine_driver pps_gen_gpio heart_ctl userspace_alert nct1008 snd_soc_tegra210_adsp snd_soc_tegra_utils snd_soc_simple_card_utils tegra_bpmp_thermal nvadsp snd_soc_tegra210_ahub tegra210_adma snd_soc_rt5640 snd_hda_codec_hdmi snd_soc_rl6231 ar0233yuv snd_hda_tegra
[ 9779.689674] snd_hda_codec snd_hda_core spi_tegra114 ina3221 pwm_fan nvgpu binfmt_misc nvmap ip_tables x_tables [last unloaded: mtd]

Could you give me some hints about what is going wrong?
I am using JetPack 5.1.0.
The attachment below is the complete output from my serial console.

dmesg.txt (1.3 MB)

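What do you have connected to your board?
It looks like some driver cannot be released properly, which hangs the whole machine.
Also, is a plain boot-up normal?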

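We have five cameras, a lidar, and a millimeter-wave radar connected. Judging from the errors, which driver is failing to release properly? The problem is intermittent and only happens occasionally. A plain reboot (without -f) also fails this way some of the time.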

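Here is some additional information; I don't know whether it will help with troubleshooting.
We also use a watchdog. Heartbeats are sent over DDS; if a certain number of heartbeats are lost, the system is considered abnormal and the watchdog is invoked to reboot it.
We also have a guard daemon that protects our other processes: when one of them is shut down, the daemon starts it again.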
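
The heartbeat rule described above (a number of lost heartbeats means the system is treated as abnormal) can be sketched as follows. This is only an illustration in Python, not the poster's implementation: the class name, the 1-second period, and the threshold of 3 are assumptions, and the DDS transport and the actual watchdog call are left out.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Counts silent heartbeat periods; all names and defaults are hypothetical."""

    period_s: float = 1.0     # assumed heartbeat interval in seconds
    max_missed: int = 3       # assumed number of lost heartbeats tolerated
    last_beat_s: float = 0.0  # timestamp of the most recent heartbeat

    def beat(self, now_s: float) -> None:
        """Record a heartbeat received at time now_s (seconds)."""
        self.last_beat_s = now_s

    def missed(self, now_s: float) -> int:
        """Number of whole heartbeat periods that elapsed without a beat."""
        return max(0, int((now_s - self.last_beat_s) // self.period_s))

    def should_reset(self, now_s: float) -> bool:
        """True once at least max_missed periods have passed in silence."""
        return self.missed(now_s) >= self.max_missed
```

A caller would invoke `beat()` from the heartbeat subscriber and poll `should_reset()` from a timer, triggering the watchdog only when it returns true.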

syzk5
I noticed that the process names above also show up with errors in the serial log I sent you.

If you only plug in the devices and do not start any of the processes you mentioned above, does the shutdown still fail?

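I don't quite understand what you mean by this sentence.
It sounds like you need to kill this daemon before shutting down in order to shut down cleanly?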

With only the devices plugged in and my programs closed, the reboot does not fail.

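The daemon is only there to stop other staff from shutting down our processes improperly. It should not be necessary to kill the daemon before shutting down.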

My understanding is that, whatever the case, when I run sudo reboot my processes should be terminated normally by the reboot.

Can you tell from the serial log which device or program is misbehaving?

Why don't you try this first?

You should know better than I do how those syzk-xxx programs are written.

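Why should the daemon be killed first? After sudo reboot, doesn't the system send a signal to all running processes to notify them that the system is about to restart? That gives each process a chance to clean up and terminate safely, and it should include the daemon as well.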
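
For the shutdown signal to help, a supervising daemon has to handle it and stop restarting its children once it arrives. The following is a minimal Python sketch of that idea, not the actual daemon discussed here: the `Supervisor` class and its methods are hypothetical, and it assumes the shutdown notification arrives as SIGTERM.

```python
import signal


class Supervisor:
    """Hypothetical guard daemon that stops respawning after SIGTERM."""

    def __init__(self) -> None:
        self.shutting_down = False

    def install(self) -> None:
        """Register the handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self._on_term)

    def _on_term(self, signum, frame) -> None:
        # Only set a flag here; the main loop does the real cleanup.
        self.shutting_down = True

    def should_respawn(self) -> bool:
        """Restart an exited child only while no shutdown is in progress."""
        return not self.shutting_down
```

The main loop would call `should_respawn()` each time a child exits, so that children terminated during shutdown stay terminated and release whatever devices they held.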

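If you try it and it doesn't help, we can look at other ways to debug.
I don't know how your programs are written, so this is all I can suggest. I just find the behavior you describe a bit odd.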

Hello, I would also like to ask what causes these errors. Don't they affect the reboot?

And what causes this error: nvgpu: 17000000.ga10b nvgpu_gpu_set_deterministic_ch_railgate:1858 [WRN] cannot busy to restore deterministic ch?

I ask because I noticed that these errors appear before my syzk-* entries.

Does this also happen if you connect only the radar and no camera?
The GPU error looks like it is related to the camera driver/device tree.

Hello. I will shut down the camera program and try again.
What about this error: CPU:0, Error:cbb-fabric, Errmon:2

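I just tried again and found that whenever "Maximum registrations reached" appears in the kernel log, the following reboot fails 100% of the time. The complete serial output is attached.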

[ 2705.446113] Maximum registrations reached
[ 2900.542793] Maximum registrations reached
[ 4289.019506] Maximum registrations reached
[ 4337.853030] Maximum registrations reached
[ 4544.557438] Maximum registrations reached
[ 4737.104732] Maximum registrations reached
[ 4844.293233] Maximum registrations reached
[ 5278.325980] Maximum registrations reached
[ 5437.801013] Maximum registrations reached
[ 5746.435512] Maximum registrations reached
[ 6119.074736] Maximum registrations reached
[ 6435.625756] CPU:0, Error: cbb-fabric@0x13a00000, irq=33
[ 6435.625768] **************************************
[ 6435.625769] CPU:0, Error:cbb-fabric, Errmon:2
[ 6435.625774] Error Code : TIMEOUT_ERR
[ 6435.625776] Overflow : Multiple TIMEOUT_ERR
[ 6435.625783]
[ 6435.625784] Error Code : TIMEOUT_ERR
[ 6435.625785] MASTER_ID : CCPLEX
[ 6435.625786] Address : 0x2310b0c
[ 6435.625787] Cache : 0x1 – Bufferable
[ 6435.625788] Protection : 0x2 – Unprivileged, Non-Secure, Data Access
[ 6435.625790] Access_Type : Read
[ 6435.625791] Access_ID : 0x14
[ 6435.625792] Fabric : cbb-fabric
[ 6435.625793] Slave_Id : 0x35
[ 6435.625794] Burst_length : 0x0
[ 6435.625795] Burst_type : 0x1
[ 6435.625795] Beat_size : 0x2
[ 6435.625796] VQC : 0x0
[ 6435.625797] GRPSEC : 0x7e
[ 6435.625798] FALCONSEC : 0x0
[ 6435.625799] **************************************
[ 6435.625815] WARNING: CPU: 0 PID: 76 at drivers/soc/tegra/cbb/tegra234-cbb.c:577 tegra234_cbb_isr+0x120/0x160
[ 6435.625947] ---[ end trace 0000000000000002 ]---

reboot_test_err01.log (2.0 MB)
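
Since the message reportedly always precedes a failed reboot, a saved log can be checked for it before rebooting. The sketch below is a hedged Python example: the function name is made up, and the pattern assumes the line format shown in the excerpt above (a bracketed kernel timestamp followed by the message).

```python
import re
from typing import List

# Bracketed kernel timestamp such as "[ 2705.446113]" followed by the message.
PATTERN = re.compile(r"^\[\s*(\d+\.\d+)\]\s+Maximum registrations reached\s*$")


def find_registration_overflows(log_text: str) -> List[float]:
    """Return the kernel timestamps (seconds) of every matching log line."""
    stamps = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            stamps.append(float(match.group(1)))
    return stamps
```

Passing the text of a captured serial log to `find_registration_overflows` returns an empty list when the message never occurred, so a non-empty result can serve as a warning that the next reboot is likely to hang.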

Hello, this project is quite urgent and I hope to get your help. Thank you in advance.

This is the EQOS driver. Does your board use RGMII?

Please apply this patch and test it. This looks like an issue that has already been fixed in a later BSP.

diff --git a/drivers/platform/tegra/ptp-notifier.c b/drivers/platform/tegra/ptp-notifier.c
index 9da644b..7fbdabc 100644
--- a/drivers/platform/tegra/ptp-notifier.c
+++ b/drivers/platform/tegra/ptp-notifier.c
@@ -1,7 +1,7 @@
 /*
  * drivers/platform/tegra/ptp-notifier.c
  *
- * Copyright (c) 2018-2022, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2018-2023, NVIDIA CORPORATION.  All rights reserved.
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -104,30 +104,44 @@
 	raw_spin_lock_irqsave(&ptp_notifier_lock, flags);
 	if (!intf_name || !ts) {
 		pr_err("passed Interface_name or time-stamp ptr is NULL");
-		raw_spin_unlock_irqrestore(&ptp_notifier_lock, flags);
-		return -1;
+		ret = -1;
+		goto err_put;
 	}
+
+	/* dev_get_by_name increments the dev reference and requires dev_put */
+
 	dev = dev_get_by_name(&init_net, intf_name);
 
-	if (!dev || !(dev->flags & IFF_UP)) {
-		pr_debug("dev is NULL or intf is not up for %s\n", intf_name);
-		raw_spin_unlock_irqrestore(&ptp_notifier_lock, flags);
-		return -EINVAL;
+	if (!dev) {
+		pr_debug("No device found for %s\n", intf_name);
+		ret = -EINVAL;
+		goto err_put;
 	}
+
+	if (!(dev->flags & IFF_UP)) {
+		pr_debug("interface is not up for %s\n", intf_name);
+		ret = -EINVAL;
+		goto err_put;
+	}
+
 	for (index = 0; index < MAX_MAC_INSTANCES; index++) {
 		if (dev == registered_ndev[index])
 			break;
 	}
 	if (index == MAX_MAC_INSTANCES) {
 		pr_debug("Interface: %s is not registered to get HW time", intf_name);
-		raw_spin_unlock_irqrestore(&ptp_notifier_lock, flags);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto err_put;
 	}
 
 	if (get_systime[index])
 		ret = (get_systime[index])(dev, ts, ts_type);
 	else
 		ret = -EINVAL;
+
+err_put:
+	if (dev)
+		dev_put(dev);
 	raw_spin_unlock_irqrestore(&ptp_notifier_lock, flags);
 
 	return ret;
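
The pattern this patch addresses can be modelled in user space. In the hunk above, the pre-patch function calls dev_get_by_name but no dev_put is visible on any exit path, so each call that finds the device would leave one extra reference behind; the patched version routes every exit through err_put. The Python toy below is only an analogy under that reading: `FakeNetDev`, `get_time_leaky`, and `get_time_fixed` are invented names, and neither the spinlock nor the real kernel reference counting is modelled.

```python
import errno


class FakeNetDev:
    """Toy stand-in for a network device with a reference count."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.refcnt = 0


def dev_get_by_name(devices, name):
    """Look up a device and take a reference, like the kernel helper."""
    dev = devices.get(name)
    if dev is not None:
        dev.refcnt += 1
    return dev


def dev_put(dev):
    """Drop one reference."""
    dev.refcnt -= 1


def get_time_leaky(devices, registered, name):
    """Pre-patch shape: every path returns without dropping the reference."""
    dev = dev_get_by_name(devices, name)
    if dev is None or not dev.up:
        return -errno.EINVAL
    if dev not in registered:
        return -errno.EINVAL
    return 0


def get_time_fixed(devices, registered, name):
    """Post-patch shape: a single exit path always drops the reference."""
    dev = dev_get_by_name(devices, name)
    try:
        if dev is None or not dev.up:
            return -errno.EINVAL
        if dev not in registered:
            return -errno.EINVAL
        return 0
    finally:
        if dev is not None:
            dev_put(dev)
```

In the leaky variant the count grows by one per call and never returns to zero. A network device whose references are never dropped generally cannot be torn down cleanly; whether that is the exact mechanism behind the hang reported here is an assumption.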

Yes, we use the Marvell mv6321 chip.