Failure during reboot from debug interface

Software Version
DRIVE OS 6.0.8.1

Target Operating System
Linux

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-300)

SDK Manager Version
1.9.3.10904

Host Machine Version
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers

Hi, I usually reboot Orin through the debug interface using the two commands “poweroff” and “poweron”. This is quite convenient, especially when working remotely.
Recently, however, when I follow this procedure the board fails to power on most of the time.

This is the output when it powers on successfully:

ERROR: MCU_PLTFPWRMGR: WDT reset (or aurixreset force) triggered...
MCU_FOH: E2E Initialized
MCU_FOH: Initialization done
INFO: MCU_PLTFPWRMGR: Powering up 
INFO: MCU_PLTFPWRMGR: Voltage Monitoring is enabled.
INFO: MCU_PLTFPWRMGR: Send command 'voltageMonitor disable' on console to disable Voltage Monitoring.
INFO: MCU_PLTFPWRMGR: Monitoring for KL30 (G3CH4) started.
INFO: MCU_PLTFPWRMGR: KL30 (G3CH4) Voltage (17937mV) exceeded threshold (6700mV). Continuing...!
INFO: PLTFPWRMGR_IOHWABS: Monitoring for PREREG_5V (G2CH2) started.
INFO: PLTFPWRMGR_IOHWABS: PREREG_5V (G2CH2) Voltage (4944mV) exceeded threshold (4850mV). Continuing...
INFO: MCU_PLTFPWRMGR: Monitoring for PREREG_SENSE_16V (G3CH6) started.
INFO: MCU_PLTFPWRMGR: PREREG_SENSE_16V (G3CH6) Voltage (12151mV) exceeded threshold (6700mV). Continuing...!
INFO: MCU_PLTFPWRMGR: Request Power-up Ethernet Switch done !
INFO: MCU_PltfPwrMgr: Linkup status is not active
INFO: MCU_PltfPwrMgr: 88Q5072 OAK Link Active
INFO:Marvell switch: 0: NW_CFG_BASE
INFO:Marvell switch: Marvell Oak/Spruce configuration completed..
INFO: MCU_PltfPwrMgr: Switch Init
INFO: Marvell Phy: Phy init begin..
INFO: Marvell Phy: Phy init completed..
INFO: MCU_PltfPwrMgr: PHYs Init
INFO: MCU_PltfPwrMgr: StbM Init
INFO: MCU_PLTFPWRMGR: Request Eth initialization done !
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS11-1..........................................................
Check for VRS11-2...........................................................
Check for VRS11-1..........................................................
Check for VRS11-2...........................................................
Check for VRS11-1..........................................................
Check for VRS11-1..........................................................
Check for VRS11-1..........................................................
Check for VRS11-1..........................................................
Check for VRS11-2...........................................................
Check for VRS11-2...........................................................
Check for VRS11-2...........................................................
Check for VRS11-2...........................................................
Check for VRS10............................................................
Check for VRS10............................................................
INFO: NVMCU_ORINPWRCTRL: FUNC_NIRQ continuous monitoring Enabled!
INFO: SftyMon_IoHwAbs: toggle check of local and remote sensor successfull
INFO: NvMCU_OrinTMON: toggle check of local and remote sensor successfull
INFO: SftyMon_IoHwAbs: Board Temperature sensor initialized
INFO: NvMCU_OrinTMON: Orin Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Orin TMON enabled .... 
INFO: MCU_PLTFPWRMGR: Board TMON enabled.
INFO: BtChn_Cfg: Tegra x1 Boot Chain is : A 
INFO: BtChn_Cfg: Bootchain Pin 0xc set to 0
INFO: BtChn_Cfg: Bootchain Pin 0x150 set to 0
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
INFO: NVMCU_ORINPWRCTRL: Tegra reset released
MCU_FOH: MCU FOH : Initiate SOC Error Pin Monitoring & SPI communication
MCU_FOH: MCU FOH : Start Monitoring Initiated
INFO: MCU_PLTFPWRMGR: Power-up sequence is complete !
INFO: CmnIf: Wdg Enabled Success!
INFO: MCU_PLTFPWRMGR: DL Ready drive to HIGH!
Power on the system
INFO : MCU_ISTMGR: IST_DONE pin no monitor time of 2000 ms over.
MCU_FOH: Spi Transmit Started
MCU_FOH: SOC error pin is de-asserted
ERROR: MCU_ERRHANDLER: SOC error pin is de-asserted
MCU_FOH: Periodic Status [0] 0xab [1] 0xcd [250] 0x12 [251] 0x34
INFO: SftyMon_IoHwAbs: PG_VRS11 monitoring started...
INFO : MCU_ISTMGR: IST Manager initialized to send/receive commands 
INFO : IST_TESTAPP: IST Result Ready to fetch
MCU_FOH: SOC error pin is asserted
ERROR: MCU_ERRHANDLER: SOC error pin is asserted
MCU_FOH: ErrReport: ErrorCode-0x28da ReporterId-0xe04c Error_Attribute-0x0 Timestamp-0x146cebf5
INFO : MCU_ISTMGR: IST Manager initialized to send/receive commands 
INFO : IST_TESTAPP: IST Result Ready to fetch
INFO : MCU_ISTMGR: IST Manager initialized to send/receive commands 
INFO : IST_TESTAPP: IST Result Ready to fetch
INFO : MCU_ISTMGR: IST Manager initialized to send/receive commands 
INFO : IST_TESTAPP: IST Result Ready to fetch
INFO: MCU_SWC_FanControl: max_rpm fan2 : 0
INFO: MCU_SWC_FanControl: maxrpm of fan2 has more than 50 percent deviation against rated maxrpm
INFO : MCU_ISTMGR: IST Manager initialized to send/receive commands 
INFO : IST_TESTAPP: IST Result Ready to fetch
INFO : MCU_ISTMGR: IST Manager initialized to send/receive commands 
INFO : IST_TESTAPP: IST Result Ready to fetch

This is the output when it fails:

Info: Executing cmd: poweroff, argc: 0, args: 
NvShell>INFO: MCU_PLTFPWRMGR: Powering off 
INFO: MCU_PLTFPWRMGR: VRS11 PG Monitoring disable.
INFO: NVMCU_ORINPWRCTRL: Wait for Safe Shutdown notification (20s max)
MCU_FOH: SPI : E2E_P05Check Status : 7 : 112
ERROR: MCU_ERRHANDLER: McuFoh : ReporterID - 0x810E ErrorCode - 0x3 
 MCU_FOH: SPI : E2E_P05Check Status : 7 : 0
MCU_FOH: SPI : E2E_P05Check Status : 7 : 0
INFO: MCU_PltfPwrMgr: Linkup status is not active
INFO: MCU_PltfPwrMgr: Linkup status is not active
INFO: MCU_PltfPwrMgr: Ethernet peripherals de-initialized
INFO: MCU_PLTFPWRMGR: DL Ready drive to LOW !
INFO: MCU_PLTFPWRMGR: Orin TMON disable.
INFO: MCU_PLTFPWRMGR: Board TMON disable.
INFO: PLTFPWRMGR_IOHWABS: Power down sequence is complete !
Command Executed
poweron
Info: Executing cmd: poweron, argc: 0, args: 
NvShell>INFO: MCU_PLTFPWRMGR: Powering up 
INFO: MCU_PLTFPWRMGR: Voltage Monitoring is enabled.
INFO: MCU_PLTFPWRMGR: Send command 'voltageMonitor disable' on console to disable Voltage Monitoring.
INFO: MCU_PLTFPWRMGR: Monitoring for KL30 (G3CH4) started.
INFO: MCU_PLTFPWRMGR: KL30 (G3CH4) Voltage (17925mV) exceeded threshold (6700mV). Continuing...!
INFO: PLTFPWRMGR_IOHWABS: Monitoring for PREREG_5V (G2CH2) started.
INFO: PLTFPWRMGR_IOHWABS: PREREG_5V (G2CH2) Voltage (4944mV) exceeded threshold (4850mV). Continuing...
INFO: MCU_PLTFPWRMGR: Monitoring for PREREG_SENSE_16V (G3CH6) started.
INFO: MCU_PLTFPWRMGR: PREREG_SENSE_16V (G3CH6) Voltage (12146mV) exceeded threshold (6700mV). Continuing...!
INFO: MCU_PLTFPWRMGR: Request Power-up Ethernet Switch done !
INFO:Marvell switch: 0: NW_CFG_BASE
INFO:Marvell switch: Marvell Oak/Spruce configuration completed..
INFO: MCU_PltfPwrMgr: Switch Init
INFO: MCU_PltfPwrMgr: Linkup status is not active
INFO: MCU_PltfPwrMgr: 88Q5072 OAK Link Active
INFO: Marvell Phy: Phy init begin..
INFO: Marvell Phy: Phy init completed..
INFO: MCU_PltfPwrMgr: PHYs Init
INFO: MCU_PltfPwrMgr: StbM Init
INFO: MCU_PLTFPWRMGR: Request Eth initialization done !
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS11-1..........................................................
Check for VRS11-2...........................................................
Check for VRS11-1..........................................................
Check for VRS11-2...........................................................
Check for VRS11-1..........................................................
Check for VRS11-1..........................................................
Check for VRS11-1..........................................................
Check for VRS11-1..........................................................
Check for VRS11-2...........................................................
Check for VRS11-2...........................................................
Check for VRS11-2...........................................................
Check for VRS11-2...........................................................
Check for VRS10............................................................
Check for VRS10............................................................
INFO: NVMCU_ORINPWRCTRL: FUNC_NIRQ continuous monitoring Enabled!
INFO: SftyMon_IoHwAbs: toggle check of local and remote sensor successfull
INFO: NvMCU_OrinTMON: toggle check of local and remote sensor successfull
INFO: NvMCU_OrinTMON: Orin Temperature sensor initialized
INFO: SftyMon_IoHwAbs: Board Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Orin TMON enabled .... 
INFO: MCU_PLTFPWRMGR: Board TMON enabled.
INFO: BtChn_Cfg: Tegra x1 Boot Chain is : A 
INFO: BtChn_Cfg: Bootchain Pin 0xc set to 0
INFO: BtChn_Cfg: Bootchain Pin 0x150 set to 0
Check for VRS10............................................................
Check for VRS10............................................................
Check for VRS10............................................................
INFO: NVMCU_ORINPWRCTRL: Tegra reset released
MCU_FOH: MCU FOH : Initiate SOC Error Pin Monitoring & SPI communication
MCU_FOH: MCU FOH : Start Monitoring Initiated
INFO: MCU_PLTFPWRMGR: Power-up sequence is complete !
INFO: CmnIf: Wdg Enabled Success!
INFO: MCU_PLTFPWRMGR: DL Ready drive to HIGH!
Power on the system
Command Executed
INFO : MCU_ISTMGR: IST_DONE pin no monitor time of 2000 ms over.
MCU_FOH: SOC error pin is asserted
ERROR: MCU_ERRHANDLER: SOC error pin is asserted
MCU_FOH: Spi Transmit Started
INFO: SftyMon_IoHwAbs: PG_VRS11 monitoring started...
ERROR: SftyMon_IoHwAbs: VRS11 PGOOD Failure notification...!
ERROR: MCU_ERRHANDLER: SftyMon : ReporterID - 0x8110 ErrorCode - 0x1A 
 ERROR: MCU_ERRHANDLER: PGOOD failure notification...error detected by VRS11
INFO: MCU_SWC_FanControl: max_rpm fan2 : 0
INFO: MCU_SWC_FanControl: maxrpm of fan2 has more than 50 percent deviation against rated maxrpm

Do you have any idea how to solve this? Other ways to reboot Orin would be appreciated too. Thank you.
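
When comparing a good and a bad power-on, it can help to look only at the ERROR lines of each console capture. A minimal sketch, assuming the console output has been saved to a text file; the function name aurix_errors and the file names in the usage note are hypothetical:

```shell
# Print only the ERROR lines from a saved console capture.
# Leading whitespace is allowed because some ERROR lines in the
# logs above start with a stray space.
aurix_errors() {
    grep -E '^[[:space:]]*ERROR:' "$1"
}
```

For example, `aurix_errors poweron_fail.log` on the failing capture above would list the “SOC error pin is asserted” and VRS11 PGOOD lines without the surrounding INFO output.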

Why does the “WDT reset” message appear as the first line? Was it from the previously executed “poweroff” command?

Please execute the ‘version’ command to provide the Aurix version.

Version

Info: Executing cmd: version, argc: 0, args: 
DRIVE-V6.0.8-P3710-AFW-Aurix-StepB-5.12.03
Command Executed

At the moment, when I run poweroff and then poweron, the board no longer powers on successfully, so I cannot provide the output of a successful power-on.

Thus, how did you get the log for the successful power-on? Can you provide the log for a successful power-on after executing the ‘poweron’ command now?

Additionally, could you try to resolve the issue by reflashing the Aurix firmware (as per Flashing AFW from Orin Using Force Update) or by reflashing DRIVE OS?

Thus, how did you get the log for the successful power-on?

At that time it sometimes worked and sometimes didn’t. Now it always fails, so I cannot provide the log for a successful power-on after executing the ‘poweron’ command.

Additionally, could you try to resolve the issue by reflashing the Aurix firmware (as per Flashing AFW from Orin Using Force Update) or by reflashing DRIVE OS?

This is a critical time for us, so I cannot reflash either of them right now.
But I found a solution. I can reboot from the Orin Linux side using the following commands:

sudo su
echo 1 > /sys/class/tegra_hv_pm_ctl/tegra_hv_pm_ctl/device/trigger_sys_reboot
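
For scripted or remote use, the same write can be done without opening an interactive root shell. A minimal sketch, assuming the sysfs node quoted above exists on the running DRIVE OS Linux build; the function name trigger_sys_reboot and its optional path argument are hypothetical and only here for illustration:

```shell
# Sysfs node taken from the commands above; not verified on other releases.
REBOOT_NODE_DEFAULT=/sys/class/tegra_hv_pm_ctl/tegra_hv_pm_ctl/device/trigger_sys_reboot

# Write 1 to the reboot trigger node (optional $1 overrides the path).
trigger_sys_reboot() {
    node="${1:-$REBOOT_NODE_DEFAULT}"
    if [ ! -e "$node" ]; then
        echo "reboot node not found: $node" >&2
        return 1
    fi
    if [ -w "$node" ]; then
        # Already writable (e.g. running as root): write directly.
        echo 1 > "$node"
    else
        # Otherwise let sudo perform the write instead of 'sudo su'.
        echo 1 | sudo tee "$node" > /dev/null
    fi
}
```

Calling `trigger_sys_reboot` with no argument targets the node from this post; the existence check just avoids a confusing shell error on a system where the node is missing.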

Thank you for your help.