After power off command, IGX Orin fan runs at full speed

Hi Nvidia:
As the title says, the IGX Orin sometimes fails to power off after a poweroff command.
Normally, the sequence should be:

poweroff command --> SOL prints some poweroff messages --> IGX Orin fan runs at full speed --> system turns off gracefully.

We tested poweroff 50 times, and only 48 attempts powered off successfully. The other 2 went like this:

poweroff command --> SOL prints some poweroff messages --> IGX Orin fan runs at full speed --> fan keeps running indefinitely

Nothing is printed on SOL while the issue is happening.
If further information is needed, please let us know.

Please help.

Many thanks.

Hi jameskuo,

Are you using the IGX developer kit (Non-ERoT) or the IGX Board kit (ERoT)?

Do you run the sudo poweroff test manually?
Please share the full serial console logs in both cases for further check.

Hi Kevin:
Thanks for the reply.

We are using IGX Board kit (ERoT).

We use the top-right menu to select poweroff.
Should that be the same as sudo poweroff?

Sure thing. But the last logs we captured look like this:

Not sure why the logs look like this. We will try to capture another one.

Many Thanks.

It seems to be caused by interference on the serial console.

How did you capture these logs? Through ssh on port 2200?

They should be similar. Have you updated all the firmware (BMC, IGX OS, SMCU, etc.) on your board to IGX SW 1.0?

Hi Kevin:
Thanks for the reply.

These logs were captured from the SOL console on the BMC WebUI. I think it’s the same as port 2200.

Yes. BMC is at version 24, IGX OS at 1.0.3, SMCU at 1.50.14, and CX7 and GPU vBIOS are also updated.

We also got two readable serial log files, attached here:
Power off fail_20240722_1720.txt (23.4 KB)
Power off fail_20240722_1711.txt (59.1 KB)

Many thanks.

Is this the failing log?

Is this the passing log?

Sorry for the confusion. Both of them are failure logs.

Your logs look fine.

I just ran sudo poweroff on my IGX developer kit and it shows logs similar to yours.
After poweroff, the fan runs for a few seconds and then stops.

Is there any difference between the pass and fail logs?
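
If it helps, one way to compare the two captures is to strip timestamps before diffing, so that only real content differences remain. Below is a minimal Python sketch of that idea. The dmesg-style "[ seconds ]" timestamp prefix and the function names are assumptions for illustration; they are not taken from the attached logs.

```python
import difflib
import re

# Assumption: lines may begin with a dmesg-style "[   12.345678] " timestamp.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(lines):
    """Strip the assumed timestamp prefix and trailing whitespace from each line."""
    return [TIMESTAMP.sub("", line.rstrip()) for line in lines]


def diff_logs(pass_lines, fail_lines):
    """Return a unified diff (no context lines) of two logs with timestamps removed."""
    return list(
        difflib.unified_diff(
            normalize(pass_lines),
            normalize(fail_lines),
            fromfile="pass",
            tofile="fail",
            lineterm="",
            n=0,
        )
    )
```

Each argument is a list of lines, for example from open(path).read().splitlines(). An empty result means the two logs differ only in timing.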

Hi Kevin:
Unfortunately, the diff of the two logs does not show any useful information.
Could the power supply be the root cause?
Why does the fan go to full speed after power off?

Are you using the default power supply or your own?
It should turn off after power off, as I’ve verified on my setup.

Hi Kevin:
We are using our own 850 W power supply. We are now running the test with another power supply.
We found that:

  1. In the BMC web UI / Server power operations / Power status, the system power is shown as off. What’s more, if we ssh into the BMC and check the power status via ipmitool chassis power status, it shows that Chassis Power is off.

  2. The PS_ON# pin on the ATX 24-pin connector remains LOW while the issue is happening. It is supposed to go HIGH after a power off action. Other ATX 24-pin rails, such as 3V3 and 12V, are still being supplied. Maybe something is supposed to pull PS_ON# HIGH but fails to do so for some reason?

  3. We found a workaround to turn the IGX Orin off: use the BMC power off feature. When the issue is happening, we can ssh into the BMC and use the ipmitool chassis power off command to turn the IGX Orin off. The PS_ON# pin is pulled HIGH after a few seconds, and the system seems to power off successfully.
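
For anyone who wants to script the workaround in item 3, here is a minimal Python sketch. It assumes ipmitool is available in the shell where the script runs (for example on the BMC itself); the function names and the injectable run parameter are hypothetical and only there to make the sketch easy to try without hardware.

```python
import subprocess


def run_ipmitool(args):
    """Run ipmitool in the current shell (e.g. on the BMC) and return its stdout."""
    result = subprocess.run(
        ["ipmitool", *args], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def bmc_power_off(run=run_ipmitool):
    """Issue the workaround command, then return the reported chassis power status."""
    run(["chassis", "power", "off"])
    return run(["chassis", "power", "status"])
```

Note that, per item 1, the BMC already reports the chassis as off while the fans are stuck, so the returned status text alone does not confirm recovery. PS_ON# or the fans still need to be checked.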

Many thanks.

This seems to be the expected result to me.

Please let me check this with the HW team.

Do you mean the issue is that you cannot power it off successfully by running sudo poweroff?

From the previous serial console log you shared, it seems to power off as expected.
Have you also checked whether there are any error messages in the BMC console by running journalctl -f?

Hi Kevin:

Yes. Sometimes sudo poweroff leaves the fans running non-stop at full speed. The reset button and power button didn’t help at all. Previously, to get out of this state, we just unplugged the power cable and then restarted the whole system. Now we can use the BMC, which should be a gentler way out.

We haven’t checked journalctl -f in the BMC yet. We will try to capture it.

Many thanks.

Which fan do you mean? The one on the module or the one at the front of the case?
What is the failure rate?

Please also share the result of the following command in SMCU console on your board.

# Shell>version

Hi Kevin:
We connected J44 for the CPU fan, J64 for the system fan and JP21 for the CX7 fan.
All of them are running at 100% PWM.

Fail rate is about 1/20. It’s pretty random.
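
As a rough guide for planning the test runs, the arithmetic can be sketched in Python as below. It assumes every power cycle is independent and fails with the same probability, which may not hold on real hardware; the function names are hypothetical.

```python
import math


def cycles_needed(p_fail, confidence=0.95):
    """Cycles needed to see at least one failure with the given confidence,
    assuming independent cycles with a constant failure probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


def chance_of_no_failure(p_fail, cycles):
    """Probability that a run of `cycles` cycles shows no failure at all."""
    return (1.0 - p_fail) ** cycles
```

Under these assumptions, a 1/20 rate needs about 59 cycles for a 95% chance of hitting the issue at least once, and a clean run of 20 cycles would still happen by luck roughly 36% of the time. So a short clean run with a different power supply does not rule it in or out.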

SMCU Version:

Shell>version

DDPX Aurix Serial Console
P3740-QS2
with TLF35584 B/C-Step
SW Version 1.50.03
DRIVE-v6.x.x-E3550-NV-Aurix-IFW-StepB-1.50.3
Patch 0
* 6.0.7.0 release
    Bug 3800122 - VM_EN1 at poweron
    Bug 3955478 - P1 -> P0 fix for 3663
    Bug 3895793 - fix for showvoltages on P3663
*

TC397 Step BD

We used the FW from the Download Center to upgrade the SMCU FW, and it was updated via the BMC Web UI.

Many Thanks.

Hi Kevin:
These are logs from BMC:
BMC_journalctl_abnormal.txt (123.6 KB)
BMC_journalctl_normal.txt (47.7 KB)

This is how we captured the logs from journalctl -f:

  1. The system is powered off. Use a host to ssh into the BMC, clear the terminal, and run journalctl -f in the BMC.

  2. Power on the system via the power button.

  3. Power off the system via the command sudo poweroff.

  4. Collect the log.
    4.a If the system behaves normally (powers down gracefully), the journalctl -f output also stops, so we can copy the log directly.
    4.b If the system behaves abnormally (fan running at full speed), journalctl -f keeps printing Heartbeat messages. We have to press Ctrl+C to stop it and then copy the log out.

  5. After collecting the log, press Ctrl+C to stop journalctl, clear the terminal, and go back to step 1, until we have both normal and abnormal logs.

We briefly scanned the logs. The only difference we observed is that the Heartbeat messages keep coming in the abnormal state, while in the normal state logging just stops.
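
To sort many captured logs automatically, the observation above can be turned into a simple check. This is only a Python sketch: it assumes the relevant lines contain the word "heartbeat" in some form (the exact message text is in the attached logs, not reproduced here), and the threshold of 5 is an arbitrary placeholder.

```python
def trailing_heartbeats(lines, marker="heartbeat"):
    """Count consecutive non-empty lines at the end of the log that contain the marker."""
    count = 0
    for line in reversed(lines):
        if not line.strip():
            continue  # ignore blank lines at the end of the capture
        if marker.lower() in line.lower():
            count += 1
        else:
            break
    return count


def looks_abnormal(lines, threshold=5):
    """Heuristic: a long run of heartbeat lines at the end suggests the stuck state."""
    return trailing_heartbeats(lines) >= threshold
```

Pass the log as a list of lines. The threshold should be tuned against a few known normal and abnormal captures.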

Many thanks.

Thanks for sharing the info.

Could you check the state of the following resistors in both the normal and abnormal cases?
R1272: AURIX_MOD_SHUTDOWN_N_IN
R1263: AURIX_IN_PWR_ON

and also the result of # showport 32 6 (run in SMCU console) in both cases.

You can also check if you can receive the following messages in SMCU console when you hit the issue.

Thermal(1)/Force(1) Shutdown pin deasserted; Poweroff System
Module Poweron disabled, Checking these signals:
Carrier poweron (0) time
System_PowerOff: Carrier PON disabled
ATX Poweron disabled, Checking these signals:
ATX PGD (0) time
System_PowerOff: ATX PGD disabled
Power off the system
Command Executed

Hi Kevin:
Thanks for the reply.
This is the raw log from SMCU:
smcuLogs.txt (2.3 KB)

First, while the OS was running, we checked the port using the showport 32 6 command. It showed:

Shell>showport 32 6
P32.6: Direction = OUT; Push Pull; General-purpose output; State = HIGH 
pad drive = 2 
INPUT Level: cmosAuto
Driver: Medium driver

Then, we ran sudo poweroff in the OS. After a successful power off, the SMCU printed:

Shell>Thermal(1)/Force(1) Shutdown pin deasserted; Poweroff System
Module Poweron disabled, Checking these signals:
Carrier poweron (0) time
System_PowerOff: Carrier PON disabled
ATX Poweron disabled, Checking these signals:
ATX PGD (0) time
System_PowerOff: ATX PGD disabled
Power off the system
Command Executed

Now the OS is off. We ran showport 32 6 again:

Shell>showport 32 6
P32.6: Direction = OUT; Push Pull; General-purpose output; State = LOW 
pad drive = 2 
INPUT Level: cmosAuto
Driver: Medium driver

Next, we started the power on/off testing again.
Note: We found that it becomes more difficult to trigger the issue when the SMCU shell is active.

When the power button is pressed, SMCU printed out:

shell>Power Button pressed < 10s
Poweron Command
is_sys_in_sc7_suspend
System_PowerOn: P3740 Poweron Sequence start
ATX enabled, Checking for ATX_PWRGD.
System_PowerOn: Reading ATX PG status: Passed
Module Poweron enabled, Checking these signals:
Carrier poweron
System_PowerOn: Reading Carrier PON: Passed
SYS_RST_IN is low
FORCE_RECOVERY_N is low
Initialed SPI 0 at mode 1 freq 5000000 
ASSERTION WARNING 'FALSE' in ../src/4_McHal_TC397B/Tricore/Qspi/Std/IfxQspi.c:91 (function 'IfxQspi_calculateExtendedConfigurationValue()')
Set ECU_READY_OUT P11.14 high
Power on the system
Command Executed
Tegra Reset
Command Executed

Then, we turned off the system. While the issue was happening, the fan ran at full speed and the SMCU printed nothing. We then checked showport 32 6 (several times):

Shell>showport 32 6
P32.6: Direction = OUT; Push Pull; General-purpose output; State = HIGH 
pad drive = 2 
INPUT Level: cmosAuto
Driver: Medium driver

Next, we ran the ipmitool chassis power off command in the BMC to shut down the system. The SMCU then printed:

Shell>Power Button pressed >= 10s
Poweroff Command
Module Poweron disabled, Checking these signals:
Carrier poweron (0) time
System_PowerOff: Carrier PON disabled
ATX Poweron disabled, Checking these signals:
ATX PGD (0) time
System_PowerOff: ATX PGD disabled
Power off the system
Command Executed

After the system shut down successfully, showport 32 6 printed:

showport 32 6
P32.6: Direction = OUT; Push Pull; General-purpose output; State = LOW 
pad drive = 2 
INPUT Level: cmosAuto
Driver: Medium driver

These are the logs we captured.
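
To make these comparisons less error-prone when repeating the test many times, the showport output can be parsed with a few lines of Python. This is a sketch based only on the output format shown in this thread; the function names are hypothetical, and reading P32.6 = LOW as "power off completed" is just what we observed here, not documented behaviour.

```python
import re

# Matches e.g. "P32.6: Direction = OUT; Push Pull; ...; State = HIGH"
SHOWPORT = re.compile(
    r"P(\d+)\.(\d+):\s*Direction\s*=\s*(\w+);.*?State\s*=\s*(\w+)",
    re.IGNORECASE,
)


def parse_showport(text):
    """Extract pin name, direction and state from pasted showport output."""
    match = SHOWPORT.search(text)
    if match is None:
        raise ValueError("no showport line found")
    port, pin, direction, state = match.groups()
    return {
        "pin": f"P{port}.{pin}",
        "direction": direction.upper(),
        "state": state.upper(),
    }


def p32_6_reports_off(text):
    """Observation from this thread only: P32.6 read LOW after a completed power off."""
    info = parse_showport(text)
    return info["pin"] == "P32.6" and info["state"] == "LOW"
```

In the captures above, P32.6 was HIGH with the OS running and while the fans were stuck, and LOW after both the successful power off and the BMC power off.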

As for R1272 and R1263, we can’t find these two resistors. Are they located on the back of the board?

Many thanks.

Thanks for testing and sharing the results.

Could you also check the state of P22.7 with the following command, both when you hit the issue and in the normal case?

# showport 22 7

Also, have you enabled SEP on your board?

Hi Kevin:
Thanks for the reply.
We didn’t enable SEP on the board.

About the port 22 7:

We started with many power on/off cycles. When the issue hit and the fan ran at full speed, we used showport 22 7 to check the status. It showed:

P22.7: Direction = IN; State = LOW
pad drive = 2
INPUT Level: cmosAuto
Driver: Medium driver

After the ipmitool chassis power off command in the BMC, the shell printed the same messages as before (Power Button pressed >= 10s …). showport 22 7 showed:

P22.7: Direction = IN; State = LOW
pad drive = 2
INPUT Level: cmosAuto
Driver: Medium driver

We then powered on the system. After the shell printed Power Button pressed < 10s Poweron Command just like before, we ran showport 22 7 again:

P22.7: Direction = IN; state = HIGH
pad drive = 2
INPUT Level: cmos Auto
Driver: Medium driver

Next, we powered off the system. During the power off process, we kept watching showport 22 7. It stayed at State = HIGH until the Poweroff system message was printed. After that message, showport 22 7 reported State = LOW.

Many Thanks.