NX Shutdown at 65C

I have a Photon carrier board and a production NX module that shuts down automatically at about 65C.

Before the shutdown, I can see these soctherm reports: OC ALARM 0x00000001. I am using a 12V 10A power supply. I guess it shouldn’t be shutting down at 65C if the carrier board spec says 85C and the NVIDIA spec says 90C. So presumably it’s a power issue.

Could some of the system throttling cause an increase in power consumption which tips the system over the edge with a supply that it already complains is insufficient?

The fan remains at a constant manual speed throughout.

In the past, when I’ve had unwanted shutdown issues, they typically happened at the moment my inference app started. These were resolved by using a voltage at the higher end of the carrier board input band. However, the Photon only supports 12V +5%, so I’d have to find a 12.6V supply (if insufficient voltage is the issue).

[ 571.987684] soctherm: OC ALARM 0x00000001
[ 573.008026] soctherm: OC ALARM 0x00000001
[ 573.905596] FAN rising trip_level:1 cur_temp:59400 trip_temps[2]:60000
[ 574.046187] soctherm: OC ALARM 0x00000001
[ 575.021423] FAN rising trip_level:1 cur_temp:59600 trip_temps[2]:60000
[ 575.098128] soctherm: OC ALARM 0x00000001
[ 576.112792] soctherm: OC ALARM 0x00000001
[ 577.158267] soctherm: OC ALARM 0x00000001
[ 578.170506] soctherm: OC ALARM 0x00000001
[ 578.381247] FAN rising trip_level:1 cur_temp:59550 trip_temps[2]:60000
[ 579.206227] soctherm: OC ALARM 0x00000001
[ 579.501224] FAN rising trip_level:1 cur_temp:59600 trip_temps[2]:60000
[ 580.227430] soctherm: OC ALARM 0x00000001
[ 581.273666] soctherm: OC ALARM 0x00000001
[ 581.741196] FAN rising trip_level:1 cur_temp:59750 trip_temps[2]:60000
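A log like the one above is easier to reason about once it is summarized. The following is a minimal sketch, not an official tool: it counts the OC alarms and pulls out the fan temperature readings. The regular expressions are modeled only on the two message shapes in this excerpt, and treating `cur_temp` as millidegrees Celsius is an assumption based on the usual Linux thermal convention.

```python
import re
from collections import Counter

# Patterns modeled on the two message shapes seen in the excerpt.
# The leading "[" is optional so that truncated lines still match.
OC_RE = re.compile(r"\[?\s*(\d+\.\d+)\]\s+soctherm: OC ALARM (0x[0-9a-fA-F]+)")
FAN_RE = re.compile(r"\[?\s*(\d+\.\d+)\]\s+FAN rising trip_level:(\d+) cur_temp:(\d+)")


def summarize(log_text):
    """Count OC alarms per alarm code and collect (timestamp, temp_C) fan readings."""
    alarms = Counter()
    temps_c = []
    for line in log_text.splitlines():
        m = OC_RE.search(line)
        if m:
            alarms[m.group(2)] += 1
            continue
        m = FAN_RE.search(line)
        if m:
            # Assumption: cur_temp is reported in millidegrees Celsius.
            temps_c.append((float(m.group(1)), int(m.group(3)) / 1000.0))
    return alarms, temps_c


# A few lines copied from the log in this post.
SAMPLE = """\
[ 573.008026] soctherm: OC ALARM 0x00000001
[ 573.905596] FAN rising trip_level:1 cur_temp:59400 trip_temps[2]:60000
[ 574.046187] soctherm: OC ALARM 0x00000001
[ 575.021423] FAN rising trip_level:1 cur_temp:59600 trip_temps[2]:60000
"""

if __name__ == "__main__":
    alarms, temps = summarize(SAMPLE)
    print(dict(alarms))
    for ts, temp in temps:
        print(f"{ts:10.3f}s  {temp:.2f} C")
```

Feeding it the full dmesg output instead of `SAMPLE` would show whether the alarm rate changes as the temperature approaches the 60000 trip point.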

The OC alarm should only throttle the performance of your board. It should not cause a reboot or shutdown.

Please check the tegrastats output from just before it shuts down.
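Because the board loses power abruptly, the last few readings can be lost if they are still sitting in a buffer. One possible way to capture them, sketched here under assumptions, is to timestamp each line and sync it to disk as it arrives. The function name `log_lines`, the script name `log_stats.py`, and the output file name are hypothetical; the script only reads whatever is piped into it.

```python
import os
import sys
import time


def log_lines(lines, path):
    """Append each line to `path` with a wall-clock timestamp, syncing after every write."""
    count = 0
    with open(path, "a", encoding="utf-8") as fh:
        for line in lines:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            fh.write(f"{stamp} {line.rstrip()}\n")
            fh.flush()             # push Python's buffer to the OS
            os.fsync(fh.fileno())  # ask the OS to write it to storage
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: tegrastats | python3 log_stats.py shutdown.log
    log_lines(sys.stdin, sys.argv[1])
```

Syncing on every line is slow in general, but at roughly one line per second the cost is negligible, and it improves the odds that the final reading before the shutdown survives.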

@WayneWWW Here’s the log from tegrastats
shutdown1.txt (12 KB)

Hi, can you capture the system voltage drop at the moment it shuts down? The thermals look fine, since it is only 80C at that moment. I suspect it might be caused by the power supply.

I have now purchased a variable power supply that goes up to 10A. I have set it to 12.22VDC.

With no M.2 video capture card connected to my Jetson, the system ran for 9 hours before I manually stopped it.

However, with an M.2 video capture card connected (I have cards from two different brands), it automatically reboots after 10-20 minutes.

I checked dmesg and can see PCI errors, so I created a forum post to see if we can prevent them: PCIe Bus Error

However, I believe the reboot only occurs once the OC (overcurrent) alarm starts being triggered due to the temperature. Once the OC alarms start, I get PCI errors, and once I get PCI errors, I get a shutdown.

I have also tried using an M.2 extension board to keep the video capture card outside of my enclosure so that it stays cool, but that didn’t help.

This is the last thing in dmesg before it shut down last time.

I’ve turned all of these off, but it still has errors and still shuts down:
pcie_aspm=off pci=nomsi pci=noaer iommu=1 amd_iommu=on ASPCM=off
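To rule out the possibility that some of these settings never reached the running kernel, the list can be compared against `/proc/cmdline`. This is only a sketch: the helper name `missing_params` is made up for illustration, the list is copied as written in this post, and a parameter being present does not prove the kernel recognizes or acts on it.

```python
def missing_params(cmdline, wanted):
    """Return the entries of `wanted` that do not appear on a kernel command line."""
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]


# Parameters exactly as listed in this post.
WANTED = [
    "pcie_aspm=off",
    "pci=nomsi",
    "pci=noaer",
    "iommu=1",
    "amd_iommu=on",
    "ASPCM=off",
]

if __name__ == "__main__":
    try:
        with open("/proc/cmdline", encoding="ascii", errors="replace") as fh:
            running = fh.read()
    except OSError:
        running = ""  # /proc/cmdline is unavailable (for example, not on Linux)
    print("not on the running kernel command line:", missing_params(running, WANTED))
```

An empty list means every parameter was passed at boot; anything printed was not applied to the current boot.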