Jetson TX2 CPU errors when running chef-client software

Hello,
We have a cluster of Jetson TX2s that we manage with the Chef.io framework. However, we have had several instances of nodes rebooting when the chef-client software runs on the Jetsons. Upon further inspection, we found errors in kern.log like the following:

Apr 13 15:17:26 basenode kernel: [75064.199185] CPU1: SError detected, daif=140, spsr=0x60000000, mpidr=80000000, esr=be000000
Apr 13 15:17:26 basenode kernel: [75064.199188] CPU2: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000001, esr=be000000
Apr 13 15:17:26 basenode kernel: [75064.199330] ROC:IOB Machine Check Error:
Apr 13 15:17:26 basenode kernel: [75064.199331]         Address Type = Secure DRAM
Apr 13 15:17:26 basenode kernel: [75064.199333]         Address = 0x0 (Unknown Device)
Apr 13 15:17:26 basenode kernel: [75064.199356] **************************************
Apr 13 15:17:26 basenode kernel: [75064.199356] Machine check error in DCC:1:
Apr 13 15:17:26 basenode kernel: [75064.199357]         Status = 0xf400000100000405
Apr 13 15:17:26 basenode kernel: [75064.199357]         Bank does not have any known errors
Apr 13 15:17:26 basenode kernel: [75064.199357]         Overflow (there may be more errors)
Apr 13 15:17:26 basenode kernel: [75064.199358]         Uncorrected (this is fatal)
Apr 13 15:17:26 basenode kernel: [75064.199358]         Error reporting enabled when error arrived
Apr 13 15:17:26 basenode kernel: [75064.199359]         ADDR = 0xb8
Apr 13 15:17:26 basenode kernel: [75064.199389] **************************************
Apr 13 15:17:26 basenode kernel: [75064.199389] CPU1: SError detected, daif=140, spsr=0x60000000, mpidr=80000000, esr=be000000
Apr 13 15:17:26 basenode kernel: [75064.199532] ROC:IOB Machine Check Error:
Apr 13 15:17:26 basenode kernel: [75064.199533]         Address Type = Secure DRAM
Apr 13 15:17:26 basenode kernel: [75064.199536]         Address = 0x0 (Unknown Device)
Apr 13 15:17:26 basenode kernel: [75064.199559] **************************************
Apr 13 15:17:26 basenode kernel: [75064.199559] Machine check error in DCC:1:
Apr 13 15:17:26 basenode kernel: [75064.199559]         Status = 0xf400000100000405
Apr 13 15:17:26 basenode kernel: [75064.199560]         Bank does not have any known errors
Apr 13 15:17:26 basenode kernel: [75064.199560]         Overflow (there may be more errors)
Apr 13 15:17:26 basenode kernel: [75064.199560]         Uncorrected (this is fatal)
Apr 13 15:17:26 basenode kernel: [75064.199561]         Error reporting enabled when error arrived
Apr 13 15:17:26 basenode kernel: [75064.199561]         ADDR = 0xb0
Apr 13 15:17:26 basenode kernel: [75064.199656] **************************************
Apr 13 15:17:26 basenode kernel: [75064.199657] **************************************
Apr 13 15:17:26 basenode kernel: [75064.199657] Machine check error in DCC:1:
Apr 13 15:17:26 basenode kernel: [75064.199657]         Status = 0xf400000100000405
Apr 13 15:17:26 basenode kernel: [75064.199658]         Bank does not have any known errors
Apr 13 15:17:26 basenode kernel: [75064.199658]         Overflow (there may be more errors)
Apr 13 15:17:26 basenode kernel: [75064.199659]         Uncorrected (this is fatal)
Apr 13 15:17:26 basenode kernel: [75064.199659]         Error reporting enabled when error arrived
Apr 13 15:17:26 basenode kernel: [75064.199659]         ADDR = 0xb8
Apr 13 15:17:27 basenode kernel: [75064.199660] **************************************

If we run the chef-client software manually, we can immediately see these errors being written to kern.log, so it definitely happens as a result of running that software. These errors occur on all nodes of our three-node cluster.

chef-client runs an embedded Ruby interpreter under the hood, but I’m not sure how that could result in “Machine check” errors.

Please advise on the best path forward for resolving this issue as it is rather critical to our system build process.

Does this happen on all the nodes, or just some particular hardware?
In general, this looks like damaged hardware or interference.

Also, it shouldn’t be specific to Chef; can you run other Ruby programs that do the same thing, and get the same error?
Do you know which particular step in the Chef execution is correlated with these errors?
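If it helps, a rough way to correlate the two (just a sketch; -l/--log_level and -L are chef-client’s standard logging options, and the log file path here is arbitrary):

# Terminal 1: follow the kernel log while the converge runs
sudo tail -f /var/log/kern.log

# Terminal 2: run a converge with debug logging so each step is timestamped
sudo chef-client -l debug -L /tmp/chef-debug.log

Matching the kern.log timestamps against the debug log should point at the offending resource or plugin.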

These errors happen on all of the nodes. Sometimes they cause a panic and reboot, and other times it appears the errors are recovered. I would also assume a hardware failure if it only happened on a subset of the nodes, but the fact that it happens on all of them is interesting. Also, one of the nodes was purchased last year while the other two were purchased recently (within the last couple of months). I thought it might be a hardware defect in a particular lot, but the fact that they were not all purchased at the same time leads me to believe otherwise.

I have run a simple command using the Chef embedded Ruby interpreter with no errors being reported. I will try to get more information on exactly what is going on during the offending Chef execution.
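For reference, the simple test was along these lines (the /opt/chef path is the usual omnibus install location; adjust if your install differs):

# Sanity check of the interpreter that chef-client ships with
/opt/chef/embedded/bin/ruby -e 'puts RUBY_VERSION'

# A trivial shell-out from that interpreter also ran cleanly
/opt/chef/embedded/bin/ruby -e 'puts `uname -a`'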

I think I have tracked it down to Ohai, the Chef plugin that gathers system information. It makes a shell call to dmidecode. If I run dmidecode directly from a bash shell, I also see the CPU errors dumped to kern.log.
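A minimal reproduction outside of Chef, for anyone else hitting this (nothing Chef-specific; it just mirrors the shell-out Ohai performs):

# Run dmidecode by hand, then check the kernel ring buffer;
# the same SError / machine check lines show up immediately
sudo dmidecode
dmesg | tail -n 40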

Is this kernel patch applicable to this problem? https://devtalk.nvidia.com/default/topic/1003952/dmidecode-crashes-r27-1-on-tx2/

I can run dmidecode on R28.2 without issue:

# dmidecode 
# dmidecode 3.0
Scanning /dev/mem for entry point.
# No SMBIOS nor DMI entry point found, sorry.

A Jetson doesn’t have a BIOS; that is a component of a desktop PC motherboard. I’m unsure whether the Jetson just provides a “stub” or whether there is some alternative implementation, but it seems safe to assume that any output related to BIOS content would be meaningless on this platform. I suspect any patch to prevent the crash is basically an intercept of what dmidecode does, as a way to stop it from ever reading from non-existent hardware. Crashes are fairly typical when reading physical memory addresses for something which does not exist (or for hardware which is the equivalent of missing because it is powered down).
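As a quick sanity check of whether any SMBIOS/DMI tables are exposed at all (assuming a kernel new enough to export them through sysfs; these paths simply won’t exist on a board without DMI):

# On a desktop PC with SMBIOS these exist; on the Jetson they should be absent
ls /sys/firmware/dmi/tables
cat /sys/class/dmi/id/sys_vendor 2>/dev/null || echo "no DMI data exposed"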

I’m surprised they ship dmidecode on ARM. There’s no such thing as a DMI table.
The easiest fix would be to replace the dmidecode binary with an empty file or a shell script that does nothing, if there are other things you need from the Ohai plugin.
Otherwise, just remove that plugin.
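A rough sketch of the stub approach, assuming a Debian/Ubuntu-based rootfs where the package installs /usr/sbin/dmidecode (dpkg-divert keeps the stub from being clobbered by package upgrades); the client.rb setting and the :DMI plugin name are from memory, so check the docs for your Chef/Ohai version:

# Option 1: divert the real binary aside and install a do-nothing stub
sudo dpkg-divert --local --rename --add /usr/sbin/dmidecode
printf '#!/bin/sh\nexit 0\n' | sudo tee /usr/sbin/dmidecode > /dev/null
sudo chmod +x /usr/sbin/dmidecode

# Option 2: leave dmidecode alone and disable Ohai's DMI plugin instead
echo 'ohai.disabled_plugins = [:DMI]' | sudo tee -a /etc/chef/client.rb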

We are also running R28.2. dmidecode doesn’t always cause a panic, but it does always cause CPU errors in kern.log. I’m surprised you don’t see the same behavior, as we see it consistently on three different devices.

I agree. I’m not sure why it is included in the distro repos if it is not applicable to the ARM architecture. For completeness’ sake, we ended up applying the patch referenced in https://devtalk.nvidia.com/default/topic/1003952/dmidecode-crashes-r27-1-on-tx2/ in case another program can trigger the same behavior. Since applying the patch, all has been well with regard to running chef-client on our nodes.

The reason it’s included, even though it doesn’t make sense on the architecture, is that Linux is largely developed by a team of unpaid volunteers.