I hit the issue when creating users with ansible. Ansible basically runs some commands over ssh to configure the system. I’m a little stumped that this triggers an EL1 exception but am not too familiar with the sources for the Xavier.
Below is the simplified playbook, which still hits the issue.
After some sleuthing, I think I have a decent handle on what is causing the system to lock up.
As part of running a playbook, ansible inspects characteristics of the target system. This is needed to detect the correct platform and issue appropriate commands. Depending on configuration, ansible commands are run with root/sudo privileges as many tasks (e.g., create new user) require changes to system files.
As part of the inspection, ansible invokes dmidecode (shipped with ubuntu 16.04) on the target which throws the fatal fault reported here. I was able to reproduce the fatal fault by running dmidecode on Drive AGX. See the runlog at the end.
One way to prevent this kind of crash is to disable /dev/mem, or to enable some of the stricter checks for access to it, in the kernel config.
There are also a few /dev/mem related fixes upstream that address these kinds of issues. It’s worth looking at backporting them.
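To see where a given kernel stands, here is a minimal sketch that reports the /dev/mem related options from a kernel config. The option names (CONFIG_DEVMEM, CONFIG_STRICT_DEVMEM, CONFIG_IO_STRICT_DEVMEM) are the mainline Kconfig symbols; whether the Drive AGX kernel carries all of them is an assumption on my part, as is /proc/config.gz being available (that needs CONFIG_IKCONFIG_PROC). The helper names are made up for illustration.

```python
import gzip
from pathlib import Path

# Mainline Kconfig symbols that govern /dev/mem access.
# Assumption: the target kernel tree uses the same names.
DEVMEM_OPTIONS = ("CONFIG_DEVMEM", "CONFIG_STRICT_DEVMEM", "CONFIG_IO_STRICT_DEVMEM")


def parse_devmem_options(config_text):
    """Return {option: value} for the /dev/mem options; unset options map to 'n'."""
    state = {name: "n" for name in DEVMEM_OPTIONS}
    for line in config_text.splitlines():
        line = line.strip()
        # Skips blank lines and comments such as '# CONFIG_FOO is not set'.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key in state:
            state[key] = value
    return state


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config (assumes CONFIG_IKCONFIG_PROC is enabled)."""
    with gzip.open(path, "rt") as fh:
        return fh.read()


if __name__ == "__main__":
    if Path("/proc/config.gz").exists():
        for name, value in parse_devmem_options(read_running_config()).items():
            print(f"{name}={value}")
    else:
        print("/proc/config.gz not found; check the kernel build's .config instead")
```

If CONFIG_DEVMEM comes back as y and both strict options as n, any root process (dmidecode included) can map arbitrary physical addresses through /dev/mem.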
dmidecode is a tool to get SMBIOS information which is available on ARM systems as well. It is part of Ubuntu 16.04 and installed by default on the filesystem the Drive AGX comes with.
The BIOS here refers to firmware interfaces - "BIOS" in the description is just a hangover from what it started out as. It is the same as assuming that it is safe to write to random memory locations - that is just irresponsible.
But having said that, there are very few real use cases that need /dev/mem access. Enabling it opens up a glaring big HOLE in your system security - imagine being able to read/write to any location in memory irrespective of what it is being used for.
I’d highly recommend turning it off via kernel config.