Fatal fault locks up the Xavier SoC

punitagrawal · February 27, 2019, 8:04am

Hi,

I am hitting a fatal exception on the Drive AGX. The following system state is logged to the kernel log when the exception occurs -

Unsupported trap to 0x7f7ff37000: ISS.ISV = 0

Fatal VM Fault: Failed to decode faulting instruction!

VM State:
VCPU ID: 0x5
PC: 0x7f7fe5db40
SPSR_EL2: 0x20000000
FAR_EL2: 0x7f7ff37000
SPSR_EL1: 0x80000000
GPRs:
x0: 0x000000558441b010  x1: 0x0000007f7ff37000  x2: 0x000000000000ff80
x3: 0x0000000000000001  x4: 0x0000000000000000  x5: 0x00000000000f0000
x6: 0x000000558441b010  x7: 0x0000000000000003  x8: 0x00000000000000de
x9: 0x0000007f7ff269f8  x10: 0x0101010101010101 x11: 0x0000000000000020
x12: 0x00000000000003f3 x13: 0x0000000000000000 x14: 0x0000000000000000
x15: 0x0000007f7ff73000 x16: 0x0000005584419e58 x17: 0x0000007f7fe5da40
x18: 0x0000000000000a03 x19: 0x000000558441b010 x20: 0x000000558440284f
x21: 0x0000000000000003 x22: 0x0000000000010000 x23: 0x0000005584419000
x24: 0x00000000000f0000 x25: 0x0000000000000000 x26: 0x0000007f7ff37000
x27: 0x0000000000010000 x28: 0x0000000000000000 x29: 0x0000007fc926dc00
x30: 0x0000005584400360

Sidekick Stack Trace:
0x177c8
0x19734
0x1876c
0x188c0
0x12070
[00764805] wdt: expired vmid 0
[00824805] wdt: expired vmid 0
[00884805] wdt: expired vmid 0
[00944805] wdt: expired vmid 0
[01004805] wdt: expired vmid 0
[01064805] wdt: expired vmid 0
[01124805] wdt: expired vmid 0
[01184805] wdt: expired vmid 0

The system locks up after the fault and the watchdog is triggered after some time. The issue can be reliably triggered on both Xavier SoCs.
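For anyone reading the dump, here is a minimal Python sketch of how the register values could be interpreted. It assumes the hypervisor prints SPSR_EL2 in the architectural AArch64 layout (M[3:0] in the low bits, NZCV in bits 31:28), which I have not confirmed for this platform. The function names and the shortened sample are made up for illustration. Under that assumption, 0x20000000 would suggest the trapped instruction was running at EL0, and the faulting address in FAR_EL2 also shows up in x1 and x26.

```python
import re

# SPSR_ELx.M[3:0] encodings when M[4] == 0 (AArch64 execution state).
AARCH64_MODES = {0b0000: "EL0t", 0b0100: "EL1t", 0b0101: "EL1h",
                 0b1000: "EL2t", 0b1001: "EL2h"}


def decode_spsr(spsr: int) -> dict:
    """Split an SPSR value into the saved mode and the NZCV flags that are set."""
    is_aarch32 = bool(spsr & (1 << 4))
    mode = "AArch32" if is_aarch32 else AARCH64_MODES.get(spsr & 0xF, "reserved")
    flags = "".join(name for name, bit in zip("NZCV", (31, 30, 29, 28))
                    if spsr & (1 << bit))
    return {"mode": mode, "flags": flags}


def registers_holding(dump: str, address: int) -> list:
    """Return the names of general-purpose registers whose value equals `address`."""
    pairs = re.findall(r"\b(x\d+):\s*(0x[0-9a-fA-F]+)", dump)
    return [name for name, value in pairs if int(value, 16) == address]


# Shortened excerpt of the GPR dump from the log above (illustrative subset).
SAMPLE = """
x0: 0x000000558441b010  x1: 0x0000007f7ff37000  x2: 0x000000000000ff80
x24: 0x00000000000f0000 x25: 0x0000000000000000 x26: 0x0000007f7ff37000
"""

print(decode_spsr(0x20000000))                 # SPSR_EL2 from the log
print(registers_holding(SAMPLE, 0x7f7ff37000)) # FAR_EL2 from the log
```

This only reads the numbers; it does not explain why the hypervisor could not decode the instruction.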

Feel free to ask for any details that might help resolve the issue.

WayneWWW · February 27, 2019, 8:47am

May I know the release version and how to reproduce this issue?

punitagrawal · February 27, 2019, 9:00am

The system is flashed with version 8.0.

I hit the issue when creating users with ansible. Ansible basically runs some commands over ssh to configure the system. I’m a little stumped that this triggers an EL1 exception but am not too familiar with the sources for the Xavier.

Below is a simplified playbook that still hits the issue.

You can run the playbook by issuing -

ansible-playbook -i inventory -l drive add-users.yaml -vvvv

Hope this helps. Let me know if you need anything else.

---
- name: Create users
  hosts: all
  become: yes

  vars:
    users:
      
  tasks:
  - name: Add users
    user:
      name: '{{ item.name }}'
      # Password is intentionally left blank. If needed set it up on
      # the machine directly
      shell: '{{ item.shell }}'
      groups: '{{ item.groups }}'
      append: yes
    with_items:
      '{{ users }}'

WayneWWW · February 27, 2019, 9:48am

The system is flashed with version 8.0. → What does this version 8.0 mean?
Is it the Sdkmanager version? The Linux version?

punitagrawal · February 28, 2019, 1:01am

Since you asked for the release version… I was referring to the NVIDIA DRIVE Software 8.0 mentioned in https://developer.nvidia.com/nvidia-drive-downloads.

Here are the kernel and distro versions.

$ uname -a
Linux tyo-drive01-a 4.9.111-rt76-tegra #1 SMP PREEMPT RT Fri Dec 14 15:45:12 PST 2018 aarch64 aarch64 aarch64 GNU/Linux
$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"

Let me know if you want the version of some specific component to help reproduce the issue.

punitagrawal · March 1, 2019, 2:21am

After some sleuthing, I think I have a decent handle on what is causing the system to lock up.

As part of running a playbook, ansible inspects characteristics of the target system. This is needed to detect the correct platform and issue appropriate commands. Depending on configuration, ansible commands are run with root/sudo privileges as many tasks (e.g., creating a new user) require changes to system files.

As part of the inspection, ansible invokes dmidecode (shipped with Ubuntu 16.04) on the target, which throws the fatal fault reported here. I was able to reproduce the fatal fault by running dmidecode on the Drive AGX. See the run log at the end.
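To illustrate why dmidecode ends up touching physical memory at all: when it falls back to scanning /dev/mem, it looks through the legacy 0xF0000-0xFFFFF window for an SMBIOS/DMI anchor string on 16-byte boundaries. The 0xf0000 and 0x10000 values in x5/x24 and x22/x27 of the register dump are consistent with that window. Below is a rough Python sketch of the scan logic only, run against a synthetic buffer. It never opens /dev/mem, and the function and constant names are mine, not dmidecode's.

```python
LEGACY_SCAN_BASE = 0xF0000   # start of the legacy window scanned for an entry point
LEGACY_SCAN_SIZE = 0x10000   # 64 KiB, i.e. up to physical address 0xFFFFF

# Anchor strings that mark an SMBIOS 3.x, SMBIOS 2.x or legacy DMI entry point.
ANCHORS = (b"_SM3_", b"_SM_", b"_DMI_")


def find_entry_points(window: bytes, base: int = LEGACY_SCAN_BASE) -> list:
    """Return (physical_address, anchor) pairs found on 16-byte boundaries."""
    hits = []
    for offset in range(0, len(window), 16):
        for anchor in ANCHORS:
            if window.startswith(anchor, offset):
                hits.append((base + offset, anchor.decode("ascii")))
                break
    return hits


# Synthetic 64 KiB buffer standing in for the window; no real memory is read.
fake_window = bytearray(LEGACY_SCAN_SIZE)
fake_window[0x120:0x124] = b"_SM_"

print([(hex(addr), name) for addr, name in find_entry_points(bytes(fake_window))])
```

On a platform where nothing sensible is mapped at that physical range, just reading the window through /dev/mem is enough to hit the fault, before any anchor matching happens.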

One way to prevent this kind of crash is to disable /dev/mem or enable some of the stricter checks for access to it in the kernel config.

There are also a few /dev/mem-related fixes upstream that address these kinds of issues. It’s worth looking at backporting them.
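As a quick way to see where a given kernel stands, something like the following Python sketch could report the /dev/mem-related options from a kernel config file. CONFIG_DEVMEM, CONFIG_STRICT_DEVMEM and CONFIG_IO_STRICT_DEVMEM are the upstream option names. The helper name and the sample config text are hypothetical and not taken from the Drive AGX kernel.

```python
DEVMEM_OPTIONS = ("CONFIG_DEVMEM", "CONFIG_STRICT_DEVMEM", "CONFIG_IO_STRICT_DEVMEM")


def devmem_settings(config_text: str) -> dict:
    """Map each /dev/mem-related option to its value, 'n', or 'unset' if absent."""
    settings = {name: "unset" for name in DEVMEM_OPTIONS}
    for line in config_text.splitlines():
        line = line.strip()
        for name in DEVMEM_OPTIONS:
            if line == f"# {name} is not set":
                settings[name] = "n"
            elif line.startswith(f"{name}="):
                settings[name] = line.split("=", 1)[1]
    return settings


# Hypothetical config excerpt, for illustration only.
SAMPLE_CONFIG = """\
CONFIG_DEVMEM=y
# CONFIG_STRICT_DEVMEM is not set
"""

print(devmem_settings(SAMPLE_CONFIG))
```

On a running system the text would typically come from /boot/config-$(uname -r), or from /proc/config.gz if the kernel was built with CONFIG_IKCONFIG_PROC. I have not checked which of these is available on the Drive AGX image.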

$ sudo dmidecode
Scanning /dev/mem for entry point.
Unsupported trap to 0x7f8e5b7000: ISS.ISV = 0


Fatal VM Fault: Failed to decode faulting instruction!

VM State:
VCPU ID: 0x3
PC: 0x7f8e4bab40
SPSR_EL2: 0x20000000
FAR_EL2: 0x7f8e5b7000
SPSR_EL1: 0x80000000
GPRs:
x0: 0x0000005570d2e020  x1: 0x0000007f8e5b7000  x2: 0x000000000000ff80  
x3: 0x0000000000000001  x4: 0x0000000000000000  x5: 0x00000000000f0000  
x6: 0x0000005570d2e020  x7: 0x0000000000000003  x8: 0x00000000000000de  
x9: 0x0000007f8e5839f8  x10: 0x0101010101010101 x11: 0x0000000000000020 
x12: 0x00000000000003f3 x13: 0x0000000000000000 x14: 0x0000000000000000 
x15: 0x0000007f8e5f3000 x16: 0x0000005570d2be58 x17: 0x0000007f8e4baa40 
x18: 0x0000000000000a03 x19: 0x0000005570d2e020 x20: 0x0000005570d1484f 
x21: 0x0000000000000003 x22: 0x0000000000010000 x23: 0x0000005570d2b000 
x24: 0x00000000000f0000 x25: 0x0000000000000000 x26: 0x0000007f8e5b7000 
x27: 0x0000000000010000 x28: 0x0000000000000000 x29: 0x0000007ff2860510 
x30: 0x0000005570d12360 


Sidekick Stack Trace:
0x177c8
0x19734
0x1876c
0x188c0
0x12070

WayneWWW · March 4, 2019, 3:06am

Sorry, I just looked up what dmidecode is and noticed it has the description below.

Dmidecode reports information about your system’s hardware as described in your system BIOS according to the SMBIOS/DMI standard (see a sample output).

I don’t think it is compatible with Drive Xavier because there is no BIOS.

punitagrawal · March 4, 2019, 6:47am

dmidecode is a tool to get SMBIOS information which is available on ARM systems as well. It is part of Ubuntu 16.04 and installed by default on the filesystem the Drive AGX comes with.

The BIOS here refers to firmware interfaces - BIOS in the description is just a holdover from what it started out as. Likewise, assuming that it is safe to write to random memory locations is just irresponsible.

But having said that, there are very few real use cases that need /dev/mem access. Enabling it opens up a glaringly big HOLE in your system security - imagine being able to read/write any location in memory irrespective of what it is being used for.

I’d highly recommend turning it off via kernel config.

WayneWWW · March 4, 2019, 7:10am

Many thanks for pointing this out. I’ll check with the internal team whether we can disable it.
Actually, I don’t think we cover all the 3rd-party tools released in our filesystem.