Strange problem with the eMMC during boot

Hi All!

One of my TX1s has started showing error messages during boot, and I am having trouble finding the cause. They always look like this:

May 15 09:50:09 seavision-2 systemd[1]: Reached target Sound Card.
May 15 09:50:09 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p14
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p18
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p19
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p2
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p13
May 15 09:50:10 seavision-2 systemd-udevd[287]: Process '/bin/rm /var/lib/alsa/asound.state' failed with exit code 1.
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p10
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p9
May 15 09:50:10 seavision-2 systemd-udevd[279]: Could not generate persistent MAC address for dummy0: No such file or directory
May 15 09:50:10 seavision-2 systemd-udevd[282]: Could not generate persistent MAC address for ip6tnl0: No such file or directory
May 15 09:50:10 seavision-2 kernel: dhd_module_init in
May 15 09:50:10 seavision-2 systemd[1]: dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device: Dev dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device appeared twice with different sysfs paths /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and /sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p8

The only difference between boots is that the offending partition on the eMMC changes. So far I have seen problems with mmcblk0p15, mmcblk0p17 and mmcblk0p19.
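
For anyone reading the log: the unit name in each message is a systemd-escaped path, where `\x2d` stands for a literal `-` and an unescaped `-` stands for `/`. Here is a minimal Python sketch that decodes the unit name and pulls the two partition names out of one message. The function names are made up for this illustration and are not part of any systemd tooling.

```python
import re

# Matches the two sysfs paths in a systemd "appeared twice" message.
PAIR_RE = re.compile(r"appeared twice with different sysfs paths (\S+) and (\S+)")


def unescape_device_unit(unit):
    """Turn a systemd .device unit name back into the device path it stands for."""
    name = unit.rsplit(".", 1)[0]        # drop the ".device" suffix
    name = name.replace("-", "/")        # an unescaped "-" separates path components
    # "\xNN" is the escaped form of the character with hex code NN
    name = re.sub(r"\\x([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), name)
    return "/" + name


def conflicting_partitions(line):
    """Return the two partition names from one log line, or None if it does not match."""
    m = PAIR_RE.search(line)
    if m is None:
        return None
    return tuple(path.rsplit("/", 1)[-1] for path in m.groups())


# Unit name and sysfs paths copied from the log above.
unit = r"dev-disk-by\x2dpartuuid-00000000\x2d0001\x2d0000\x2d6708\x2dbd5b00000000.device"
line = ("Dev " + unit + " appeared twice with different sysfs paths "
        "/sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p15 and "
        "/sys/devices/sdhci-tegra.3/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p14")

print(unescape_device_unit(unit))
print(conflicting_partitions(line))
```

Decoded, every message refers to the same /dev/disk/by-partuuid/00000000-0001-0000-6708-bd5b00000000 link, which systemd sees under more than one partition.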

The system in question is a TX1 on an Auvidea J120 running the standard L4T 28.2 kernel in a very simple rootfs (basically an ubuntu-base).

I have never seen this problem before on a Jetson, and all the other Jetsons we are running (using the same system) don’t report any problems. It is entirely possible that something went very wrong with that particular system, but I am a bit clueless right now about where to start looking.

Can someone point me in a direction to look?

Let’s start with this:

May 15 09:50:10 seavision-2 systemd-udevd[279]: Could not generate persistent MAC address for dummy0: No such file or directory
May 15 09:50:10 seavision-2 systemd-udevd[282]: Could not generate persistent MAC address for ip6tnl0: No such file or directory

Device special files are not real files, but are instead a result of a running driver. Apparently the driver related to this is gone. Because systemd is looking for a MAC address, this implies networking is gone (you can’t run networking setup without network drivers).

Is it correct that you are experimenting with partitions? In R28.x and newer, much of the boot content has become signed, and content that is unsigned or has an incorrect signature is rejected. It isn’t possible to know what is going on without knowing specific details of exactly what is being changed in partitions.

Hi!

The network is working fine. I have neither a dummy driver nor the ip6 tunnel device running (I haven’t seen one of those on a Jetson for ages).

My problem is with the partitions. I really don’t do anything with them. What I do is use a custom rootfs inside an L4T 28.2 distribution, then run apply_binaries.sh and flash.sh to flash the Jetson. So the partition table etc. is as provided by L4T.

Am I missing a step?

What I’m pointing out is that kernel drivers (even if unrelated) are apparently missing. Missing kernel features or device tree content caused those drivers to fail. dummy0 and ip6tnl0 are both virtual, and thus there is probably no device tree related to them…which leaves kernel drivers as missing or misconfigured. I always have to wonder if the base kernel/module setup is installed correctly (it is hard to figure out what goes on if modules are missing…how do you debug a system with part of the kernel missing?).

The error is essentially a complaint about multiple copies in sysfs. Sysfs is itself a reflection in RAM created by various kernel components, e.g., drivers…so I am back to wondering whether something invalid is going on in the kernel config.

However, what do you see from:

sudo gdisk -l /dev/mmcblk0
# And:
lsblk -f
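
Since the systemd messages are about /dev/disk/by-partuuid, it may also help to see which partitions report the same PARTUUID. Here is a rough Python sketch, assuming the text comes from `lsblk -rno NAME,PARTUUID /dev/mmcblk0`. The function name and the sample text are invented for illustration; the sample is not output from a real board.

```python
from collections import defaultdict


def duplicate_partuuids(lsblk_text):
    """Map each PARTUUID seen on more than one device to the device names using it."""
    by_uuid = defaultdict(list)
    for row in lsblk_text.splitlines():
        fields = row.split()
        if len(fields) != 2:          # skip rows that carry no PARTUUID
            continue
        name, partuuid = fields
        by_uuid[partuuid].append(name)
    return {u: names for u, names in by_uuid.items() if len(names) > 1}


# Hypothetical sample, shaped like `lsblk -rno NAME,PARTUUID` output.
sample = """\
mmcblk0
mmcblk0p1 11111111-0001-0000-0000-000000000000
mmcblk0p2 22222222-0001-0000-0000-000000000000
mmcblk0p3 22222222-0001-0000-0000-000000000000
"""

print(duplicate_partuuids(sample))
```

Any PARTUUID that comes back with more than one partition name is a candidate for the "appeared twice" complaint.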

A default TX1 dev kit under R28.2 would show this for gdisk:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34        29859873   14.2 GiB    0700  APP
   2        29859874        29863969   2.0 MiB     0700  TBC
   3        29863970        29872161   4.0 MiB     0700  EBT
   4        29872162        29876257   2.0 MiB     0700  BPF
   5        29876258        29888545   6.0 MiB     0700  WB0
   6        29888546        29896737   4.0 MiB     0700  RP1
   7        29896738        29909025   6.0 MiB     0700  TOS
   8        29909026        29913121   2.0 MiB     0700  EKS
   9        29913122        29917217   2.0 MiB     0700  FX
  10        29917218        30179361   128.0 MiB   0700  BMP
  11        30179362        30220321   20.0 MiB    0700  SOS
  12        30220322        30351393   64.0 MiB    0700  EXI
  13        30351394        30482465   64.0 MiB    0700  LNX
  14        30482466        30490657   4.0 MiB     0700  DTB
  15        30490658        30494753   2.0 MiB     0700  NXT
  16        30494754        30507041   6.0 MiB     0700  MXB
  17        30507042        30519329   6.0 MiB     0700  MXP
  18        30519330        30523425   2.0 MiB     0700  USP
  19        30523426        30777310   124.0 MiB   0700  UDA

A TX2 would be quite different, and shows up like this:

Number  Start (sector)    End (sector)  Size       Code  Name
   1            4097        60047360   28.6 GiB    0700  APP
   2        60047361        60055552   4.0 MiB     0700  mts-bootpack
   3        60055553        60063744   4.0 MiB     0700  mts-bootpack_b
   4        60063745        60064768   512.0 KiB   0700  cpu-bootloader
   5        60064769        60065792   512.0 KiB   0700  cpu-bootloader_b
   6        60065793        60066816   512.0 KiB   0700  bootloader-dtb
   7        60066817        60067840   512.0 KiB   0700  bootloader-dtb_b
   8        60067841        60073984   3.0 MiB     0700  secure-os
   9        60073985        60080128   3.0 MiB     0700  secure-os_b
  10        60080129        60084224   2.0 MiB     0700  eks
  11        60084225        60085432   604.0 KiB   0700  bpmp-fw
  12        60085433        60086640   604.0 KiB   0700  bpmp-fw_b
  13        60086641        60087640   500.0 KiB   0700  bpmp-fw-dtb
  14        60087641        60088640   500.0 KiB   0700  bpmp-fw-dtb_b
  15        60088641        60092736   2.0 MiB     0700  sce-fw
  16        60092737        60096832   2.0 MiB     0700  sce-fw_b
  17        60096833        60109120   6.0 MiB     0700  sc7
  18        60109121        60121408   6.0 MiB     0700  sc7_b
  19        60121409        60125504   2.0 MiB     0700  FBNAME
  20        60125505        60387648   128.0 MiB   0700  BMP
  21        60387649        60649792   128.0 MiB   0700  BMP_b
  22        60649793        60715328   32.0 MiB    0700  SOS
  23        60715329        60780864   32.0 MiB    0700  SOS_b
  24        60780865        60911936   64.0 MiB    0700  kernel
  25        60911937        61043008   64.0 MiB    0700  kernel_b
  26        61043009        61044032   512.0 KiB   0700  kernel-dtb
  27        61044033        61045056   512.0 KiB   0700  kernel-dtb_b
  28        61045057        61569344   256.0 MiB   0700  CAC

You are absolutely right with regard to the kernel and the device drivers. Since I don’t use the dummy driver and IPv6 tunnel, I never noticed that something could be off. Maybe I am a bit too lax when it comes to the kernel complaining about stuff I don’t use; it might be a relic from my old 2.0.x kernel development times…

A quick check on that front shows that IPv6 and the dummy driver are both compiled directly into the kernel.

# zgrep CONFIG_DUMMY /proc/config.gz
# CONFIG_DUMMY_IRQ is not set
CONFIG_DUMMY=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25

and

zgrep IPV6 /proc/config.gz
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
CONFIG_IPV6_MIP6=y
# CONFIG_IPV6_ILA is not set
# CONFIG_IPV6_VTI is not set
CONFIG_IPV6_SIT=y
# CONFIG_IPV6_SIT_6RD is not set
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=y
# CONFIG_IPV6_GRE is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
# CONFIG_IPV6_SUBTREES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IP_VS_IPV6 is not set
CONFIG_NF_DEFRAG_IPV6=y
CONFIG_NF_CONNTRACK_IPV6=y
# CONFIG_NF_DUP_IPV6 is not set
CONFIG_NF_REJECT_IPV6=y
# CONFIG_NF_LOG_IPV6 is not set
# CONFIG_NF_NAT_IPV6 is not set
# CONFIG_IP6_NF_MATCH_IPV6HEADER is not set

Both are built directly into the kernel rather than as modules. So I have to dig deeper into why udevd doesn’t find the driver.

With regard to the partition problem, here are the two outputs:
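
The check above can also be scripted. Here is a small Python sketch, assuming the kernel exposes /proc/config.gz as it does here. `option_state` and `read_running_config` are hypothetical helper names, and the sample lines are copied from the zgrep output above.

```python
import gzip


def option_state(config_text, option):
    """Return 'built-in', 'module', 'not set', 'absent', or the raw value of an option."""
    for row in config_text.splitlines():
        row = row.strip()
        if row == "# " + option + " is not set":
            return "not set"
        if row.startswith(option + "="):
            value = row.split("=", 1)[1]
            if value == "y":
                return "built-in"
            if value == "m":
                return "module"
            return value      # e.g. CONFIG_DUMMY_CONSOLE_COLUMNS=80
    return "absent"


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's config; needs CONFIG_IKCONFIG_PROC in that kernel."""
    with gzip.open(path, "rt") as fh:
        return fh.read()


# Lines copied from the zgrep output above.
sample = """\
# CONFIG_DUMMY_IRQ is not set
CONFIG_DUMMY=y
CONFIG_IPV6_TUNNEL=y
# CONFIG_IPV6_GRE is not set
"""

for opt in ("CONFIG_DUMMY", "CONFIG_IPV6_TUNNEL", "CONFIG_IPV6_GRE"):
    print(opt, option_state(sample, opt))
```

On the board itself, `option_state(read_running_config(), "CONFIG_DUMMY")` would do the same lookup against the live config.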

# gdisk -l /dev/mmcblk0
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/mmcblk0: 30777344 sectors, 14.7 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 00000000-0000-0000-0000-000000000000
Partition table holds up to 19 entries
First usable sector is 34, last usable sector is 30777311
Partitions will be aligned on 2-sector boundaries
Total free space is 1 sectors (512 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34        29360161   14.0 GiB    0700  APP
   2        29360162        29364257   2.0 MiB     0700  TBC
   3        29364258        29372449   4.0 MiB     0700  EBT
   4        29372450        29376545   2.0 MiB     0700  BPF
   5        29376546        29388833   6.0 MiB     0700  WB0
   6        29388834        29397025   4.0 MiB     0700  RP1
   7        29397026        29409313   6.0 MiB     0700  TOS
   8        29409314        29413409   2.0 MiB     0700  EKS
   9        29413410        29417505   2.0 MiB     0700  FX
  10        29417506        29679649   128.0 MiB   0700  BMP
  11        29679650        29720609   20.0 MiB    0700  SOS
  12        29720610        29851681   64.0 MiB    0700  EXI
  13        29851682        29982753   64.0 MiB    0700  LNX
  14        29982754        29990945   4.0 MiB     0700  DTB
  15        29990946        29995041   2.0 MiB     0700  NXT
  16        29995042        30007329   6.0 MiB     0700  MXB
  17        30007330        30019617   6.0 MiB     0700  MXP
  18        30019618        30023713   2.0 MiB     0700  USP
  19        30023714        30777310   368.0 MiB   0700  UDA

# lsblk -f
NAME         FSTYPE LABEL UUID                                 MOUNTPOINT
sda
└─sda1       ext4         b36f216f-ec68-4ff7-bfcc-bc2ad1861019 /XXXXX
mmcblk0rpmb
mmcblk0
├─mmcblk0p1  ext4         1b572b3d-0658-4ec1-8df3-9b449537e01f /
├─mmcblk0p2
├─mmcblk0p3
├─mmcblk0p4
├─mmcblk0p5
├─mmcblk0p6
├─mmcblk0p7
├─mmcblk0p8
├─mmcblk0p9
├─mmcblk0p10
├─mmcblk0p11
├─mmcblk0p12
├─mmcblk0p13
├─mmcblk0p14
├─mmcblk0p15
├─mmcblk0p16
├─mmcblk0p17
├─mmcblk0p18
└─mmcblk0p19

Now I wonder a bit where the small differences in the table come from, since I really didn’t change anything in that regard. It is pure L4T 28.2 for a TX1. So why is my APP partition 200MB smaller and consequently the user data partition at the end 200MB larger? My flash environment is on an external SSD I can’t access right now, so I will need to have a look into that tomorrow…

I have not examined what kernel features are required for dummy0, but if the features you listed are indeed the requirements for the kernel side of dummy0, then it implies something in the boot environment itself is missing (something set up by systemd/init steps since there is no related hardware and thus it is unlikely anything device tree got in the way…but you never know, there might be an inheritance from a step which was a hardware setup step).

On the other hand, the “DUMMY” configs found tend to imply some sort of console, and perhaps this is unrelated to networking dummy devices. Don’t know since I haven’t actually compared the kernel config items to what is required for the failed networking items.

When a driver does not find its hardware, this can sometimes be due to missing or incorrect firmware. The firmware essentially changes the nature of the hardware, and changing a driver API could imply the need to change to a matching new firmware. Not all hardware uses firmware, but wireless networking does more often than not (vendors who don’t use firmware must create new hardware to support different regulations throughout the different political regions of the world…else they can only sell the hardware in one location…this makes wireless firmware quite popular).

The size difference of APP is because during command line flash I specifically set the maximum possible APP size. When I flashed I added “-S 14580MiB”, whereas I think the default is “-S 14GiB”. That is “14580*1024*1024” versus “14*1024*1024*1024” (the byte difference is 255852544 bytes, or 244MiB). The important thing is the size of the non-rootfs partitions…these are the ones used for boot. These other partitions appear to match in size.

sudo ./flash.sh -S 14580MiB jetson-tx1 mmcblk0p1
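
The arithmetic can be checked against the gdisk listings. Here is a short Python sketch, assuming 512-byte sectors as gdisk reports in this thread. The helper name is made up, and the sector numbers are the APP start/end values from the two TX1 tables.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB
SECTOR = 512  # bytes, per "Logical sector size: 512 bytes" in the gdisk output


def partition_bytes(start_sector, end_sector):
    """Size of a partition whose sector range is inclusive at both ends."""
    return (end_sector - start_sector + 1) * SECTOR


flashed_with_S = 14580 * MIB     # "-S 14580MiB"
assumed_default = 14 * GIB       # "-S 14GiB", which is only assumed to be the default

app_large = partition_bytes(34, 29859873)   # APP in the 14.2 GiB table
app_small = partition_bytes(34, 29360161)   # APP in the 14.0 GiB table

# Byte difference between the two -S values, then the same difference in MiB.
print(flashed_with_S - assumed_default)
print((flashed_with_S - assumed_default) // MIB)
# Whether each APP partition matches its -S value exactly.
print(app_large == flashed_with_S, app_small == assumed_default)
```

Both APP sizes line up with the two -S values, so the difference in the tables is explained by the flash argument rather than by anything on the board.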

Something the partition list does not show is whether the partitions’ signatures are valid. Boot content does need to be signed, but so long as those other partitions were not manipulated (which would change their signatures), the APP (rootfs) partition size change should not be an issue.

Was there any manipulation or change to any of the non-APP partitions?