Hey @WayneWWW, this really makes a lot of sense with the distinction between “mounts rootfs” and “boots from” and your summary of the process. I can see now that although it’s really nice having the rootfs on a bigger disk, updating the kernel down the road could introduce some issues. I’ve attached the boot log from when I had my USB drive attached directly to the Jetson with an exact mirror of my NVMe drive (the only difference on the USB drive is in /boot/extlinux/extlinux.conf, where root=/dev/sda1 instead of root=/dev/nvme0n1p1 or root=/dev/mmcblk0p1). The NVMe drive ultimately had its rootfs mounted instead of the USB drive, and I’m assuming the kernel on the eMMC was booted. Thanks for taking a look into this. I would definitely rather have the kernel and rootfs on the same partition (the “boots from” + “mounts rootfs” team-up).
I was reading through a thread over here and it made me think… based on my review of the UART logs and your explanation, I believe the /boot/extlinux/extlinux.conf file is being read from the Jetson’s built-in eMMC (and not from the USB or NVMe drive like I want). That being said, what would happen if the file was changed so that LINUX and INITRD both pointed to the NVMe, like below? Would this work, or is /dev/nvme0n1p1 not accessible at this point? I’m also wondering whether this is what the /boot/extlinux/extlinux.conf file on the NVMe is supposed to look like in order to properly locate the kernel and ramdisk.
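For reference, the sort of thing I have in mind is roughly the stock L4T entry with the root device switched to the NVMe (an illustrative sketch, not my exact file; the device name and paths are assumed):

# illustrative extlinux.conf entry; root device assumed to be /dev/nvme0n1p1
TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} quiet root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4 console=ttyTCU0,115200n8 console=tty0

My understanding is that LINUX and INITRD are loaded from whatever partition CBoot read the extlinux.conf from, so there may be no way to point them at a different device from here, which is really what I’m asking about.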
Just tested this using the USB drive and it doesn’t have any effect. The USB section still fails with “Cannot open partition kernel”, and the Jetson moves on to “boot from” its built-in eMMC (kernel) and mount the root file system (rootfs) from the NVMe.
Just successfully updated the kernel, bootloader, device tree, and a whole bunch of other stuff following the advice in the response below. I simply hit “Reboot later” after all the updates installed, copied /boot on the NVMe over /boot on the Jetson’s built-in eMMC, and then rebooted the Jetson. Obviously this is not the most ideal path, and I’m still looking forward to us figuring out the boot issues so that the /boot folder (kernel, ramdisk, device tree, and so on) that’s actually booted from can live on the NVMe as well.
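For anyone following along, the copy step was roughly the following (a sketch; I’m assuming the system is running from the NVMe rootfs and that the eMMC APP partition shows up as /dev/mmcblk0p1):

# mount the eMMC APP partition and mirror the freshly updated /boot onto it (device name assumed)
sudo mount /dev/mmcblk0p1 /mnt
sudo rsync -a --delete /boot/ /mnt/boot/
sudo umount /mnt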
Okay, now things are getting really interesting. I just updated the kernel (and other stuff) and then mirrored it to the built-in eMMC as described in my previous message, so the eMMC and the NVMe are both fully updated. Then I decided to plug in the USB drive (old, non-updated kernel and everything else) and reboot the Jetson to see if any of the new updates had fixed the USB problems (unlikely, I guess, since CBoot lives elsewhere and I think it’s the issue?). The boot process starts and the USB section once again shows the standard “Cannot open partition kernel”, but to my surprise, as the boot continues the system repeatedly crashes with “Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00”. If I unplug the USB drive while CBoot is counting down to autoboot, the system boots up as normal (“boots from” eMMC and “mounts rootfs” from NVMe). I’ve attached the UART log from when I allowed it to panic once and then pulled the USB and let it boot successfully.

I would expect this if I were trying to mount the rootfs from the USB and the kernel didn’t match the built-in eMMC… which I’m not even doing (I’m mounting the rootfs from the NVMe). I’m even more confused as to why this is happening, since the USB section appears to fail as if the USB were skipped. So why does a USB drive (that supposedly has no kernel partition) cause a kernel panic???
Well… I just checked the uartlog.txt you posted yesterday in #15.
This is the first time you have posted a full log here, so I didn’t notice this before.
Even when cboot tries to read extlinux.conf from your eMMC, there is the same error as in the USB boot case. When that happens, our cboot initiates a fallback mechanism to read the kernel from a partition. That is, during the flash process (by flash.sh), our tool not only installs /boot into your rootfs but also flashes a backup kernel into a specific partition. When booting from the file system fails, it will use the kernel partition to boot instead.
[0005.637] I> ########## Fixed storage boot ##########
[0005.642] I> Already published: 00010003
[0005.646] I> Look for boot partition
[0005.649] I> Fallback: assuming 0th partition is boot partition
[0005.655] I> Detect filesystem
[0005.670] I> ext4_mount:588: Failed to allocate memory for group descriptor
[0005.671] E> Failed to mount file system!!
[0005.671] E> Invalid fm_handle (0xa06964a8) or mount path passed
[0005.675] I> Fallback: Load binaries from partition
[0005.679] W> No valid slot number is found in scratch register
[0005.685] W> Return default slot: _a
[0005.689] I> A/B: bin_type (37) slot 0
[0005.692] I> Loading kernel from partition
I think that explains why your kernel update method does not work when you update it on the built-in eMMC: the boot process does not read it at all.
Even in your latest successful boot from the NVMe drive, the kernel is still loaded from the partition.
[0005.637] I> ########## Fixed storage boot ##########
[0005.642] I> Already published: 00010003
[0005.646] I> Look for boot partition
[0005.649] I> Fallback: assuming 0th partition is boot partition
[0005.655] I> Detect filesystem
[0005.670] I> ext4_mount:588: Failed to allocate memory for group descriptor
[0005.671] E> Failed to mount file system!!
[0005.671] E> Invalid fm_handle (0xa06964a8) or mount path passed
[0005.675] I> Fallback: Load binaries from partition
[0005.679] W> No valid slot number is found in scratch register
[0005.685] W> Return default slot: _a
Thus, I think we should first check why even the eMMC boot fails from the beginning… Is it a pure image from sdkmanager?
Yep everything on the eMMC drive was direct from the sdkmanager.
While I was getting the host set up I had booted up the Jetson just to poke around (so at that point it had the L4T that came pre-installed)
Then I flashed the Jetson from the host using the desktop UI and ran the Jetson like that for a bit while contemplating storage issues and “modern” software spreading all over the file system
Then I cloned the eMMC APP partition to several other drives (UFS cards, USB drives, and NVMe) while testing booting (but never altered the eMMC)
Then yesterday I used the flash utility from the host to set the rootfs to the NVMe. As a byproduct this wiped my eMMC except for the /boot folder, and even that was slimmed down compared to my NVMe clone of the original /boot folder. I didn’t mention this before, but I noticed after that flash (the one that switched the rootfs target to the NVMe) that /boot/Image and /boot/initrd were no longer the same between the eMMC and the NVMe (similar in size, but diff revealed they were no longer identical)
Then finally today I updated the kernel, bootloader, and so on through the software update utility and then cloned the boot directory from the NVMe to the eMMC (altering it from its original sdkmanager flashed form)
Wow, okay, my brain is still processing all this. I don’t have the Jetson up right now, so I’m just thinking out loud. Is it possible to convert the working kernel partition into an image that I can use to replace /boot/Image on my eMMC (and on the NVMe for that matter)? How is the backup kernel partition working when, seconds earlier in the flash process, the same thing would have been written to the APP partition (although apparently corrupted?)? And how is the backup kernel partition still working after I updated the kernel in the OS? Wouldn’t there be a version mismatch, or did the backup kernel partition on the non-rootfs drive (the eMMC) somehow get updated by the OTA updates? If the backup kernel partition gets modified by the OTA updates, then why doesn’t the /boot folder on the eMMC also get updated, which would negate the need for me to copy the updates from the NVMe /boot before rebooting? (Although I guess if the kernel on the eMMC APP partition is never read, my copy operation was more or less ignored in this case… but then how did the kernel partition on the eMMC get updated by the OTA updates? Did it not get updated? It says it’s updated.) I’m quite confused about the implications of these UART findings.
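If it is possible, I’m guessing the first step would be something like dumping that partition from the running system so I can at least compare it against /boot/Image (a sketch; I’m assuming the A-slot kernel partition is exposed as /dev/disk/by-partlabel/kernel on my unit):

# dump the kernel partition for inspection/comparison (partition label assumed)
sudo dd if=/dev/disk/by-partlabel/kernel of=$HOME/kernel_partition.img bs=1M
# I expect this to be a signed boot image rather than a bare Image, so it may not be a drop-in replacement
file $HOME/kernel_partition.img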
Also, I’m not sure what this means… my kernel was updated successfully without issue, which is why I’m further confused as to how the OTA updates could have modified the backup kernel partition but left the eMMC APP partition alone (although maybe that’s because the system didn’t know about the kernel location on the APP partition but did know about the location of the working, booted backup kernel partition… and hence updated that).
IMO, we first need to ask whether formatting the eMMC with sdkm is an option for you. I would like to figure out why even sdkm would cause a file system error here. I don’t think you should dig into the kernel backup or anything else while the system is already messed up.
As for your question, I think OTA would update both of them. But OTA does not expect cboot to be unable to read your file system.
So didn’t the flash I performed yesterday (to switch the rootfs to the NVMe) technically reformat the entire eMMC and re-write all the partitions and everything in there? I’m pretty sure I saw that happening in the log on the host, and I had USB issues before that flash and continue to have them after it. Could it be that the images/packages the sdkm downloaded are themselves partially corrupted (I’m still confused about the working backup kernel partition and the corrupted /boot/Image if they came from the same place)? Does the sdkm do something like a checksum on the original downloads?
So didn’t the flash I performed yesterday (to switch the rootfs to the NVMe) technically reformat the entire eMMC and re-write all the partitions and everything in there?
Yes, it should. Sorry about that. I am used to switching between topics here, so I may forget what you’ve already tried after reading too many topics from others…
So far, my guess is that when “cboot” tries to read the file system from any of the storage on your side, it has a problem; however, when the “kernel” tries to read the same file system, it works.
I have this guess because it sounds like all the file systems here for USB/NVMe are cloned from the eMMC, right? So if the eMMC was corrupted from the beginning, the same corruption would be on the USB too.
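In the meantime, it might also help to dump the superblock details of the eMMC APP partition from the running system, so we can see whether anything there looks unusual to cboot but fine to the kernel (a sketch; the device node is assumed to be /dev/mmcblk0p1 on your side):

# print ext4 superblock/feature information for the eMMC APP partition (device node assumed)
sudo dumpe2fs -h /dev/mmcblk0p1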
Please give me some time to discuss with the internal team. The engineers are in different timezones, so I need your kind patience.
In the meantime, while waiting for their feedback, could you also try removing the driver package installed by your sdkm and letting it download/flash again? I mean removing the BSP on the host side and letting sdkm do a clean download again.
Or you can directly download the BSP and rootfs from our DLC and set them up manually (no sdkm required and no need to remove anything; they are just separate files).
The “Quick Start Guide” on that page will also walk you through the steps.
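Roughly, the manual setup is just extracting the BSP, unpacking the sample rootfs into it, applying the binaries, and flashing (a sketch based on the Quick Start Guide; the exact filenames and board config depend on the release and board you download for):

# on the host, with the BSP and sample rootfs packages from the download center in the current directory
# (filenames and "jetson-xavier" board config assumed; adjust for your release/board)
tar xf Tegra186_Linux_R32.5.1_aarch64.tbz2
cd Linux_for_Tegra/rootfs
sudo tar xpf ../../Tegra_Linux_Sample-Root-Filesystem_R32.5.1_aarch64.tbz2
cd ..
sudo ./apply_binaries.sh
# put the Jetson into recovery mode, then flash the eMMC
sudo ./flash.sh jetson-xavier mmcblk0p1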
Okay, back at it. My connection is a bit spotty since I’m currently traveling, but I’ll re-download everything as soon as I’m able. In the meantime I generated the md5 checksums for all the files the sdkm downloaded previously. If anyone has time to run the same command and post their results, I would really appreciate it.
For those wondering what this is all about, comparing these md5 checksums with md5 checksums from known working downloads will help me identify corrupted downloads on my end (with only the very rare exception of hash collisions and assuming other people do not also have corrupted downloads).
Command (change hogank and/or the download location if yours differ):
find /home/hogank/Downloads/nvidia/sdkm_downloads/ -type f -exec md5sum "{}" + > /home/hogank/Desktop/sdkm_downloads_md5.txt
Results: located in /home/hogank/Desktop/sdkm_downloads_md5.txt (removed all the path prefixes below for brevity)
Found the sha1 checksums for the latest release (undocumented, but mirrors the listings for previous releases in the Jetson Download Center) at this URL. If anyone prefers sha1 for any reason, I’ve updated the command and results below. The sha1 checksums listed for the release only cover a subset of the files the sdkm downloaded, so I can only compare a few. I would still appreciate it if someone would run this on their end and post their results.
Command (change hogank and/or the download location if yours differ):
find /home/hogank/Downloads/nvidia/sdkm_downloads/ -type f -exec sha1sum "{}" + > /home/hogank/Desktop/sdkm_downloads_sha1.txt
Results: located in /home/hogank/Desktop/sdkm_downloads_sha1.txt (again removed all the path prefixes below for brevity)
And a quick comparison shows that both of the following files share an identical sha1 checksum (computed on my host) with the list of sha1 checksums published with the release.
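In case anyone wants to line their list up against mine (or against the published one), something like this should work regardless of where the files live (a sketch; theirs_sha1.txt is a placeholder for whichever list you’re comparing against):

# strip the directory prefixes, sort by filename, and diff the two lists (theirs_sha1.txt is hypothetical)
sed 's#  .*/#  #' sdkm_downloads_sha1.txt | sort -k2 > mine_sorted.txt
sed 's#  .*/#  #' theirs_sha1.txt | sort -k2 > theirs_sorted.txt
diff mine_sorted.txt theirs_sorted.txt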
Okay, diving into a comparison of the official L4T Driver Package (BSP), Tegra186_Linux_R32.5.1_aarch64.tbz2, against my pre-existing Linux_for_Tegra directory (excluding rootfs for now, since that isn’t really part of the BSP; for me this is /home/hogank/nvidia/nvidia_sdk/JetPack_4.5.1_Linux_JETSON_AGX_XAVIER/Linux_for_Tegra). All 436 files from Tegra186_Linux_R32.5.1_aarch64.tbz2 were present in my Linux_for_Tegra directory. Of those, 4 files came back with different checksums (possibly modified during the sdkm flash process?), and 46 additional files existed only in my Linux_for_Tegra directory (presumably generated by the sdkm during flashing). I’ve listed the breakdown below (and a rough sketch of the comparison commands at the end of this post); I’ll move on to the rootfs next.
4 files common to Tegra186_Linux_R32.5.1_aarch64.tbz2 and Linux_for_Tegra with different checksums - presumably modified by the sdkm during flashing
/bootloader/adsp-fw.bin
/bootloader/eks.img
/bootloader/nvtboot_applet_t194.bin
/bootloader/spe_t194.bin
46 files found only under my pre-existing Linux_for_Tegra directory - presumably generated by the sdkm during flashing
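In case anyone wants to run a similar comparison, something along these lines works (a sketch; the paths below are mine, adjust as needed):

# checksum everything in a clean extraction of the BSP (paths assumed; adjust to your setup)
mkdir -p ~/bsp_check && tar xf Tegra186_Linux_R32.5.1_aarch64.tbz2 -C ~/bsp_check
(cd ~/bsp_check/Linux_for_Tegra && find . -type f -exec md5sum "{}" + | sort -k2) > ~/bsp_md5.txt
# checksum the pre-existing Linux_for_Tegra, skipping rootfs
(cd /home/hogank/nvidia/nvidia_sdk/JetPack_4.5.1_Linux_JETSON_AGX_XAVIER/Linux_for_Tegra && find . -path ./rootfs -prune -o -type f -exec md5sum "{}" + | sort -k2) > ~/mine_md5.txt
# entries present in only one list, or with differing hashes, show up in the diff
diff ~/bsp_md5.txt ~/mine_md5.txt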