Custom kernel cannot boot from NVMe when $(uname -r) is changed

Dear community,

I am working on building a custom Jetson Linux kernel natively.

The device: Jetson AGX Orin (it can boot from NVMe normally).

JetPack on device: 6.0 DP (flashed with SDK Manager).

L4T sources used to build the kernel: 36.2.

I have tried 2 different builds:
#1: The sources used to build the kernel are version 36.2 (the same as the L4T running on the machine). I followed the guide from Problem SMB Jetson Nano - #11 by linuxdev. I extracted /proc/config.gz as the configuration file to keep exactly the same configuration as the currently running system (CONFIG_LOCALVERSION=-tegra is set, so $(uname -r) should be 5.15.122-tegra). I build only the kernel image, copy it to /boot/Image-test, and add an entry in /boot/extlinux/extlinux.conf (a sketch of the build commands is further below):

TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      FDT /boot/dtb/kernel_tegra234-p3737-0000+p3701-0005-nv.dtb
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0 nospectre_bhb video=efifb:off console=tty0 

LABEL test
      MENU LABEL test kernel
      LINUX /boot/Image-test
      FDT /boot/dtb/kernel_tegra234-p3737-0000+p3701-0005-nv.dtb
      INITRD /boot/initrd
      APPEND ${cbootargs} root=/dev/nvme0n1p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0 nospectre_bhb video=efifb:off console=tty0 

Everything works just fine and the system is able to boot from NVMe with the new Image.
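For reference, the #1 build was roughly as follows (run natively on the Orin inside the 36.2 kernel source tree; the commands are illustrative, not an exact transcript):

    # Reuse the running kernel's configuration
    zcat /proc/config.gz > .config
    scripts/config --set-str CONFIG_LOCALVERSION "-tegra"
    make olddefconfig
    # Build only the kernel image
    make -j"$(nproc)" Image
    # Install it under a new name; the "test" extlinux entry above points at it
    sudo cp arch/arm64/boot/Image /boot/Image-test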

#2: However, when I change CONFIG_LOCALVERSION so that it does not match the current system, for example CONFIG_LOCALVERSION=-tegra-test ($(uname -r) should now be 5.15.122-tegra-test), and also build the modules, install them with make modules_install to the corresponding location /lib/modules/5.15.122-tegra-test/, copy the kernel Image, and modify /boot/extlinux/extlinux.conf the same way as in #1, the system can no longer boot from NVMe. The startup screen stops at:

ERROR: nvme0n1p1 not found

Here is the log:
boot-5.15.122-tegra-test-fail.log (58.8 KB)

Please note that for the #2 build, I only built the kernel image and the modules (and installed them by copying the Image and running make modules_install, avoiding any kind of flashing), and left everything else like dtbs, rootfs, initrd… untouched. Meanwhile, the NVMe driver was built into the kernel (not as a separate module): CONFIG_BLK_DEV_NVME=y.
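To make #2 concrete, the commands were roughly (again native, illustrative):

    # Same .config as #1, but with a different local version and NVMe built in
    scripts/config --set-str CONFIG_LOCALVERSION "-tegra-test"
    scripts/config --enable CONFIG_BLK_DEV_NVME
    make olddefconfig
    make -j"$(nproc)" Image modules
    # Installs under /lib/modules/5.15.122-tegra-test/ because that is the new kernel's release string
    sudo make modules_install
    sudo cp arch/arm64/boot/Image /boot/Image-test
    # then the extlinux.conf test entry is updated the same way as in #1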

Any idea what is going wrong? What are the correct steps to build the kernel if I wish to modify CONFIG_LOCALVERSION so that it differs from the currently running kernel? Thanks for your help.

When the output of “uname -r” changes, the search location for modules also changes. CONFIG_LOCALVERSION is part of both that output and that path.

If external media is not used, then the kernel is loaded from the same filesystem that has the new “/lib/modules/$(uname -r)/kernel” module location. The first question is: Did you put all modules in the new module location? The changes you are describing require putting all modules in that new path.
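A quick way to check (a sketch, assuming a native build in the kernel source/output tree): the release string the new Image will report is recorded by the build, and the module directory name must match it exactly:

    # In the kernel build tree, after building:
    cat include/config/kernel.release
    # e.g. 5.15.122-tegra-test
    # The installed modules must live under exactly that name:
    ls /lib/modules/5.15.122-tegra-test/kernel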

The fact that you built the driver into the kernel with “=y” is good. The kernel does not need this driver in an initrd (initial ramdisk, which acts as an adapter for more or less chain loading different root filesystems). However, for external media, you still need an initrd (e.g., with l4t_initrd_flash.sh; there is a README for this in “tools/kernel_flash/”).

An initial ramdisk has a copy of the boot-critical modules in it. For example, if the rootfs has an XFS filesystem, but the bootloader only understands ext4, then an initrd would have the XFS module in it, which loads into that particular kernel. Boot could then continue, and the initrd filesystem (it is just a compressed cpio archive with a tree structure, unpacked into RAM) would perform a pivot_root or equivalent to transplant the new “/” onto the XFS filesystem.

In your case, if the initrd has any module requirements (beyond NVMe drivers), and if you did not update the initrd to contain the modules for this requirement such that they have been recompiled to work with your Image, then boot would fail. I can’t guarantee it, but I suspect the reason for failure is that the initrd is incomplete. You could put the modules in the correct place (look up the README) and do an initrd flash. This would cause the modules needed for boot to go into the stage right before switching to NVMe.
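A sketch of how to look inside the initrd (assuming “/boot/initrd” is a gzip-compressed cpio file, which is typical for the file-based case; check with “file /boot/initrd” first):

    # List contents without unpacking (lsinitramfs comes from Ubuntu's initramfs-tools):
    lsinitramfs /boot/initrd | grep '\.ko'
    # Or unpack a copy by hand:
    mkdir /tmp/initrd-inspect && cd /tmp/initrd-inspect
    zcat /boot/initrd | cpio -idm
    find . -name '*.ko*'

Any modules found in there were built against a specific kernel release; that is the list that would need to work with your new Image.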

Thank you very much for your guidance!

The first question is: Did you put all modules in the new module location? The changes you are describing require putting all modules in that new path.

In the first place, I think I have put all the modules in the right place. I installed the modules with make modules_install and checked the path /lib/modules/$(uname -r) (where $(uname -r) is the modified one, 5.15.122-tegra-test).

In your case, if the initrd has any module requirements (beyond NVMe drivers), and if you did not update the initrd to contain the modules for this requirement such that they have been recompiled to work with your Image, then boot would fail.

I was thinking at first that the problem comes from the initrd, but the weird thing is that, in fact, I did not modify any of the modules during the custom builds, so the newly built modules should be exactly the same as those of my running system, except that I moved the NVMe driver into the kernel. I also tried not building the NVMe driver into the kernel (so in that case the custom-built kernel and modules should be truly identical to the running system, with only a different $(uname -r)), and this also failed to boot. Thus, is it reasonable for me to believe that reusing the existing initrd from the original system ought to be fine? It is confusing why a working initrd does not work for an exact copy of the kernel and modules. In this situation I do not even understand why I should build the NVMe driver into the kernel, since I did not change any kernel configuration except its $(uname -r). BTW, I did this seemingly useless exercise just for learning, and I wish to understand the kernel-building workflow.

Thanks for your time!

In the above, keep in mind that “uname -r” depends on the kernel which is currently running. If you cross compile on a separate host PC, then the command would show the host PC’s “uname -r”; if you natively compile directly on the Jetson, then the command would show what the current kernel uses, not what the newly built kernel uses. Were you natively compiling directly on the Jetson? The “make modules_install” would do the right thing from a native compile (this command knows the difference between the current kernel and the new kernel). You’d see the modules at the future “/lib/modules/$(uname -r)/kernel” (or the same location if nothing changes in uname -r).

However, the boot chain on a Jetson is different than that on a desktop PC since there is no BIOS. Updating a kernel on a desktop PC (which standardizes boot) would likely result in triggering an initrd build and updating GRUB. On a Jetson this would add content to the module location on the disk, but it would fail to update the initrd. You’d have to intervene to update the initrd, or have flash build the new initrd.

If you know which initrd is used, then it is possible to manually unpack it, copy modules in, repack it, and put it in place if it is in the form of a file. If it is in a partition as binary data though, you’d have to flash it. Keep in mind that changing an initrd will change its size, and that a previous partition (if it uses a partition for this) may not be large enough (and all surrounding partitions would have to move). Furthermore, aside from the rootfs partition, all eMMC model partitions must be signed during flash, so a simple dd copy (even if the size is within limits) would fail without effort to properly sign it.
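A rough sketch of the file-based case only (not the signed-partition case), assuming a gzip-compressed cpio initrd and your hypothetical “5.15.122-tegra-test” release:

    mkdir /tmp/initrd-work && cd /tmp/initrd-work
    zcat /boot/initrd | sudo cpio -idm
    # Copy in the modules built against the new kernel
    # (in practice only the boot-critical subset is needed):
    sudo mkdir -p lib/modules
    sudo cp -a /lib/modules/5.15.122-tegra-test lib/modules/
    # Rebuild the module dependency data inside the unpacked tree:
    sudo depmod -b "$PWD" 5.15.122-tegra-test
    # Repack under a new name and keep the original as a fallback:
    find . | sudo cpio -o -H newc | gzip -9 | sudo tee /boot/initrd-test > /dev/null

You would then point the test entry’s INITRD line at the repacked file.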

The act of changing the Image, which implies adding or removing any integrated “=y” feature/symbol (contrast with a module), can cause existing modules to fail to load. Think of the sum total of all “=y” integrated features as a function signature, and the modules are built to load into the ABI of that signature.

I just don’t know what is in your initrd, so I can’t really do more than speculate that there is a kernel module there which is required to load, and now cannot load. The original initrd can quite easily be a problem when changing a kernel and not updating the modules within the initrd. It is true that if you did not change the kernel’s “=y” (integrated) symbols, then the original initrd should work (if and only if your module changes are not related to the disk being booted as the rootfs; we’re assuming the boot target also remained constant).

In reality it is perhaps easier to build any initial boot requirements directly into the kernel, but that does not answer whether or not there were already other modules in the initrd that are required and might now fail to load due to that change. If at any time you changed a symbol to “=y”, and then installed modules such that they overwrote the previous modules (meaning the CONFIG_LOCALVERSION and kernel source version remain constant), then your previous modules are gone. Getting the new modules to load in the old kernel is just as problematic as getting the old modules to run in the new kernel.

I am going to suggest that you first back up and save your rootfs. Many of the flash procedures for external media won’t alter the external media, but there is a risk that a new initrd flash will in fact change something. I recommend that you use the official docs for your L4T release to find out how to add your customized kernel to the host PC’s flash software, and then flash again (an initrd flash). If all modules and the new kernel are correctly in place, then this would generate a new initrd with all requirements.
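The invocation is roughly of this form (illustrative only; the flags, board name, and external-device XML vary between releases, so the README in “tools/kernel_flash/” for your exact L4T version is authoritative):

    # From the host PC's Linux_for_Tegra directory, with the new Image and modules
    # already placed into the flash software per the kernel customization docs:
    sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
         -c tools/kernel_flash/flash_l4t_external.xml --external-only \
         jetson-agx-orin-devkit nvme0n1p1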

Another twist to make this more difficult is that kernels can exist in two places: Either in “/boot” or in a partition. Normally the “/boot” version takes precedence, and the partition is a fallback. Sometimes something changes and the kernel actually being loaded might not be the one you think is loading. Before determining anything about which kernel is loaded you would need to have a full serial console boot log to tell you which one is used.
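Once a kernel does boot, a quick sanity check of which Image actually loaded (a sketch; during a failed boot the serial console log is still the only reliable source):

    # The release string and build timestamp of the kernel actually running:
    uname -r
    cat /proc/version
    # The very first kernel log line also identifies it:
    dmesg | grep -m1 'Linux version'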
