36.3 kernel/module modifications pre-flash

Follow-up to this post 👆. First of all, great release! I noticed that a few of the fixes needed on 36.2 are no longer necessary (like modifying nv_enable_remote.sh, and a couple of common kernel flags).

I followed these steps to flash 36.3 onto my Orin NX. Worked great! Once I logged onto the node, I checked the configuration I had set with nconfig, and only the built-in kernel changes stuck. The features I set as modules either didn’t get set properly or were overridden during the process.

I saw the logs mention something about copying oot_modules during sudo ./tools/l4t_flash_prerequisites.sh, so I’m not sure if that might have clobbered my config?

Anything I missed in my “script” to allow the modules to flash properly? Thanks for the help! cc @DaveYYY

I have no idea what the issue is.

These two are totally unrelated.

Hello, Dave!

The issue is that the nconfig changes that I made pre-flash didn’t all flash. Image flags did, but module flags did not. Hoping I just missed something during the steps that I captured in that gist.

I read the script.
Apparently you didn’t install anything you built before flashing, so of course it will not take effect…
You’d better try it step by step before putting it all together in a complicated script.

Did you ever read the document?
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/Kernel/KernelCustomization.html

Thanks Dave! I had gone over that doc a few times, but missed that the Makefile in Linux_for_Tegra/source/kernel has a separate install target 😅. I also just double-checked, and I was wrong: the built-in (Y) kernel flags didn’t make it over, either 😕. I added the following to my step-by-step script:

export INSTALL_MOD_PATH=~/nvidia/Linux_for_Tegra/rootfs/
sudo make install

After the flash, nothing seems to be set (unless I’m checking incorrectly too…). None of these are OOT modules, so not sure what I’m missing here.

$ REQUIRED_CONFIGS=(...)
$ for config in "${REQUIRED_CONFIGS[@]}"
do
    zcat /proc/config.gz | grep "${config}[ =]"
done

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
# CONFIG_NET_CLS_BPF is not set
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=m
CONFIG_CGROUPS=y
CONFIG_CGROUP_BPF=y
CONFIG_PERF_EVENTS=y
CONFIG_SCHEDSTATS=y
# CONFIG_NETFILTER_XT_TARGET_TPROXY is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_SOCKET is not set
# CONFIG_BLK_DEV_THROTTLING is not set
# CONFIG_NET_CLS_CGROUP is not set
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_IP_SET is not set
# CONFIG_IP_VS_NFCT is not set
# CONFIG_IP_VS_PROTO_TCP is not set
# CONFIG_IP_VS_PROTO_UDP is not set
# CONFIG_IP_VS_RR is not set
# CONFIG_CRYPTO_SEQIV is not set
# CONFIG_XFRM_USER is not set
# CONFIG_INET_ESP is not set
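
A stricter variant of the check loop above, as a minimal sketch. It assumes the running config has first been dumped to a file; `check_configs` is a hypothetical helper name, not part of any NVIDIA tooling. It anchors the match on `^CONFIG_FOO=` so that "is not set" lines count as missing, and it returns non-zero if anything is missing.

```shell
# Report the state of each kernel option in a config file.
# Usage: check_configs CONFIG_FILE OPTION...
# Prints "OPTION=value" for set options and "OPTION missing" otherwise;
# returns non-zero if any option is missing.
check_configs() {
    cfg_file=$1
    shift
    missing=0
    for opt in "$@"; do
        # Anchor on "^OPTION=" so CONFIG_BPF does not match CONFIG_BPF_JIT,
        # and "# OPTION is not set" lines are treated as missing.
        line=$(grep -E "^${opt}=" "$cfg_file" || true)
        if [ -n "$line" ]; then
            echo "$line"
        else
            echo "${opt} missing"
            missing=1
        fi
    done
    return "$missing"
}

# On the Jetson (assumes /proc/config.gz is available, as in the post above):
#   zcat /proc/config.gz > /tmp/running.config
#   check_configs /tmp/running.config CONFIG_NET_CLS_BPF CONFIG_IP_SET
```

The non-zero return makes it easy to fail a provisioning script early when a required option did not survive the flash.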

You didn’t even install the kernel image itself…
That command only installs in-tree modules.

Also, don’t run this after you build the kernel.
Run it before you build the kernel:

sudo ./apply_binaries.sh

Otherwise, everything you installed is overwritten.
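
The ordering described above can be sketched as a dry run that only prints the commands. This is an illustration under assumptions: `print_steps` is a hypothetical helper, the default path is the one used earlier in this thread, and the `make -C kernel` invocations reflect my reading of the linked Kernel Customization guide, so check them against the doc before relying on them.

```shell
# Print (do not run) the host-side steps in the order discussed above:
# apply_binaries.sh first, then build, then install into the staging rootfs.
# Usage: print_steps [LINUX_FOR_TEGRA_DIR]
print_steps() {
    l4t=${1:-"$HOME/nvidia/Linux_for_Tegra"}
    cat <<EOF
cd $l4t && sudo ./apply_binaries.sh
cd $l4t/source && make -C kernel
export INSTALL_MOD_PATH=$l4t/rootfs/
cd $l4t/source && sudo -E make install -C kernel
EOF
}
```

Printing the steps rather than executing them keeps the sketch safe to paste; each line can then be run by hand, one at a time.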

Aha! That’s the step that was clobbering things. I got the two mixed up. Importantly, I also needed to call sudo -E make install; I didn’t realize that -E is what passes my environment variables (INSTALL_MOD_PATH here) through sudo to the Makefile. Without it, make install was just installing to the running system on my flashing host! No rebooting that machine now 😳. I’ve made progress, albeit with a new error after the updated steps.
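
To avoid repeating that mistake, a small guard can refuse to proceed when INSTALL_MOD_PATH is empty or points at the host root. This is a minimal sketch; `check_mod_path` is a hypothetical helper, not something shipped with Linux_for_Tegra.

```shell
# Refuse to continue unless INSTALL_MOD_PATH points at a staging rootfs.
# An empty value or "/" would mean installing onto the build host itself.
check_mod_path() {
    case "${INSTALL_MOD_PATH:-}" in
        ""|"/")
            echo "INSTALL_MOD_PATH is unset or '/': refusing to install" >&2
            return 1
            ;;
    esac
    if [ ! -d "$INSTALL_MOD_PATH" ]; then
        echo "INSTALL_MOD_PATH is not a directory: $INSTALL_MOD_PATH" >&2
        return 1
    fi
    echo "installing into $INSTALL_MOD_PATH"
}
```

One way to use it is `check_mod_path && sudo -E make install`. Since make also accepts variables on its command line, `sudo make install INSTALL_MOD_PATH="$INSTALL_MOD_PATH"` should work as an alternative that does not depend on sudo preserving the environment, which a sudoers policy can restrict.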

****************************************************
*                                                  *
*  Step 2: Boot the device with flash initrd image *
*                                                  *
****************************************************
...
...
***************************************
*                                     *
*  Step 3: Start the flashing process *
*                                     *
***************************************
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Timeout
Device failed to boot to the initrd flash kernel. Please retrive the serial log during flashing to debug further.
Cleaning up...

Any ideas if I flubbed something?

Device failed to boot to the initrd flash kernel. Please retrive the serial log during flashing to debug further.

install:
	@echo   "================================================================================"
	@echo   "Installing $(KERNEL_SRC_DIR) sources"
	@echo   "================================================================================"
	install $(kernel_image) $(INSTALL_MOD_PATH)/boot/
	$(MAKE) \
		ARCH=arm64 \
		-C $(kernel_source_dir) $(O_OPT) \
		LOCALVERSION=$(version) \
		INSTALL_MOD_PATH=$(INSTALL_MOD_PATH) \
		modules_install
	@echo   "================================================================================"
	@echo   "Kernel and in-tree modules installed successfully."
	@echo   "================================================================================"

Linux_for_Tegra/kernel/Image will be copied into Linux_for_Tegra/rootfs/boot/Image during flashing.
So if you only install the latter but not the former, which is what this Makefile does, you will still be flashing the old stock kernel image, which is not compatible with your new modules.
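
A quick pre-flash check for this, as a sketch: compare the freshly built Image with the one staged in Linux_for_Tegra/kernel. `image_is_current` is a hypothetical helper, and the example paths in the comments are assumptions based on the kernel-jammy-src source tree; adjust them to your layout.

```shell
# Return 0 if the staged kernel image is byte-identical to the freshly built one.
# Usage: image_is_current BUILT_IMAGE STAGED_IMAGE
image_is_current() {
    built=$1    # e.g. source/kernel/kernel-jammy-src/arch/arm64/boot/Image (assumed path)
    staged=$2   # e.g. Linux_for_Tegra/kernel/Image
    if cmp -s "$built" "$staged"; then
        echo "staged Image matches the build"
    else
        echo "staged Image differs from the build; copy it before flashing" >&2
        return 1
    fi
}
```

If the check fails, copying the built Image over Linux_for_Tegra/kernel/Image before flashing is the step the doc describes.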

I think what you’re saying is that copying the Image from source to the kernel dir is a critical step that I missed. I see that very clearly in the aforementioned docs. I guess the bit I was taking for granted is that there are discrepancies between the Makefile in source (which the doc references) and the one in source/kernel.

New error!!

Formatting APP parition done
Formatting APP partition /dev/nvme0n1p1 ...
tar -x -I 'zstd -T0' -pf /mnt/external/system.img  --checkpoint=10000 --warning=no-timestamp --numeric-owner --xattrs --xattrs-include=*  -C  /tmp/ci-leeAYmGC8W
...
...
tar: Read checkpoint 660000
tar: Read checkpoint 670000
tar: Read checkpoint 680000
writing item=17, 9:0:secondary_gpt, 61203267072, 16896, gpt_secondary_9_0.bin, 16896, fixed-<reserved>-0, d00d42428d1f69e5113cb125943da97628f5d8e8
[ 140]: l4t_flash_from_kernel: Successfully flash the external device
[ 140]: l4t_flash_from_kernel: The device size indicated in the partition layout xml is smaller than the actual size. This utility will try to fix the GPT.
[ 140]: l4t_flash_from_kernel: Error flashing qspi
Flash failure
Either the device cannot mount the NFS server on the host or a flash command has failed. Debug log saved to /tmp/tmp.DpLrValHiX. You can access the target's terminal through "sshpass -p root ssh root@fc00:1:1:0::2" 
Cleaning up...

And here’s the output of the logfile it mentions

$ sudo cat /tmp/tmp.DpLrValHiX
[sudo] password for dudo: 
showmount -e
Export list for shit-box:
/home/dudo/nvidia/Linux_for_Tegra/rootfs                    fc00:1:1::/48
/home/dudo/nvidia/Linux_for_Tegra/tools/kernel_flash/images fc00:1:1::/48
/home/dudo/nvidia/Linux_for_Tegra/tools/kernel_flash/tmp    127.0.0.1
systemctl status nfs-kernel-server
● nfs-server.service - NFS server and services
     Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled; vendor preset: enabled)
     Active: active (exited) since Thu 2024-05-09 07:50:39 PDT; 3min 1s ago
    Process: 604518 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
    Process: 604519 ExecStart=/usr/sbin/rpc.nfsd (code=exited, status=0/SUCCESS)
   Main PID: 604519 (code=exited, status=0/SUCCESS)
        CPU: 8ms

May 09 07:50:39 shit-box systemd[1]: Starting NFS server and services...
May 09 07:50:39 shit-box systemd[1]: Finished NFS server and services.
blkid
/dev/nvme0n1p5: UUID="27459e81-5e23-4361-a569-e470acfdff76" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="6d24d904-705f-4abd-98d5-3b56646009d9"
/dev/loop1: TYPE="squashfs"
/dev/nvme0n1p3: BLOCK_SIZE="512" UUID="2202B21902B1F1C1" TYPE="ntfs" PARTLABEL="Basic data partition" PARTUUID="a2f38e46-0e11-4926-a93a-95da6cb18ad5"
/dev/nvme0n1p1: LABEL_FATBOOT="SYSTEM" LABEL="SYSTEM" UUID="48B1-9ED7" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI system partition" PARTUUID="62780323-d569-4bb1-bfd1-c5b8d5df3d27"
/dev/nvme0n1p4: LABEL="Recovery" BLOCK_SIZE="512" UUID="2CAAB272AAB23862" TYPE="ntfs" PARTLABEL="Basic data partition" PARTUUID="6e6ab516-4067-421c-964f-bc9464e01edd"
/dev/loop17: TYPE="squashfs"
/dev/loop15: TYPE="squashfs"
/dev/loop6: TYPE="squashfs"
/dev/loop13: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/loop11: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop0: TYPE="squashfs"
/dev/loop9: TYPE="squashfs"
/dev/loop16: TYPE="squashfs"
/dev/loop7: TYPE="squashfs"
/dev/loop14: TYPE="squashfs"
/dev/loop5: TYPE="squashfs"
/dev/loop12: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop10: TYPE="squashfs"
/dev/nvme0n1p2: PARTLABEL="Microsoft reserved partition" PARTUUID="de4ad5db-51c8-4d01-bbd4-3c0e626ee243"
df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.6G  2.0M  1.6G   1% /run
/dev/nvme0n1p5  288G   44G  229G  17% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
efivarfs        192K   46K  142K  25% /sys/firmware/efi/efivars
/dev/nvme0n1p1   96M   34M   63M  35% /boot/efi
tmpfs           1.6G   92K  1.6G   1% /run/user/1000
Checking folder /home/dudo/nvidia/Linux_for_Tegra/rootfs
drwxr-xr-x 18 root root   4096 May  9 00:06 rootfs
Checking folder /home/dudo/nvidia/Linux_for_Tegra/tools/kernel_flash/images
drwxr-xr-x  4 root root  4096 May  9 07:49 images
-rwxrwxrwx  1 dudo dudo 19233 Apr 24 20:06 l4t_create_images_for_kernel_flash.sh

Seems to be ample room on the flashing host, and the Jetson.

Get your serial console log.

Hi, I stumbled on your gist while trying to install LP 36.3 on my Orin NX, also for Kubernetes.

I have the same issue:

[ 140]: l4t_flash_from_kernel: Error flashing qspi
Flash failure

I SSHed into the Orin (like that script does) and found that /dev/mtd0 isn’t there. In fact, the QSPI driver module doesn’t load at all (version mismatch?):

[   11.769312] nvpps: disagrees about version of symbol module_layout
[   11.775188] tegra_mce: disagrees about version of symbol module_layout
[   11.777282] spi_tegra210_quad: disagrees about version of symbol module_layout

And trying to load it:

-bash-5.1# modprobe spi-tegra210-quad 
modprobe: ERROR: could not insert 'spi_tegra210_quad': Exec format error
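
To list every module affected by this kind of error, the kernel log can be filtered for the "disagrees about version of symbol" message. A minimal sketch; `mismatched_modules` is a hypothetical helper that reads log lines on stdin in the format shown above.

```shell
# Read kernel log lines on stdin and print the unique names of modules that
# were rejected with "disagrees about version of symbol ...".
mismatched_modules() {
    sed -n 's/^.*\] \([^:]*\): disagrees about version of symbol .*$/\1/p' | sort -u
}

# On the target, for example:
#   dmesg | mismatched_modules
```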

Something is wrong with those modules. As a result, QSPI can’t be used to flash the internal memory. I don’t know whether that’s because they weren’t cross-compiled for ARM or because they weren’t built against the right kernel version.
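
One first check for the "wrong kernel version" theory is to compare the release string in the module's vermagic with the running kernel. This is a sketch with a hypothetical helper name, `vermagic_matches`. Note that a module_layout disagreement is a symbol-version (CRC) mismatch, so it can still occur when the release strings match; a match here does not rule out a stale build.

```shell
# Compare the kernel release embedded in a module's vermagic string with a
# kernel release string. The release is the first space-separated field.
# Usage: vermagic_matches "VERMAGIC STRING" KERNEL_RELEASE
vermagic_matches() {
    vermagic=$1   # e.g. output of: modinfo -F vermagic spi-tegra210-quad
    release=$2    # e.g. output of: uname -r
    mod_release=${vermagic%% *}
    if [ "$mod_release" = "$release" ]; then
        echo "match: $release"
    else
        echo "mismatch: module built for $mod_release, kernel is $release"
        return 1
    fi
}

# On the target, for example:
#   vermagic_matches "$(modinfo -F vermagic spi-tegra210-quad)" "$(uname -r)"
```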

My NVMe gets flashed however.

Edit 1:
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/Kernel/KernelCustomization.html

I am now following the Out of tree modules section. It does contain that qspi module that I had issues with, causing the QSPI flash to fail. I am now waiting for this build to complete.

Here are the commands I run just before putting the Orin in recovery mode:

# From Linux_for_Tegra/source
export KERNEL_HEADERS=$PWD/kernel/kernel-jammy-src
make modules

# Your script already sets INSTALL_MOD_PATH, the next line depends on it
make modules_install
cd ..
./tools/l4t_update_initrd.sh

Edit 2:

Ok, the flash completes, and the Orin boots fine!!! I had previously been trying to flash it with SDKManager, but it never got beyond this point:

*  Step 3: Start the flashing process *
*                                     *
***************************************
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
Waiting for target to boot-up...
(Until timeout)

So I just assumed that SDKManager is not compatible with flashing an Orin in a Turing PI 2, using NVMe external storage.

Your script really helped me upgrade to LP 36, so I really hope this post helps you out as well!

FYI, here are the results for those modules (I missed some while menuconfig-ing the kernel):

francois@orin-nx-1:~$ for config in "${REQUIRED_CONFIGS[@]}"; do     zcat /proc/config.gz | grep "${config}[ =]"; done
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=m
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=m
CONFIG_CGROUPS=y
CONFIG_CGROUP_BPF=y
CONFIG_PERF_EVENTS=y
CONFIG_SCHEDSTATS=y
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
CONFIG_BLK_DEV_THROTTLING=y
CONFIG_NET_CLS_CGROUP=y
# CONFIG_CGROUP_NET_PRIO is not set
CONFIG_IP_SET=y
CONFIG_IP_VS_NFCT=y
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_RR=m
CONFIG_CRYPTO_SEQIV=y
CONFIG_XFRM_USER=y
CONFIG_INET_ESP=y

Edit 3:

While SDKManager ALWAYS gets stuck waiting for the target to boot up (I tried 5 times in a row), the manual flashing procedure gets stuck there only sometimes (about 40% of the time). When that happens, I can either repeat the last command or run this to skip quickly to that same step:

./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1   -c tools/kernel_flash/flash_l4t_external.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml"   --showlogs --network usb0 --erase-all --flash-only --keep jetson-orin-nano-devkit internal

So far, whenever I hit the “Waiting for target to boot-up” issue, retrying has always worked.
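
Since retrying has been reliable, the retry can be automated with a small wrapper. A minimal sketch; `retry` is a hypothetical helper, and it assumes the board is still in recovery mode (or is put back into it) between attempts, which I have not verified.

```shell
# Run a command up to MAX times, stopping at the first success.
# Usage: retry MAX COMMAND [ARGS...]
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, prefix the --flash-only command above with `retry 3`. The wrapper returns non-zero if every attempt fails, so a surrounding script can stop there.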

Finally, after the Orin reboots with the flashed kernel, don’t run:

apt update
apt -y upgrade

until you have put the kernel packages on hold:

apt-mark hold nvidia-l4t-kernel*

Otherwise, an updated version of the stock kernel will overwrite all the changes required for Cilium.
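
To make the upgrade conditional on the hold actually being in place, the output of `apt-mark showhold` can be checked first. A minimal sketch; `kernel_packages_held` is a hypothetical helper that reads package names on stdin.

```shell
# Read package names (one per line, e.g. from `apt-mark showhold`) on stdin
# and succeed only if at least one nvidia-l4t-kernel package is listed.
kernel_packages_held() {
    grep -q '^nvidia-l4t-kernel'
}

# On the Jetson, for example:
#   apt-mark showhold | kernel_packages_held && sudo apt update && sudo apt -y upgrade
```

This only confirms that some kernel package is held; it does not verify that every package matched by the pattern above is on hold.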