Cannot log in to GUI after updating kernel Image and stripped nvgpu.ko

Dear Experts,

I have a login issue after updating the kernel Image (which includes a modified built-in driver), similar to the Nvidia jetson xavier nx Gui login issue topic:

On the host PC, after recompiling the kernel:

  • Strip the nvgpu.ko
    ./work/common/gcc/bin/aarch64-buildroot-linux-gnu-strip --strip-unneeded ./work/xavier/Linux_for_Tegra/kernel/l4es/kernel/debian/l4es-nvidia-l4t-kernel/lib/modules/5.10.104/kernel/drivers/gpu/nvgpu/nvgpu.ko -o ~/Downloads/nvgpu.ko

  • Copy the new kernel Image to the board:
    scp ./work/xavier/Linux_for_Tegra/kernel/l4es/kernel/debian/l4es-nvidia-l4t-kernel/boot/Image rtr@192.168.55.1:/tmp
    scp ./work/xavier/Linux_for_Tegra/kernel/l4es/kernel/debian/l4es-nvidia-l4t-kernel/boot/Image.t19x.sig rtr@192.168.55.1:/tmp

  • Copy the stripped nvgpu.ko to the Jetson board:
    scp /home/rtr/Downloads/nvgpu.ko rtr@192.168.55.1:/tmp

On the Jetson board:

  • Update the kernel Image:
    $ sudo su
    # cd /boot/
    # cp Image Image.bak
    # cp Image.sig Image.sig.bak
    # cp /tmp/Image Image.new
    # cp /tmp/Image.t19x.sig Image.sig.new
    # cp /tmp/Image Image
    # cp /tmp/Image.t19x.sig Image.sig

  • Update nvgpu.ko:
    # cd /lib/modules/5.10.104-tegra/kernel/drivers/gpu/nvgpu/
    # cp nvgpu.ko nvgpu.ko.bak
    # cp /tmp/nvgpu.ko nvgpu.ko.new
    # cp /tmp/nvgpu.ko nvgpu.ko

  • Reboot the board: it then gets stuck, with many failures such as “Failed to mount Arbitrary Executable File Formats” and others.

If I revert Image/Image.sig and nvgpu.ko to their backups:
    # cd /boot/
    # cp Image.bak Image
    # cp Image.sig.bak Image.sig

    # cd /lib/modules/5.10.104-tegra/kernel/drivers/gpu/nvgpu/
    # cp nvgpu.ko.bak nvgpu.ko

then it boots and shows the login window normally.

Could you please help point out the missing step(s)?

Debug log:
JetsonXavierNX_NoGUI.log (177.8 KB)

Boot screen: IMG_2865.MOV (Google Drive)

Thanks in advance and best regards,

Khang

If you cross-compiled, you may have built something for the desktop PC architecture rather than for the Jetson’s 64-bit ARM architecture. On your host PC, find the Image file you used; what do you see from:
file Image
(It is the ARM designation we are looking for, usually listed as “Aarch64”; any mention of x86/amd64/x86_64 means failure is guaranteed.)
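As a complement to `file Image`, an uncompressed arm64 kernel Image carries the magic value 0x644d5241 ("ARM\x64", i.e. the bytes "ARMd") at byte offset 56 of its header, per the kernel's arm64 boot protocol documentation. The sketch below checks for it; the function name `is_arm64_image` is hypothetical, and it applies only to the plain Image, not to compressed or signed variants.

```shell
# Hypothetical helper: succeed if the file has the arm64 Image header magic
# "ARM\x64" (bytes "ARMd") at byte offset 56; fail otherwise.
is_arm64_image() {
    img="$1"
    # Read exactly 4 bytes starting at offset 56 (the header's magic field).
    magic=$(dd if="$img" bs=1 skip=56 count=4 2>/dev/null)
    [ "$magic" = "ARMd" ]
}

# Example (path is illustrative):
# if is_arm64_image ./boot/Image; then echo "arm64 Image"; else echo "NOT an arm64 Image"; fi
```

An x86 bzImage or a truncated file will not have these bytes at that offset, so the check fails for them.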

Hi @linuxdev,

Thanks for your advice. I think this may be because the cross-compile framework does not export LOCALVERSION, or because CONFIG_LOCALVERSION in the kernel defconfig is not set to “-tegra”, while all the previous kernel modules (as well as the newly compiled nvgpu.ko) are in /lib/modules/5.10.104-tegra.

I will check again and let you know.

Best Regards,
Khang

If CONFIG_LOCALVERSION is wrong, then all modules must be put into the new module location. If your new kernel only changes modules, then you would want to keep the original -tegra CONFIG_LOCALVERSION. If your new kernel changes integrated features, then you want a new CONFIG_LOCALVERSION (this does not mean empty, but it does mean something other than -tegra). I’m not sure this would result in an exec format error, but it might.

With your new kernel running, what do you see from “uname -r”? Modules would need to be located at:
/lib/modules/$(uname -r)/kernel/
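A small shell sketch of that check: it builds the directory from a kernel release (by default the output of `uname -r`) and reports whether it exists. The function name `check_module_dir` and its optional root-prefix argument are hypothetical, added so it can also be pointed at a mounted rootfs.

```shell
# Hypothetical helper: print the module directory a kernel release will use,
# and whether it exists.
# $1 = kernel release (defaults to `uname -r`), $2 = root prefix (defaults to /)
check_module_dir() {
    release="${1:-$(uname -r)}"
    root="${2:-/}"
    # Strip a trailing slash from the root so "/" yields "/lib/modules/...".
    dir="${root%/}/lib/modules/$release/kernel"
    if [ -d "$dir" ]; then
        echo "found: $dir"
    else
        echo "missing: $dir"
        return 1
    fi
}
```

For a kernel reporting 5.10.104 on a root filesystem that only has a 5.10.104-tegra/ directory, `check_module_dir` prints `missing: /lib/modules/5.10.104/kernel` and returns 1.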

Hi @linuxdev,

With your new kernel running, what do you see from “uname -r”?


[09:24:50:760] rtr@rtr:~$ uname -r
[09:24:53:936] 5.10.104

[09:24:53:936] rtr@rtr:~$ uname -r
[09:25:11:309]  Linux rtr 5.10.104 #4 SMP PREEMPT Thu Jun 13 11:19:06 +07 2024 aarch64 aarch64 aarch64 GNU/Linux

But there is only a ‘5.10.104-tegra’ folder under /lib/modules/:

[09:25:11:309] rtr@rtr:~$ ls /lib/modules/
[09:33:24:029] 5.10.104-tegra

Regards,
Khang

You failed to set CONFIG_LOCALVERSION. This serves as the suffix in the output of “uname -r”. If your kernel release is 5.10.104, and you have no CONFIG_LOCALVERSION, then “uname -r” is “5.10.104”. If CONFIG_LOCALVERSION is “-tegra”, then “uname -r” becomes “5.10.104-tegra”. The first issue is that none of your drivers built as modules can be found.
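A rough sketch of that rule: append the quoted CONFIG_LOCALVERSION value from a .config to the base kernel version. The function name `expected_release` is hypothetical. It deliberately ignores other inputs kbuild can use (localversion* files in the source tree, the LOCALVERSION make variable, CONFIG_LOCALVERSION_AUTO), so inside the configured source tree `make kernelrelease` remains the authoritative check.

```shell
# Hypothetical helper: predict `uname -r` from a base version and a .config.
# Assumes no localversion* files, no LOCALVERSION make variable and
# CONFIG_LOCALVERSION_AUTO not set; kbuild can add suffixes for any of those.
expected_release() {
    base="$1"      # e.g. 5.10.104
    config="$2"    # path to the kernel .config
    # Extract the value inside CONFIG_LOCALVERSION="..." (empty if absent).
    lv=$(sed -n 's/^CONFIG_LOCALVERSION="\(.*\)"$/\1/p' "$config" | head -n 1)
    echo "${base}${lv}"
}
```

With `CONFIG_LOCALVERSION="-tegra"` this prints `5.10.104-tegra`; with an empty or missing value it prints `5.10.104`, which matches the “uname -r” output shown above.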

The second issue is that sometimes versions, and the specification of versions, matter. If you simply moved (or used a symbolic link) the -tegra content to a 5.10.104/ directory, then some or all of the modules would still fail to load.
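One concrete place where versions matter is the module’s “vermagic” string, which begins with the kernel release the module was built for; the kernel typically refuses to load a module whose vermagic does not match. On the target, `modinfo -F vermagic nvgpu.ko` shows it. The sketch below is a crude host-side stand-in that scans the file for the NUL-separated `vermagic=` entry instead of parsing ELF sections; the names `module_vermagic` and `vermagic_matches` are hypothetical.

```shell
# Hypothetical helper: print the vermagic string embedded in a kernel module.
# Crude: splits the whole file on NUL bytes and takes the first "vermagic=" entry.
module_vermagic() {
    LC_ALL=C tr '\0' '\n' < "$1" | LC_ALL=C sed -n 's/^vermagic=//p' | head -n 1
}

# Succeed if the module's release (first vermagic field) equals the given release.
# $1 = module file, $2 = release to compare against (e.g. output of `uname -r`)
vermagic_matches() {
    mod_release=$(module_vermagic "$1" | cut -d ' ' -f 1)
    [ "$mod_release" = "$2" ]
}
```

A module built for 5.10.104-tegra would therefore fail `vermagic_matches nvgpu.ko 5.10.104`, even if it were copied or symlinked into a 5.10.104/ directory.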

The question becomes “what to do for CONFIG_LOCALVERSION”? If you are building a full kernel, and the integrated features (meaning “=y”) do not change but the modular features (meaning “=m”) do, then you would want the same “uname -r”. You’d avoid copying the kernel Image (the integrated features file), and copy only modules to the existing “/lib/modules/5.10.104-tegra/kernel/” location (the proper subdirectory thereof).
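For the modules-only case, a minimal sketch of replacing a single module in the existing release directory, keeping a backup and leaving /boot/Image alone. The function name `replace_module` and its argument order are hypothetical; it assumes the module was built with CONFIG_LOCALVERSION="-tegra" against the unchanged configuration.

```shell
# Hypothetical helper: replace one module under an existing release directory,
# keeping a .bak copy; the kernel Image is not touched.
# $1 = root prefix (/ on the board), $2 = release (e.g. 5.10.104-tegra),
# $3 = path relative to kernel/ (e.g. drivers/gpu/nvgpu/nvgpu.ko), $4 = new module
replace_module() {
    root="${1%/}"
    dest="$root/lib/modules/$2/kernel/$3"
    # Refuse to create a new location: the module must already exist there.
    [ -f "$dest" ] || { echo "no existing module at $dest" >&2; return 1; }
    cp "$dest" "$dest.bak" && cp "$4" "$dest"
    # On the board itself, refresh module dependency data afterwards,
    # e.g. `depmod -a "$2"` as root (not done here).
}
```

On the board this would be called as `replace_module / 5.10.104-tegra drivers/gpu/nvgpu/nvgpu.ko /tmp/nvgpu.ko`, mirroring the manual `cp` steps from the first post but without swapping the Image.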

If you intend to compile with different integrated features (a change to “=y” content), then you would want a new “uname -r”, which means you set CONFIG_LOCALVERSION to an alternate name. An example would be “-test” or “-new”. Often I name the CONFIG_LOCALVERSION after something the new feature provides. For example, if I add some iSCSI feature, maybe I would name it “CONFIG_LOCALVERSION=-iscsi”. This does not mean you wouldn’t start with a config that matches the original kernel, but it does imply something “=y” changed. In this case you would install the new kernel Image itself, and install all modules to the new location “/lib/modules/5.10.104-test/kernel/”.
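A minimal sketch of setting the new suffix in a .config before building. The function name `set_localversion` is hypothetical; inside the kernel tree, `scripts/config --set-str LOCALVERSION "-test"` followed by `make olddefconfig` should do the same job, and `make modules_install INSTALL_MOD_PATH=<staging dir>` then places modules under lib/modules/<release>/.

```shell
# Hypothetical helper: set CONFIG_LOCALVERSION in a kernel .config file,
# replacing an existing CONFIG_LOCALVERSION= line or appending one.
# Assumes the value contains no '|', '&' or '\' (special in this sed command).
set_localversion() {
    config="$1"
    value="$2"   # e.g. -test or -iscsi
    if grep -q '^CONFIG_LOCALVERSION=' "$config"; then
        # Write to a temp file and move it back, avoiding non-portable sed -i.
        sed "s|^CONFIG_LOCALVERSION=.*|CONFIG_LOCALVERSION=\"$value\"|" "$config" > "$config.tmp" &&
            mv "$config.tmp" "$config"
    else
        printf 'CONFIG_LOCALVERSION="%s"\n' "$value" >> "$config"
    fi
}
```

After `set_localversion .config -test`, the built kernel should report 5.10.104-test (absent other local-version inputs), so both the Image and the full module set must be installed together.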

