LightDM freezes after kernel replacement

After applying some kernel patches and replacing the kernel image on the TX2, I am experiencing issues with LightDM. The system boots all the way to the LightDM login screen, but the display, keyboard, and mouse freeze after a few seconds. If I can log in before it freezes, I am left with a stable system showing the background and mouse pointer, but no window manager.

If left on the frozen login screen, it reboots after a few minutes.

Ctrl-Alt-F(1-6) freeze the system immediately, and do not present a terminal.
When I can get past the login screen, I am unable to get Ctrl-Alt-T to bring up a terminal.
Once it is frozen, it will restart if given the Alt-SysRq-R-E-I-S-U-B magic command.

On a separate TX2, I was left in a similar broken window manager state after patching the kernel, but was able to get a terminal with Ctrl-Alt-T and restore the system by nuking the compiz cache. However, that doesn’t seem to be an option in this case.

Does anyone have any thoughts or insights? Is a serial console my only option for getting a terminal?

Since magic SysRq is working, at least some basic part of the system is still running. Exploring whatever is wrong will require either working ssh or a serial console.

If it does turn out that you can reach a terminal, then you should see if this command shows all “ok” for files:

sha1sum -c /etc/nv_tegra_release
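If the list is long, it can help to hide the entries that pass. This is only a sketch: the function name `failed_checksums` is made up for illustration, and it assumes GNU `sha1sum` and `grep` are available on the device.

```shell
# Print only the entries that do NOT verify (failed or unreadable files).
# Usage: failed_checksums [checksum_file]
# The default path is the one from the command above; the function name is hypothetical.
failed_checksums() {
    list="${1:-/etc/nv_tegra_release}"
    # sha1sum -c prints "<path>: OK" for every file that matches.
    # LC_ALL=C keeps that message untranslated so the filter works.
    LC_ALL=C sha1sum -c "$list" 2>/dev/null | grep -v ': OK$'
}
```

No output means every listed file matched its recorded checksum.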

It seems I was able to get over to a virtual terminal by way of alt-printscr-r + ctrl-alt-f1 at just the right moment during bootup. This lets me at least peek at the kernel ring buffer, but it doesn’t seem to ever get as far as a login prompt. I am seeing a number of lines along the lines of “rcu_preempt detected stalls on CPUs/tasks:”, followed by a call / task dump.
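For anyone who manages to capture the log (for example over serial console), a rough way to see how often the stall is reported is to count the matching lines. This is a sketch only: `rcu_stalls` is a made-up helper name, and the search string is taken from the message quoted above.

```shell
# Count RCU stall reports in a saved kernel log, or in the live ring buffer
# when no file is given. The function name is hypothetical.
# Usage: rcu_stalls [logfile]
rcu_stalls() {
    if [ -n "${1:-}" ]; then
        grep -c 'detected stalls on CPUs/tasks' "$1"
    else
        dmesg | grep -c 'detected stalls on CPUs/tasks'
    fi
}
```

The call and task dump that follows each report is usually the more useful part, so the count is just a quick indicator.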

Not sure what I did to break something, but I guess that is the nature of playing with the kernel. Kernel was built using the same source version and .config from the device.

Some reading suggests that the stall may be overcome by force-nice’ing all real-time tasks (alt-printscr-n), however, the system reports failing to stop a CPU and reboots after 5 seconds.

Clearly, something is significantly broken. Perhaps by not including modules, I messed up a link somewhere - I only copied in the image / zImage.

Will try pushing the old kernel image back on with the flash tool.

On the 64-bit L4T releases zImage is not used, only Image.

I am not sure if I am interpreting this correctly, but did you use the original “/proc/config.gz”, and then edit this to be all integrated features and no modular features? If so, then this would probably explain a failure since some features must be a module.

If “by not including modules” means the config is the same, but no modules exist in “/lib/modules/$(uname -r)/”, then this would definitely cause a failure.

On the other hand, if you built the kernel strictly based on “/proc/config.gz”, and “/lib/modules/$(uname -r)/” remains the same (along with the actual “uname -r”), then you should be able to add any modular feature desired by adding modules to that directory (the Image file itself would not even need to be replaced).
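A quick way to check the module directory case is sketched below. The helper name `check_modules` and its optional root argument are assumptions for illustration (the root argument is handy when the rootfs is mounted on another machine); testing for `modules.dep` is just a rough sign that the directory was populated by a module install.

```shell
# Report whether a populated module directory exists for a kernel release.
# Usage: check_modules [release] [root]
# Defaults: the running kernel's release and the live root filesystem.
check_modules() {
    release="${1:-$(uname -r)}"
    root="${2:-/}"
    dir="${root%/}/lib/modules/${release}"
    if [ -d "$dir" ] && [ -f "$dir/modules.dep" ]; then
        echo "ok: $dir"
    else
        echo "missing: $dir"
        return 1
    fi
}
```

Run with no arguments on the Jetson itself, or as `check_modules 4.4.38-tegra /mnt/rootfs` against a mounted copy (the mount point is only an example).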

Can you give more details in exactly what was changed in config, exactly which file(s) changed, so on?
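One way to answer that is to compare the original config against the one used for the build. The sketch below assumes the original was saved first with something like `zcat /proc/config.gz > config.orig`; the function name `config_changes` and the file names are made up for illustration.

```shell
# List CONFIG_ lines that differ between two kernel config files.
# "-" marks lines only in the old config, "+" lines only in the new one.
# Options that are unset ("# CONFIG_X is not set") are ignored, so an option
# that went from unset to =y shows up as a single "+" line.
# Usage: config_changes old_config new_config
config_changes() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    grep '^CONFIG_' "$1" | LC_ALL=C sort > "$old_sorted"
    grep '^CONFIG_' "$2" | LC_ALL=C sort > "$new_sorted"
    LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | sed 's/^/- /'
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+ /'
    rm -f "$old_sorted" "$new_sorted"
}
```

A feature switched from module to built-in shows up as a pair, e.g. a `-` line ending in `=m` and a `+` line ending in `=y` for the same option.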

Mostly the third one - rebuilt a kernel image to install on a working system with modules still in place and the same local_version. This was related to the kernel patches for the Intel Realsense D435 which we discussed on here: https://devtalk.nvidia.com/default/topic/1039371/auvidea-j120-and-intel-realsense-d435/#5290998

Some of the source was patched (USB UVC drivers), and some additional drivers related to Industrial I/O were enabled and marked to be compiled into the kernel, not as external modules.

I had success making this change and installing it with Linux 4.4.38-tegra on L4T 28.2.1 / Jetpack 3.2, but had this failure attempting to do the same with an existing system on 4.4.38-tegra on L4T 28.1 / Jetpack 3.1.

From the looks online, it seems that the 28.1 kernel version is supposed to be 4.4.15. It is unclear to me why this one was on 4.4.38, but that was the uname -r and thus the source tree I built against. Perhaps that was related to some version compatibility issue.

I have since moved forward using a working patched version of 28.2.1.

Thank you again!

Hal

When changing the base kernel source (including patches and integrated features), it is possible to invalidate various loadable modules. The existing modules might all work, or in the worst case none of them will. This is why I recommend in such a case changing CONFIG_LOCALVERSION and rebuilding all modules. This would probably be the next step: use the same source and configuration, but with CONFIG_LOCALVERSION changed, then build the kernel and modules.
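As a rough sketch of the first part of that, the helper below edits CONFIG_LOCALVERSION directly in a `.config` file. The function name `set_localversion` and the example suffix are made up, it assumes GNU `sed`, and the suffix must not contain `|` or `&`. Some build setups pass `LOCALVERSION=...` on the `make` command line instead, in which case that is the value to change.

```shell
# Set CONFIG_LOCALVERSION in a kernel .config file (hypothetical helper).
# Usage: set_localversion config_file suffix
set_localversion() {
    config="$1"
    suffix="$2"
    if grep -q '^CONFIG_LOCALVERSION=' "$config"; then
        # Replace the existing value; CONFIG_LOCALVERSION_AUTO is not touched
        # because the pattern requires "=" right after the option name.
        sed -i "s|^CONFIG_LOCALVERSION=.*|CONFIG_LOCALVERSION=\"${suffix}\"|" "$config"
    else
        echo "CONFIG_LOCALVERSION=\"${suffix}\"" >> "$config"
    fi
}
```

For example, `set_localversion .config -d435` followed by the usual `make Image modules` and `make modules_install` (with whatever cross-compile settings the build already uses). The new kernel then reports a different "uname -r", and its modules go into a matching new directory under "/lib/modules/", so the old module directory is left untouched.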

Changing a UVC driver probably would not be a problem for lightdm, but in odd circumstances it is hard to say.

R28.1 TX2 kernel is “4.4.38-tegra”. If you have 4.4.15, then I suspect you have the wrong kernel source. There is a significant chance that a difference between 4.4.15 and 4.4.38 causes some sort of device tree incompatibility if some driver ABI changed (or if the driver itself has a different provider, e.g., from NVIDIA versus from the stock 4.4.15 kernel).

Good point, I will try that out.

I didn’t have 4.4.15 anywhere, it was just a point of confusion I had since it was listed at Linux for Tegra R28.1 | NVIDIA Developer, but I suppose that wasn’t the issue. I suspect you are right that it was a module issue.