JetPack 4.6 (4.9.253-tegra) disk encryption via dm-crypt seems to lead to kernel deadlocks

Hi there!
We are building on the Xavier AGX platform with 8 GB of RAM and 6 CPU cores to provide both inference and model training on the same platform. Since we are shipping IP on the device, we encrypt an NVMe drive that stores our data. We currently use LUKS encryption with the cryptsetup utility and the following command line: sudo cryptsetup luksFormat /dev/nvme0n1. This seems to work and gives us an encrypted on-device disk for our data.
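For context, the rest of the flow is just the standard cryptsetup sequence, roughly like this (the mapper name, filesystem, and mount point below are placeholders, not necessarily what we actually use):

# Format the partition as a LUKS container (prompts for a passphrase)
sudo cryptsetup luksFormat /dev/nvme0n1
# Open the container; this creates /dev/mapper/cryptdata
sudo cryptsetup luksOpen /dev/nvme0n1 cryptdata
# Put a filesystem inside the container and mount it
sudo mkfs.ext4 /dev/mapper/cryptdata
sudo mount /dev/mapper/cryptdata /mnt/data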
We had never experienced any system lockups until we started encrypting the NVMe as described above. Note also that our swap partition lives on the encrypted NVMe.
Now we are experiencing random lockups of the Xavier every so often. It looks like a kernel deadlock, but we cannot be sure because all logs (and the device itself) simply stop. Only a hard reset recovers the system. The lockups occur both under load and while idle; no obvious triggering condition stands out to us.

Because our industrial system supplier ships their own BSP, we are currently tied to JetPack 4.6 on the Xavier. Are there any recommended instructions for encrypting the NVMe partition on the Xavier that ensure compatibility with the current 4.9.253-tegra kernel?

Thanks!
Matt

When you say there are no logs, does that mean you have a serial console recording from the time of failure? If not, then you should provide such a log. Even seeing the messages which show up just prior to the failure might be of use.

Also, do you have a locally attached keyboard? If so, then you could try some Magic SysRq key combinations.

There are ways to use this via serial console, but if the system is truly locked, then I think a locally attached keyboard has a better chance of working (though you can try both). With a serial console attached, use the locally attached keyboard and press this key combination:
ALT-SYSRQ-s
(SYSRQ is the same key as “PrintScreen”, but unshifted)

That combination tells the kernel to run “sync” on the disks. If it works, it results in log output that the serial console can see. Knowing whether this responds would be very useful. I recommend that you first try it on a healthy, running system with a serial console and local keyboard, just to see what the log output should look like (sync won’t really change much, but it will show that SysRq is working).

The full combination you could try (even if no logs show), just to see whether the system will still reboot:

# Sync twice:
ALT-SYSRQ-s
ALT-SYSRQ-s
# Remount all filesystems read-only:
ALT-SYSRQ-u
# Force an immediate reboot:
ALT-SYSRQ-b

For the serial console to work you must already be logged in and have run “sudo -s”, so that the serial console is attached to the root user before the issue occurs. The equivalent, if you are typing at the serial console and logged in as root, would be these commands:

echo s > /proc/sysrq-trigger
echo s > /proc/sysrq-trigger
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger

If those do not work, then you have a truly hard-locked system. Finding any output on the serial console says something might still be alive; a complete lack of serial console output means something truly disastrous has occurred.
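One more thing worth checking before concluding it is a hard lock: the SysRq functions can be masked by a sysctl. On a healthy system you can check and enable them with something like this (standard kernel paths, nothing Jetson-specific):

# Show the current SysRq mask (0 means disabled, 1 means all functions allowed)
cat /proc/sys/kernel/sysrq
# Allow all SysRq functions until the next reboot
echo 1 | sudo tee /proc/sys/kernel/sysrq
# Make the setting persistent across reboots
echo "kernel.sysrq = 1" | sudo tee /etc/sysctl.d/99-sysrq.conf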

When the issue occurs we generally have several SSH sessions connected to the device and are tailing multiple logs. The observation is that the system stops responding completely. You made a really good point about connecting to the system via the serial port. We disable the serial connection via these commands:

service nv-l4t-usb-device-mode stop
systemctl disable nv-l4t-usb-device-mode.service

Is there a way to re-enable the serial console? I’ve noticed that I cannot run the inverse of the above (service nv-l4t-usb-device-mode start), as the service no longer seems to exist on the system. I am also unable to find a corresponding package for the service. If you have any pointers to documentation, I’d much appreciate it. Otherwise I’ll re-flash JetPack.

In the last few days we found a tight loop that was leaking sockets at a fast and steady rate. We had noticed that the system would run up against the generic Ubuntu per-process ulimit, and we increased that ulimit to allow more open files. I think we may have starved the system of file descriptors: sshd would generally start failing in the logs before the system locked up completely, which matches our being unable to SSH into the system (ping would still respond, though). Our ulimit for the process in question is 65535 and the hard limit set by the kernel is 1048576. Socket starvation fits the symptoms as a root cause, so we are continuing to monitor; if it happens again, we’ll capture logs from the serial console and provide more information.
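To catch this earlier next time, we are also considering a simple watcher on the process’s descriptor count, something along these lines (the PID and interval are placeholders):

# Count open file descriptors for the suspect process (PID 1234 is a placeholder)
ls /proc/1234/fd | wc -l
# Or watch the count together with the per-process limit every 5 seconds
watch -n 5 'ls /proc/1234/fd | wc -l; grep "Max open files" /proc/1234/limits'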

Re-enable:

sudo systemctl enable nv-l4t-usb-device-mode.service

sudo systemctl enable nvgetty.service

(it might already start once enabled, but you could also “sudo systemctl start nv-l4t-usb-device-mode.service” or reboot)

Be sure that within the serial console you log in and also run “sudo -s” so that you have a root shell prior to any issue occurring.

Yes, resource leaks (such as sockets) would be fairly high on the list of probable causes. Within your serial console connection you could in fact run “dmesg --follow”, and it might show an indication of resource exhaustion. If the system then locks completely, even to the serial console, you’d at least have the last of the log. CTRL-C to get out of the log might still work, in which case you could use the echo commands for Magic SysRq.

Excellent - thanks for all the info! On my system it does not appear that the service exists any longer, and I am wondering which NVIDIA deb I need to reinstall to get it back:

$ sudo systemctl enable nv-l4t-usb-device-mode.service
Failed to enable unit: Unit file nv-l4t-usb-device-mode.service does not exist.
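A generic way to check whether any installed package still owns that unit file (plain dpkg/systemctl queries, nothing Jetson-specific):

# Ask dpkg which installed package (if any) ships the unit file
dpkg -S nv-l4t-usb-device-mode.service
# List whatever L4T-related unit files are still present
systemctl list-unit-files | grep -i l4t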

We’ll keep an eye out for the socket starvation and reduce our ulimit on file descriptors for the process again so we get an early indication of failure.
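For example, something like this in the process’s launch script, or the equivalent setting if it runs as a systemd service (the value 8192 is arbitrary, just low enough to fail before the rest of the system does):

# Cap the soft file-descriptor limit for everything started from this shell
ulimit -n 8192
# Or, for a systemd service, set this in the unit file under [Service]:
#   LimitNOFILE=8192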

Thanks!

My mistake. I copied your previous systemctl command, which was not for the serial console. Try this:

sudo systemctl enable nvgetty.service

Actually, I think I misunderstood - I was thinking about the micro-USB interface and called it the serial console. I’ll need to get a DB-9 to DB-9 female-female cable to be able to connect to the RS-232 serial interface. Is there a way to re-enable the micro-USB console connection? Usually, when I flash JetPack, I disable the nv-l4t-usb-device-mode.service, which turns off the micro-USB console connection on the unit. But I cannot restart it because the service file mentioned above is not found. Sorry about the confusion, and thanks for your help!

It is a serial UART protocol. The old DB-9 connector has not been used on anything for close to a decade. The old Jetson TK1 used it, and every model since then uses either (A) 0.1"-spaced header pins, or (B) a UART brought out directly through the micro-USB connector.

The AGX Xavier development kit does provide a serial UART over the micro-USB connector. First, monitor “dmesg --follow” on your host PC. Then, as you connect a micro-B USB cable between the Xavier and the host PC, log lines should appear. One of those lines will name the “/dev/ttySomething” device your host PC can use for the serial console.
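Once the device node shows up, any terminal program can attach to it, for example (ttyACM0 is only a common name on the host side; use whatever your dmesg output reports, and 115200 8N1 is the usual Jetson console setting):

# With picocom:
sudo picocom -b 115200 /dev/ttyACM0
# Or with screen:
sudo screen /dev/ttyACM0 115200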

If this is a custom carrier board, then it might also have a micro-OTG socket (which can take a micro-B USB cable), and it probably does. Some details might differ.

Got it - I’ll give it a shot - thanks!
