Locked out after booting

During a recent power outage, the Jetson AGX Orin Devkit (64GB) (Ubuntu 22) was in a suspended state. After power was restored, the device failed to boot properly: it remained stuck on a black screen and was unresponsive to any input for an extended period.

I attempted to recover the system using both recovery and safeboot methods, as well as direct booting from the boot menu, but none of these methods were successful. Since most of my data was stored on an external SSD mounted to my device, I decided to proceed with reflashing the device following the instructions provided on the official NVIDIA flashing support website.

The flashing process completed (multiple times), but upon reboot, I found myself locked out of the Jetson. None of the USB or Ethernet ports were functional—only the DisplayPort was working. Additionally, I noticed repeated error messages related to the SSH service during boot.

(During flashing with NVIDIA SDK Manager, the following error also appeared once:

“Error: unable to read rcm_state”

I tried many different USB ports and cables, but the flash finishes regardless. I also tried to get rid of this error by entering recovery mode in different ways (power off/power on).)

Given the current state of the device, I am unable to interact with it meaningfully or proceed with further troubleshooting.

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, I’ll attempt to provide a comprehensive answer to the issue you’re facing with your Jetson AGX Orin Devkit (64GB).

Summary of the issue:

Your Jetson AGX Orin Devkit (64GB) was in a suspended state during a power outage. After power restoration, the device failed to boot properly, and you were unable to recover it using various methods, including recovery, safeboot, and direct booting from the boot menu. You then attempted to reflash the device using the official NVIDIA flashing support website, but the process completed with errors, and upon reboot, you found that only the DisplayPort was working, while USB and Ethernet ports were non-functional. Additionally, you noticed repeated error messages related to the SSH service during boot.

Comparison with similar issues:

The issues you’re facing seem to be related to the ones mentioned in the context, specifically:

  1. GPU fallen off the bus: This issue is similar to the one you’re facing, where the GPU is not functioning properly after a power outage.
  2. GPU hang on suspend: This issue might be related to the fact that your device was in a suspended state during the power outage.
  3. Kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057!: This issue is mentioned in the context, and it’s possible that it’s related to the errors you’re seeing during boot.

Potential causes:

Based on the information provided, it’s possible that the issue is related to:

  1. Power management: The power outage might have caused a problem with the device’s power management, leading to the GPU falling off the bus or hanging on suspend.
  2. Flashing process: The error message “Error: unable to read rcm_state” during the flashing process might indicate a problem with the flashing process itself.
  3. Hardware issue: It’s possible that there’s a hardware issue with the device, such as a faulty GPU or a problem with the USB or Ethernet ports.

Recommendations:

  1. Verify the connection and try a different flashing method: Use the lsusb command on the host to confirm that the device is visible over USB in recovery mode, then try flashing with a method other than SDK Manager.
  2. Check the power management: You can try to check the power management settings on the device to see if there’s anything that can be adjusted to prevent the GPU from falling off the bus or hanging on suspend.
  3. Contact NVIDIA support: Given the complexity of the issue, it might be best to contact NVIDIA support for further assistance. They may be able to provide more specific guidance or repair/replace the device if it’s a hardware issue.
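
The lsusb check from recommendation 1 could be sketched roughly as below. This is only an illustration, assuming a Linux host with lsusb installed. The function names are made up for this example. The only fixed value is 0955, the USB vendor ID registered to NVIDIA; the product ID differs between Jetson modules, so it is deliberately not checked.

```python
import re
import subprocess

NVIDIA_VENDOR_ID = "0955"  # USB vendor ID registered to NVIDIA Corp.


def find_nvidia_devices(lsusb_output: str) -> list[str]:
    """Return the lsusb lines whose vendor ID is NVIDIA's.

    Only the vendor ID is matched; the product ID is left open because
    it varies between Jetson modules.
    """
    pattern = re.compile(r"\bID\s+" + NVIDIA_VENDOR_ID + r":[0-9a-fA-F]{4}\b")
    return [
        line.strip()
        for line in lsusb_output.splitlines()
        if pattern.search(line)
    ]


def nvidia_devices_on_host() -> list[str]:
    """Run lsusb on the host and filter its output (hypothetical helper)."""
    result = subprocess.run(
        ["lsusb"], capture_output=True, text=True, check=True
    )
    return find_nvidia_devices(result.stdout)
```

If nvidia_devices_on_host() returns an empty list while the Jetson is supposed to be in recovery mode, the host is not seeing the device at all, which points to the cable, the port, or the recovery-mode procedure rather than the flash itself.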

Additional information needed:

To better assist you, it would be helpful to know:

  1. The exact error messages: You mentioned repeated error messages related to the SSH service during boot. Providing the exact error messages might help identify the issue.
  2. The flashing process details: You mentioned that the flashing process completed with errors. Providing more details about the flashing process, such as the exact commands used and the output, might help identify the issue.
  3. The device’s configuration: Providing more information about the device’s configuration, such as the Linux version and any customizations made, might help identify the issue.

What is the error message? Could you provide a log for us to review?

The “Error: unable to read rcm_state” message seems harmless if you see it in SDK Manager.

Thanks

Errors during boot:

  • “Failed to start connect NVME-oF subsystem automatically during boot”
  • “Failed to start OpenBSD Secure Shell server”
  • Additional block errors at the beginning, which scroll past too quickly to read, but I think they are the same errors displayed multiple times
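
Since the early messages scroll by too fast to read, one option is to save the boot output to a text file (for example from a serial console session) and count the failures afterwards. The sketch below assumes the failures appear as systemd-style "Failed to start ..." lines like the two quoted above. The function name is hypothetical.

```python
import re

# Strips ANSI color codes that systemd adds around [FAILED] / [  OK  ].
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# Captures the unit description after "Failed to start", minus a trailing period.
FAILED_LINE = re.compile(r"Failed to start\s*(.+?)\.?\s*$")


def count_failed_units(log_text: str) -> dict[str, int]:
    """Count how often each "Failed to start ..." message occurs in a log."""
    counts: dict[str, int] = {}
    for raw_line in log_text.splitlines():
        line = ANSI_ESCAPE.sub("", raw_line)
        match = FAILED_LINE.search(line)
        if match:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Running count_failed_units() on the saved text shows whether the unreadable messages really are the same failures repeated, or something different.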

After boot I only see the “Welcome to ubuntu 22.04” window, and everything is unresponsive (I tried every USB port and multiple devices). I also connected over micro-USB and used screen to reach the device, but I did not manage to log in at the localhost.localdomain prompt with nvidia/nvidia, with my own user created during pre-configuration, or with ubuntu/ubuntu.

I have now also confirmed that the keyboard works before the boot finishes (I can get into the different boot menus, and the mouse light is on), but as soon as the boot finishes, the mouse light goes off and everything becomes unresponsive.
I also reflashed the card with the new JetPack version.

You will probably need a serial console boot log to attach to this forum thread. This would say a lot. Also, even if the keyboard does not work for normal login, it is possible it will work over serial console which can save a lot of effort in any repair. If you do flash again, then to have more logging, remove any occurrence of the word “quiet” in “/boot/extlinux/extlinux.conf” (this parameter only matters for stages before the Linux kernel loads, but this is also where many errors might be noted).
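
As a rough sketch of the "quiet" removal: the helper below drops the standalone word "quiet" from the APPEND lines of an extlinux.conf and leaves everything else alone. Restricting the change to APPEND lines is an assumption made for this example (the advice above says any occurrence), and the function name is made up. It only transforms text and writes nothing to disk, so the result can be reviewed before it replaces the real file.

```python
import re


def strip_quiet(conf_text: str) -> str:
    """Return conf_text with the word "quiet" removed from APPEND lines."""
    result = []
    for line in conf_text.splitlines(keepends=True):
        if line.lstrip().upper().startswith("APPEND"):
            body = line.rstrip("\r\n")
            ending = line[len(body):]  # preserve the original line ending
            # Remove "quiet" only as a standalone, whitespace-delimited word.
            body = re.sub(r"(?<!\S)quiet(?!\S)", "", body)
            # Collapse the gap left behind, keeping the leading indentation.
            body = re.sub(r"(?<=\S) {2,}", " ", body).rstrip()
            line = body + ending
        result.append(line)
    return "".join(result)
```

For example, reading /boot/extlinux/extlinux.conf, passing its contents through strip_quiet(), and comparing the result with the original shows exactly which boot arguments would change.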