Recovery Mode

ben.kawecki · June 18, 2019, 7:35pm

Hello,

I accidentally put a program that hangs in the /etc/rc.local file, with the exit 0 return after it. Because of this my Jetson won’t boot into Ubuntu and I can’t edit the file. I’ve tried holding the regular Ubuntu keys during the BIOS load but they didn’t work. What’s the best way to get shell access to the box without loading the /etc/rc.local file?

Ben

linuxdev · June 18, 2019, 9:56pm

Embedded systems don’t have a BIOS. Any halt of boot would be via U-Boot, but Xavier skips U-Boot and goes directly from CBoot to Linux.

Serial console is the definitive way to get in when all else fails. It is possible for this file to interfere even with serial console, but it is unlikely to block it.

On the same side of the Xavier where the 40-pin header is, note the type-C USB on the right, and a micro-USB on the left as you face the header. This micro-USB, when it has a micro-B type cable inserted (and this is what NVIDIA provides) to a non-recovery mode Xavier, offers some “devices” if the other end is connected to your PC. I don’t know what this will be identified as on your PC, but if you monitor “dmesg --follow” as you connect this it should tell you. On one of my systems this shows up as “/dev/serial/by-id/usb-FTDI_Quad_RS232-HS-if03-port0”.

Any serial console program should work, but I like gtkterm (“sudo apt-get install gtkterm”). The proper settings are speed 115200, 8 bits, no parity, one stop bit. Hardware flow control is not used. For gtkterm this would be:

sudo gtkterm -b 8 -t 1 -s 115200 -p /dev/serial/by-id/usb-FTDI_Quad_RS232-HS-if03-port0

Note that “sudo” is only required if your PC user is not a member of the “dialout” group. An admin user is probably already a member of this and won’t need sudo. Try without sudo first, and only if you get a permission denied error try with sudo.
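The group check described above can be sketched as follows. This is a minimal sketch that assumes a stock Ubuntu host where serial devices are owned by the “dialout” group; the helper name in_group is made up for illustration.

```shell
#!/bin/sh
# Report whether the current user belongs to the named group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group dialout; then
    echo "member of dialout: sudo should not be needed"
else
    echo "not in dialout: run 'sudo usermod -aG dialout \$USER', then log out and back in"
fi
```

A new group membership only takes effect at the next login, so sudo is still needed until then.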

This console will show boot messages, and after boot will offer a text mode login. The user logged in as will be the admin user, typically user “ubuntu” on earlier releases (or whichever user you added on first boot). You can then use “sudo” to mv the rc.local somewhere else, and “sudo touch /etc/rc.local” to create an empty file in its place.

ben.kawecki · June 19, 2019, 2:19pm

Thanks for the response. I went ahead and tried to connect over serial via the micro-USB. I was able to get a terminal running; however, it seems that the terminal needs CBoot to exit before it can do anything. Since the rc.local file runs at the end of CBoot, the serial terminal never gets access.

Do you think there’s a way to mount the file system on another computer? I vaguely remember mounting the file system on the host computer during the setup process. If I can mount the file system then I can make the necessary changes. I’m just unsure what putting my device into forced recovery mode will do.

linuxdev · June 19, 2019, 9:37pm

rc.local never runs until the rest of multi-user.target finishes. If CBoot is failing, then it won’t be due to rc.local. I would be curious to see what logging is visible from serial console.

Although you cannot directly mount the filesystem on another computer, you can clone the rootfs (which produces both a mountable raw loopback image and a non-mountable sparse image), edit that clone on the PC, then reflash using the edited clone as rootfs instead of regenerating a default rootfs. Be warned that this takes a lot of time and disk space.

The short story is that if you clone and name the clone file “backup.img”, then you will also get “backup.img.raw”. Throw away the “backup.img” and keep the “.raw” version. See:
https://devtalk.nvidia.com/default/topic/1048747/jetson-agx-xavier/cloning-xavier-with-jetpack-4-2/post/5322540/#5322540
https://devtalk.nvidia.com/default/topic/1039548/jetson-agx-xavier/xavier-cloning/post/5330276/#5330276

As to the actual clone operation, basically you put the Xavier in recovery mode with the correct USB cable just like you are going to flash. If you’ve run JetPack or SDK Manager before, then you will have a “Linux_for_Tegra/” subdirectory in there, and that directory will have “flash.sh”. From that location this is the basic command for clone:

./flash.sh -r -k APP -G "backup.img" jetson-xavier mmcblk0p1

Remove “backup.img”. You can then loopback mount “backup.img.raw” (example is on “/mnt”, but you can pick somewhere else):

sudo -s
mount -o loop backup.img.raw /mnt
cd /mnt
# Explore
ls
# Go to rc.local:
cd etc
cat rc.local
# Edit this or do as you want, save.
# You can't umount if you are in that directory:
cd /where/ever/flash.sh/is
umount /mnt

Now if you want to flash this edited clone:

# This will take a lot of time...the file is around 28GB...two copies implies 56GB...check "df -H ." prior to doing this so you know how much space your host PC has:
cp backup.img.raw ./bootloader/system.img
# Reboot the Xavier back into recovery mode...you can't clone and flash without reboots between.
sudo ./flash.sh -r jetson-xavier mmcblk0p1

(the “-r” is very important)
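The free-space check mentioned in the comment above can be automated before the copy. This is only a sketch: it assumes GNU coreutils on the host (“stat -c” and “df --output”), and the function name has_room_for_copy is hypothetical.

```shell
#!/bin/sh
# Succeed if the filesystem holding directory $2 has room for a copy of file $1.
has_room_for_copy() {
    need=$(stat -c %s "$1")                                        # file size in bytes
    avail=$(df --output=avail -B1 "$2" | tail -n 1 | tr -d ' ')    # free bytes
    [ "$avail" -gt "$need" ]
}

# Usage on the host, from Linux_for_Tegra/:
#   has_room_for_copy backup.img.raw ./bootloader && cp backup.img.raw ./bootloader/system.img
```

With “&&” the copy is skipped when there is not enough room, instead of failing partway through a very large file.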

Note that if the rootfs is a non-default size you may need to explicitly set the size. Size is set either in “MiB” (1024 x 1024) or “GiB” (1024 x 1024 x 1024). If your backup.img.raw is evenly divisible by 1024 three times, then this is GiB, or if only evenly divisible twice by 1024, then it is MiB (other values are not valid). So if my image is 28GiB I could explicitly state this:

sudo ./flash.sh -S 28GiB -r jetson-xavier mmcblk0p1
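The divisibility rule above can be worked out by the shell instead of by hand. A minimal sketch, assuming the size is given in bytes; the function name flash_size_arg is made up for illustration.

```shell
#!/bin/sh
# Print the value to pass to "-S" for a raw image size in bytes,
# or fail if the size is not a whole number of MiB.
flash_size_arg() {
    bytes=$1
    mib=$((1024 * 1024))
    gib=$((1024 * 1024 * 1024))
    if [ $((bytes % gib)) -eq 0 ]; then
        echo "$((bytes / gib))GiB"
    elif [ $((bytes % mib)) -eq 0 ]; then
        echo "$((bytes / mib))MiB"
    else
        echo "size $bytes is not a multiple of 1 MiB" >&2
        return 1
    fi
}

# On the host: flash_size_arg "$(stat -c %s backup.img.raw)"
flash_size_arg 30064771072   # 28 * 1024^3 bytes, prints 28GiB
```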

ben.kawecki · June 25, 2019, 2:25pm

Hmm, so the only change I made before when it errored was adding my code to the /etc/rc.local file. I essentially changed my rc.local file to be this:

#rc.local
python my_script_that_hangs.py
exit 0

The output during bootup was exactly the same as normal, except that at the end it showed the logging to stdout that my script normally produces.

Sadly, I didn’t have time to try to recover using the way you proposed. We had a deadline coming up, so I just reflashed the OS and reinstalled everything from scratch. If you want, I can probably build a script similar to mine that will cause the boot to hang, in case anyone wants to duplicate the error.

linuxdev · June 25, 2019, 8:24pm

Btw, this format is blocking if the python script blocks:

python my_script_that_hangs.py

Add a “&” to the end to let the script run in the background without blocking:

python my_script_that_hangs.py &
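The effect of the “&” can be seen with a short demo. This is only an illustration: “sleep 30” stands in for a script that does not return, and the variable names are arbitrary.

```shell
#!/bin/sh
start=$(date +%s)
sleep 30 &                 # stand-in for the hanging script, run in the background
job_pid=$!
end=$(date +%s)
elapsed=$((end - start))   # stays near 0 because the shell did not wait for the job
echo "shell continued after ${elapsed}s while job ${job_pid} keeps running"
kill "$job_pid"            # clean up the demo job
wait "$job_pid" 2>/dev/null || true
```

Without the “&” the shell would sit on the “sleep 30” line, which is how a blocking command in rc.local holds up everything that comes after it.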

nihohi · January 10, 2020, 8:10am

I faced a similar situation: my Xavier would not boot after I modified settings in fstab.
The commands above were useful for making a backup. Thanks.

I noticed one thing about writing the image back, though.
If the image was backed up with the -k APP option, the flash command should also include -k APP.
So it should be

sudo ./flash.sh -r -k APP jetson-xavier mmcblk0p1

instead of

sudo ./flash.sh -r jetson-xavier mmcblk0p1

At first I backed up a root file system (rfs) image from the Xavier with the -k APP option, modified it, and wrote it back to the Xavier without the -k APP option.
But it didn’t boot.
Probably this is because I wrote the rfs to a different area in mmcblk0p1.
flash.sh changes the area it writes to when used with the -k APP option, like this:

APP_TAG+="-e s/size=1073741824/size=${rootfssize}/ ";

So I concluded that when I use the -k APP option to back up the rfs, I should also use -k APP when I write it back to the rfs area.

Thanks,

linuxdev · January 10, 2020, 6:40pm

Were you trying to clone or write? This is the command for clone:

sudo ./flash.sh -r -k APP -G my_backup.img jetson-tx2 mmcblk0p1

Note that if your image is not the default size, then you probably need to specify the size during the flash. For the raw image (clone gives you both a “backup.img.raw” and a “backup.img”, where “.img” is sparse and “.img.raw” is raw) you should be able to divide the size twice or three times by 1024. If the number is evenly divisible only twice, then that is the size in MiB; if you can divide evenly three times by 1024, then that is the size in GiB. As an example, if you cloned and got backup.img.raw with a size of 28GiB:

sudo ./flash.sh -S 28GiB -r jetson-xavier mmcblk0p1

I do not know if the APP_TAG you used is correct or not…perhaps this was valid.

nihohi · January 21, 2020, 6:37am

Thank you for your comments.

I tried both the clone and the write.
I used this for the clone, and it works well:

sudo ./flash.sh -r -k APP -G "backup.img" jetson-xavier mmcblk0p1

And I used this for writing, after which the Xavier booted in my environment:

cp backup.img.raw ./bootloader/system.img
sudo ./flash.sh -r -k APP jetson-xavier mmcblk0p1

I used the default size, so I didn’t use the “-S” option.

visionarymind111 · January 8, 2021, 7:02pm

What options are available when the Xavier AGX is not responsive on /dev/serial/by-id/usb-FTDI_Quad_RS232-HS-if03-port0? Dmesg shows the ports becoming available, and I am able to connect with gtkterm (and minicom), but there is only a black screen. Has NVIDIA provided any other ways to access these devices once they have become unusable due to, for example, improper fstab configuration?

linuxdev · January 8, 2021, 7:40pm

If you have lost all access (I saw the other fstab thread, not sure if that was yours), then you would clone, edit the clone via loopback on the host PC, and then flash the repaired clone back onto the Jetson. In some cases you might get a bootable SD card as a rescue and edit directly on the Jetson. Clone is slower, but clone is king since it is a permanent backup in case something goes wrong later on.

visionarymind111 · January 8, 2021, 8:00pm

It is mildly shocking that one minor change to fstab turns a powerful Xavier AGX into a useless paperweight, with no recourse but a prolonged and painful clone process. So I now have three options, it sounds like:

1. Re-flash the AGX and spend another two weeks re-building and hacking libraries that should rightly be compatible with this hardware.
2. Clone the rootfs, fix the issue in /etc/fstab, then attempt to flash it back to the device. Sounds good on paper, but is likely to be as big a headache as re-building things.
3. Send the device back to NVIDIA and go and purchase a proper Intel NUC with desktop architecture that will work out of the box.

I am leaning toward #2 but could be convinced of #3 if I spend the entire weekend getting this up and running again. I am not at all impressed by this hardware.

visionarymind111 · January 8, 2021, 9:06pm

Update here. I looked into option #2 (cloning rootfs on Xavier AGX), and it does not appear to be possible once the fstab has been changed. Such a change prevents it from booting properly, and I do not see that there is a way to reset the eMMC to even re-flash the unit. To anyone reading, be aware that these devices are permanently bricked by any ill-conceived fstab changes. There is no possibility for serial connection, cloning the rootfs, nor even re-flashing.

@linuxdev, I took a look at your instructions here, but they depend on the ability to connect to the Jetson via the USB-C in SDKManager, unless I have misunderstood something. Without being able to boot, the Jetson is not visible to a Linux host over either USB-C or the micro-USB serial connection, nor is the eMMC accessible using any other method.

So it looks like I am stuck with #3, unless someone here has an idea how to save this device.

dusty_nv · January 8, 2021, 9:59pm

To clone the device, enter it into recovery mode - this is different from booting. It will then show up under lsusb to a Linux PC and you can clone it. Use the L4T tools to clone (i.e. flash.sh), not SDK Manager - or use the L4T tools that SDK Manager downloaded/extracted when you initially flashed the device (SDK Manager itself doesn’t do cloning)
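A quick way to do the lsusb check from the host is sketched below. It assumes NVIDIA’s USB vendor ID 0955 and deliberately does not match a product ID, since that differs between Jetson modules; the helper name nvidia_usb_present is hypothetical. A normally booted Jetson connected over USB can also enumerate with this vendor ID, so a match is a hint and not proof of recovery mode.

```shell
#!/bin/sh
# Read lsusb-style text on stdin; succeed if any device has vendor ID 0955.
nvidia_usb_present() {
    grep -qi 'ID 0955:'
}

if command -v lsusb >/dev/null 2>&1 && lsusb | nvidia_usb_present; then
    echo "NVIDIA USB device found: the Jetson is probably in recovery mode"
else
    echo "no NVIDIA USB device seen: recheck the cable and the recovery button sequence"
fi
```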

visionarymind111 · January 8, 2021, 10:59pm

Sorry for not making it more clear earlier, but I have been putting the AGX into recovery mode, and it reaches a point in the boot process where it freezes on “Starting D-Bus System Message Bus”, cycling every minute, attempting to initialize the system, and then repeating the message, endlessly. During this cycle, the AGX does not appear to a Linux PC under lsusb. It is invisible to the Linux system. I presume this is because it does not make it far enough into the boot process to invoke the drivers.

Is there any way to at least perform a “factory reset” on this device? It is essentially dead at this point, from nothing more than an fstab modification.

visionarymind111 · January 8, 2021, 11:48pm

I finally got it working. Thank you all for pointing me in the right direction. The problem was that the SD card with the improper fstab mount needed to be removed before the AGX would go into full recovery mode. Everything is back to normal now. I just received a 1TB NVMe SSD, and will be using that for boot from now on.

linuxdev · January 10, 2021, 12:16am

This is usually true for all operating systems. Rescue differs across systems, and this is longer and more involved in the embedded world, but the Xavier is not actually broken, only the software is broken. Adding such support in hardware would require more cost, more weight, larger physical size, and increased power consumption. Such changes would be a benefit only when something goes wrong. Yes, rescue is more difficult without those changes, but it isn’t worthwhile for most people. fstab is something requiring root authority to change, and as root one has to be careful.

I have gone with the clone and fix many times, often just as practice (remember the saying that a backup plan without testing is not a backup). It is actually rather simple once you see how it works. The time needed to transfer that much data, and making sure your host PC has enough room, are also problems, but if you are past that, then the result is quite consistently good.

Embedded systems exist for a reason, but the limitations can be severe in comparison to a desktop PC system. If you don’t need improved weight, or physical size, or lower power consumption (and thus easier cooling), then you could do well with a NUC. As soon as you need support for some specific purpose with any of those above requirements it becomes time to go with an embedded solution. There is a learning curve though.

Clone does require a USB-C cable to the host PC. One would use command line to perform the clone, and although JetPack/SDKM installs this, the actual content is called the “driver package”, and JetPack/SDKM is only a front end installer and operator of that command line content. Cloning on command line without running JetPack/SDKM is the way to go and makes this much simpler. You must be sure the Jetson is in recovery mode though (the clone will work no matter what content the eMMC has…fstab or entirely wiped out partitions will in no way prevent a clone unless the partition is actually missing). When a Jetson is in recovery mode it becomes a custom USB device which the driver package understands…and there is no dependence on any flashed content. When the Jetson is not in recovery mode there is no way for the driver package to talk to a Jetson. Recovery mode is not a way to turn a Jetson into a mass storage device (which is a standard USB device and not a custom USB device).

The micro-B USB cable for serial console is not needed for flash. Often it is requested, but only if logs are needed for figuring out why a flash (or clone) failed.