Cloning Jetson Issues

Hi,

We’ve been using the method mentioned below to back up and restore Jetson. We do this for production purposes, so we can write the same image to every Jetson, and it’s been working pretty well.
https://devtalk.nvidia.com/default/topic/1000105/jetson-tx2/tx2-cloning/

There is one odd thing that is happening, however. We have a custom library installed (overwriting the one originally there) into /usr/lib/aarch64-linux-gnu/libwhatever. Also, we have a custom config file overwriting one originally there inside of /etc/pulse/file.conf.

When we back up the Jetson and restore it to a new Jetson, the file we placed in /etc/pulse gets copied to the new Jetson as expected. On the other hand, our custom library file inside /usr/lib/aarch64-linux-gnu/ does NOT make it to the new Jetson. Instead the library from the Linux_for_Tegra/rootfs/usr/lib/aarch64-linux-gnu on the flashing computer winds up on the new Jetson.

Can anybody explain this, or does anybody have any ideas about what’s going on? We were under the impression that the APP partition was all of rootfs, and that restoring it using flash.sh with the -r option would not rebuild the system.img/system.img.raw, which is where the rootfs would be incorporated before flashing. We were hoping for a 1:1 copy of the entire APP partition.

Does anybody have any idea what went wrong? Are we doing something improper in the restore process or is our assumption of what the APP partition is incorrect?

Thanks in advance!

It should be an exact copy. There are times when a package update command will overwrite something you wanted to keep because the update does not know of your custom version. Or udev may be what changes it.

The first thing I would do is to take your custom/valid system.img.raw file, make it read-only on the host (“sudo chmod ugo-w some_file.img”), take a sha1sum of this (or just a crc32 if you want a faster checksum since we’re not doing this for security reasons), place a copy of your intended system.img.raw with name system.img in the “bootloader/” subdirectory, and flash. The flasher will “do the right thing” regardless of whether the file named system.img is raw or sparse…expect raw images to take a long time.

Do not boot the Jetson after flash (and this might be tricky since the raw image takes hours to flash)…we want the checksum to be unaltered and we do not want the journal mechanism to update time of last mount…then put the Jetson back into recovery mode and clone again. Get the sha1sum (or crc32) of the new clone image, and compare to the original. If they match and there was no boot, then boot up and see if the file changed. If they do not match, then I’d say something went wrong with the clone/restore.

In general we are just trying to prove that a restore from a clone and then a clone of the clone are exact matches without reboot. Then adding reboot and seeing if the file itself changed. You can of course loopback mount the raw clone images read-only and see what is there. If there is a difference then we can look at the original file names and see what packages might think they own the file. Even if an apt-get type command was not issued there may be other processes touching something, e.g., sometimes udev will change content in “/etc” dynamically regardless of the original package being in place.
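A minimal sketch of the "hash both images and compare" step described above. The function name images_match and the example file names are made up for illustration; sha1sum is the checksum tool already mentioned in this post.

```sh
#!/bin/sh
# Compare two clone images by SHA-1 digest.
# Prints both digests; returns 0 if identical, 1 if different,
# 2 if either file cannot be read.
images_match() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    im_a=$(sha1sum < "$1" | cut -d ' ' -f 1)
    im_b=$(sha1sum < "$2" | cut -d ' ' -f 1)
    printf '%s  %s\n%s  %s\n' "$im_a" "$1" "$im_b" "$2"
    [ "$im_a" = "$im_b" ]
}

# Example (hypothetical file names):
#   images_match original.img.raw reclone.img.raw && echo "round trip is exact"
```

A single differing bit anywhere in the partition changes the digest, so a mismatch only says that something changed, not what. Loopback mounting both images read-only is still needed to find the difference.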

FYI, you can log any kind of flash or clone via:

./whatever_command 2>&1 | tee some_log.txt

…sadly, progress bars are really a long line of text…in a file it just doesn’t backspace and overwrite…so a large part of the log content will be very very long progress bars.
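One caveat with a plain "command | tee log" is that the pipeline returns the status of tee, not of the flash or clone command. The sketch below is one way to keep the log and still see the real exit status. The name run_logged is hypothetical, and mktemp is assumed to be available on the host.

```sh
#!/bin/sh
# Run a command, copy its stdout and stderr to a log file via tee,
# and return the command's own exit status instead of tee's.
run_logged() {
    rl_log=$1
    shift
    rl_status_file=$(mktemp) || return 1
    {
        rl_rc=0
        "$@" 2>&1 || rl_rc=$?
        echo "$rl_rc" > "$rl_status_file"
    } | tee "$rl_log"
    rl_status=$(cat "$rl_status_file")
    rm -f "$rl_status_file"
    return "$rl_status"
}

# Example (placeholder command name, as in the post above):
#   run_logged some_log.txt ./whatever_command || echo "command failed"
```

The log will still contain the long progress bar lines; this only changes what the caller sees as the result.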

Yeah getting it exactly when it finishes w/o cold booting will be pretty tough. Though I see what you’re saying, basically just hash each image and see if it’s the same image. I feel like there’s so many things that could be slightly different though w/o repercussions, just 1 bit difference in a log file or something will change everything. It may be worth a try.

So I played with it today and got some more details on what exactly is going on, but it doesn’t explain ‘why’.
(every time I mention the library (libwhatever.so), assume /usr/lib/aarch64-linux-gnu/ prefix)

  1. The backed up image’s library (customized library) is 160KB.
  2. The library inside rootfs on the flash machine is 100KB.

Different attempts:

  1. If I flash it like above, then the new Jetson’s library file is 100KB (incorrect one, it should be the 160KB one that was on the original).
  2. If I REMOVE the library from rootfs, then the 160KB (correct) one from the image ends up on new Jetson
  3. If I RENAME the library in rootfs, then the renamed version in rootfs does NOT end up on the new Jetson, and the correct one (160KB) from the image ends up on the new Jetson

The timestamps on the system.img / system.img.raw are yesterday, so those aren’t being modified no matter what changes I make to the rootfs dir.

So basically what these results seem to say is that if this library path in rootfs matches the library path in the backup image, it overwrites it with the one from rootfs. If not, however, it does what we expected & uses the one from the backup image and does not add anything additional from the rootfs.

This is very odd behavior, and the even weirder part is that the config file in /etc/pulse is unmodified. The one from rootfs does NOT replace the one from the image. The correct one from the image is placed in new Jetson, as expected.

The workaround for the time being can be just to simply remove the library from rootfs, which results in the correct file being used, but not root causing the issue is obviously not ideal, as who knows what other files this may pertain to.

*EDIT:
I just re-flashed with both the rootfs library in place as well as the image’s in place, and the correct one wound up on the new Jetson. I might as well throw all my results out the window now. I have no idea what’s going on :(.

Flash should never touch the “rootfs/” subdirectory as a source when using the existing system.img. What you are describing does remind me of something though. If the ext4 mount has detected possible corruption, then it replays the journal in reverse and removes the changes which got to the corrupted state. It isn’t all that different from a database doing a rollback of a transaction.

Recreating things may remove the corruption. That corruption can exist on either the host or within the loopback mounted system (this latter is by far more likely…but if the host used inodes on its disk which somehow were corrupt and this transferred into the image then host side corruption could also get in the way), especially if the Jetson the clone started from was shut down improperly before cloning…the journal and the corruption would be cloned as well, and loading onto the Jetson would imply the journal and everything leading up to the corruption would be in place and ready for replay. If the system had the original file in place, then a copy of the new lib added which for some reason causes rollback of the journal, then you would get exactly what you see and it would only appear to have used the version from “rootfs/”.

FYI, when I go to study a cloned rootfs I generally mount with “-o ro” for read-only. If you know which loopback device is covering your image, then you can also run “sudo fsck.ext4 /dev/loop0” (or whichever loopback device it is). If the image file has not yet been covered by loopback, then you can “sudo losetup -f” to see the next unused loop device (sudo causes it to also be created if it didn’t exist), then cover the file with losetup and run fsck.ext4 on that loop device number.

It would be interesting to see if fsck.ext4 likes your loopback mounted image (and remember that whenever you mount this image on your host this too can alter the image by replaying the journal and mount count).

Hmm, the journals are a good point. I agree with the point that rootfs/ shouldn’t be touched after the image is made. I think what I saw was just a coincidence, as it seems given that I was using the same image, unchanged and unmodified, I saw the first flash give the 160KB version, and subsequent flashes have the 100KB version. I don’t believe I did anything to cause that change.

Also, it’s probably important to mention, some of these Jetsons aren’t new. Some are running old images (some very old) and what-not. So many of them aren’t getting flashed from “scratch” per-se, but rather flashed right on top of what was there. Could this have any impact? We were presuming it would not since we’re assuming it’s a byte-byte copy, but there’s no “erase” step that we should do or anything prior to flashing, right?

As for shutting down, yeah it should have been shut down correctly (but that was a week ago, can’t recall), and assuming so, the journals should have been flushed and cleared, I would think. Also, this same exact issue happened with the last TWO different back-ups, and they were done by separate people. I doubt both of us failed to shut the Jetson down properly both times.

I can do the fsck, but given this happened to the same file both times backups were made of this Jetson, it makes me think it’s something else. The /usr, /usr/lib, etc. dirs seem not to be their own partitions, so I don’t think it’s that. Otherwise there shouldn’t be anything special. I’m willing to bet though if I make another backup of that Jetson now and flash it the wrong one will be included. :-. I’ll give that a try and the fs check.

Flashing a clone (or even flashing from a generated system.img) is not a normal file copy…it creates (or reads) the file system on the host and does a bit-for-bit exact write of bytes into the partition without any knowledge of the file system. The flash process has no concept of corrupt data, corrupt files, truncation, so on. It is ext4 which understands those concepts. It isn’t possible to have part of the old file system still be there after a flash since the file system itself is overwritten…any bits which may have been from a partition originally larger than the current partition could be there, but the partition itself would know those bits are beyond its limits and would be unable to touch those bits.

It is highly likely that if the flash procedure was correct, then it was in the ext4 content that strange things occurred. I don’t know of a way to see if those previous clones needed fsck other than to cover the image with loopback and then fsck.ext4 on the loop device file. After this you can mount the loopback clone read-only and see if your files are as expected.

If your host file system is filled, then you might expect a truncated image. The truncated image may need fsck, and yet otherwise not tell you it was truncated. This is unlikely, but you may want to run “losetup --find” to see what the first available loop device is prior to a restore or regular flash…if it isn’t loop0, and if there is still something on loop0, then results might be strange (though likely there would just be a rejection due to failed loopback).

The flash script itself has incorrect code in picking the first loop device, and if the script has both loop0 available and loop1, and the clone is on loop1, it will pick whatever is on loop0.

Here is my suggested edit to flash.sh (perhaps this can be put into R28.2?):

build_fsimg ()
{
        echo "Making $1... ";
#       local loop_dev="${LOOPDEV:-/dev/loop0}";
        local loop_dev="$(losetup --find)";
        if [ ! -b "${loop_dev}" ]; then
                echo "${loop_dev} is not block device. Terminating..";
                exit 1;
        fi;
#       loop_dev=`losetup --find`;
        if [ $? -ne 0 ]; then
                echo "Cannot find loop device. Terminating..";
                exit 1;
        fi;

You can run into problems if you mix flash of a clone which is from a different release than the other partitions are from. Even if it appeared to work at first I’d expect something strange to occur later on. Some differences exist outright between R27.x and R28.x, e.g., the device tree is read in a different way and device tree itself would probably break if you mix a clone with another version of the rest of the flash. However, this would not lead to part of the old file system still existing…any file reversion would be part of the ext4 journal.

Yeah, I figured it was a dd-style copy that didn’t care about the FS, but I just wasn’t sure if it did any sort of “file checking” as part of the flashing process. Especially considering that config I mentioned inside /etc/pulse is never overwritten. The correct one is there every single time. It makes me think something is doing some sort of “checking” and not liking that library, and overwriting (or restoring) the old one or something. There’s nothing special about the path /usr/lib as far as I know, but who knows, perhaps something is checking it.

Mid-day yesterday every new flash seemed to have the 160KB version on it, so I figured perhaps the issue is resolved? I don’t know what to think at this point but right now knowing that I’ve gotten the 160KB library AND the 100KB library at different flashes throughout the day knowing that the system.img/system.img.raw were last updated the day before terrifies me. That’s literally different flashes with the same image yielding different results.

To remove permissions from the matter, I made sure permissions and ownership were set to same as the original (they originally were different, and not in a good way). Then I did a backup, making sure everything shut down properly etc. so that all the FS journals would be cleared correctly and everything should be good with the FS. Did the back up, did a restore on another Jetson today, and bam, 100KB file again :’(.

I can’t imagine this is a FS issue at this point. Nevertheless, I’ll see if I can run fsck on the image.

I wonder if there can be an issue with aptitude in Ubuntu. I’m going to look into that as well to make sure it’s not overwriting anything.

No space issues on host.

Yeah, there hasn’t been any mixing and matching of BSP flashes. It’s been backed up from R28.1 and to R28.1.

No checks are done during flash…if for example your host file system fills up and the image is truncated there will be absolutely no issue known until the Jetson boots and something is missing.

File copies within the host (and file copies are done during image generation, but not during clone) could be rejected if not using root (sudo). On some systems where SELinux is enforcing you will see problems even if you are using sudo…but the Jetson itself is not set to enforcing SELinux so any SELinux labels transferred to the Jetson will be ignored. To see more information on your SELinux setup these two commands may help (they may differ somewhat between distributions or versions of a given distribution):

cat /etc/sysconfig/selinux
# or:
sestatus

Typically SELinux errors will be noted in “/var/log/secure” or “/var/log/auth.log”. You can run “sudo tail -f” on whichever of those logs your host has during a flash and see if anything about permissions is noted…you will get notes about sudo, but those are expected.

If anything has loopback mounted the image, then it is possible a journal replay changed it. Once you have run fsck.ext4 on the loop device and the image is known to be clean, one thing you can do is chmod the image to read-only. Then do any examination with mount option “-o loop,ro” as well.

Note that ownership of the system.img does not matter during a flash since this is to be done only with sudo (and root can read anything SELinux doesn’t deny)…read-only file permissions can help to be sure your image is not modified when using a clone to restore or when examining via loopback mount.

It is difficult to check the eMMC on a running system. You can run this though and it won’t try to repair anything, but it may offer clues:

sudo fsck.ext4 -n /dev/mmcblk0p1

Hopefully part of the output will be something like this:

/dev/mmcblk0p1: clean, 356755/1572864 files, 4905977/6291456 blocks
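If this check is scripted, the summary line can be tested for the word "clean". A small sketch follows; the function name fsck_reports_clean is hypothetical, and the pattern is based only on the sample line above. The absence of "clean" is a reason to look closer rather than proof of damage, since other fsck options may print a different summary.

```sh
#!/bin/sh
# Read fsck.ext4 output on stdin and succeed only if it contains a summary
# line of the form "<device>: clean, N/M files, N/M blocks".
fsck_reports_clean() {
    grep -Eq ': clean, [0-9]+/[0-9]+ files, [0-9]+/[0-9]+ blocks'
}

# Example on the Jetson (read-only check, nothing is repaired):
#   sudo fsck.ext4 -n /dev/mmcblk0p1 | fsck_reports_clean && echo "looks clean"
```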

So what I’d suggest is take an image you think is fresh and clean. chmod to read-only. Mount it on loopback with “sudo mount -o loop,ro <image_name.raw> /somewhere”, see if the file is correct. Run “sudo fsck.ext4 -n /dev/loop0” (or whichever loop device it is for your case) and look for a declaration of it being clean. Before you restore from a clone make sure the command “losetup --find” shows “/dev/loop0”…otherwise it might be using a different previous loopback mount when creating new images (but this shouldn’t be an issue when re-using a cloned image).
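The loop device precondition can be turned into a guard that runs before a restore. This is only a sketch: first_loop_is_loop0 is a made-up name, and the optional argument exists purely so the check can be tried without touching real loop devices.

```sh
#!/bin/sh
# Succeed only when the first unused loop device is /dev/loop0.
# With no argument the device name comes from "losetup --find";
# a name can be passed explicitly to exercise the check itself.
first_loop_is_loop0() {
    if [ $# -gt 0 ]; then
        fl_dev=$1
    else
        fl_dev=$(losetup --find) || return 2
    fi
    if [ "$fl_dev" = "/dev/loop0" ]; then
        return 0
    fi
    echo "first free loop device is $fl_dev, not /dev/loop0" >&2
    return 1
}

# Example, before restoring:
#   first_loop_is_loop0 || echo "detach or unmount the stale loop device first"
```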

You can loopback mount a clone on the host read-write and add a note to the root to identify this exact clone regardless of which Jetson the clone went to. Example:

sudo mount -o loop /where/ever/it/is/clone.img.raw /mnt
sudo touch /mnt/clone_serial_number_1.txt
sudo umount /mnt

Use a different serial number on all clone sources. On a running Jetson from which you are about to create a clone do a bit of a handshake prior to creating the clone (I’m assuming you can call one Jetson “1”, another “2”, so on):

sudo touch /root_fs_clone_source_jetson_1.txt

…after cloning you would be able to confirm both “touch” files uniquely identify what you expect.
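A short sketch of that confirmation step, meant to run against a loopback mounted clone or on the restored Jetson itself. The function name clone_markers_present is hypothetical; the marker file names are the ones used in the examples above.

```sh
#!/bin/sh
# Check that a root file system carries both marker files: the one touched
# into the image on the host and the one touched on the source Jetson.
# Arguments: <root dir> <host marker name> <jetson marker name>
clone_markers_present() {
    cm_root=$1
    cm_rc=0
    for cm_name in "$2" "$3"; do
        if [ ! -e "$cm_root/$cm_name" ]; then
            echo "missing marker: $cm_root/$cm_name" >&2
            cm_rc=1
        fi
    done
    return "$cm_rc"
}

# Example with the clone mounted on /mnt:
#   clone_markers_present /mnt clone_serial_number_1.txt root_fs_clone_source_jetson_1.txt
```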

Another note on permissions: The loopback file itself, after it is initially created, will be populated via the “rootfs/”. If sudo was not used, then the permissions on the loopback image will be incorrect/invalid and the system will boot, yet have various odd failures (e.g., sudo won’t work). It is a good idea on any newly flashed system (regardless of whether it is restored from a clone or from a newly generated image) to see if “sudo ls” works.

So serials aside I did most of what you speak of before seeing this message. I think you’re right about fs damage now. On first boot of clone I see this from dmesg, which is worrisome - the last 2 lines happen every boot, but the 1st 2 happen only on the 1st boot, and obviously appear to be “fixing” things:

[    5.010495] EXT4-fs (mmcblk0p1): 13 orphan inodes deleted
[    5.015929] EXT4-fs (mmcblk0p1): recovery complete
[    5.028371] EXT4-fs (mmcblk0p1): mounted filesystem with ordered data mode. Opts: (null)
[    5.036490] VFS: Mounted root (ext4 filesystem) on device 179:1.

Now interestingly enough, I immediately flashed the same jetson (same image, everything the same, just re-ran the flash command) and on 1st boot of the clone it did NOT have those first 2 lines, and likewise it had the correct library from the image. So this definitely points to what you’re saying of a damaged fs replacing the library. Now there’s 2 mysteries: 1. Why is it getting corrupted, and 2. Why doesn’t it happen to all clones.
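Since those first two lines only show up on the first boot, they can be captured automatically. The sketch below filters kernel log text for them. The name ext4_recovery_seen is hypothetical, and the pattern is written from the two messages quoted above, so other recovery messages would not be caught.

```sh
#!/bin/sh
# Read kernel log text on stdin and print any ext4 lines that report
# orphan inode cleanup or journal recovery. Returns 0 if at least one
# such line was found, non-zero otherwise.
ext4_recovery_seen() {
    grep -E 'EXT4-fs \([^)]*\): ([0-9]+ orphan inodes? deleted|recovery complete)'
}

# Example on a freshly flashed Jetson, on its first boot:
#   dmesg | ext4_recovery_seen && echo "file system was repaired at mount time"
```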

I changed the permissions as read only and:

I ran fsck.ext4 on the image directly, ie.

sudo fsck.ext4 -f /path/to/image.img.raw

Then I mounted using:

sudo mount -o [ro,]loop /path/to/image.img.raw /mnt/dir

(tried with and w/o ro option)

If I ‘ls -l’ the file path in /mnt/dir/usr/lib/… It is the correct (160KB) one.

While still mounted I tried what you said by using fsck on the loop device, and did:

sudo fsck.ext4 /dev/loop0

Though it would not let me, it said I do not have permission to access, even when root, ie.

# fsck.ext4 -f /dev/loop0
e2fsck 1.42.13 (17-May-2015)
Warning!  /dev/loop0 is mounted.
fsck.ext4: Operation not permitted while trying to open /dev/loop0
You must have r/w access to the filesystem or be root

When the FS check is done on the raw image itself the report was that 0.1% is non-contiguous, and it did not say “clean” if I give the -f (force check) option, but it did not give me any errors, either. If I don’t provide the -f option, it says ‘clean’.
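Because the wording of the summary depends on the options used, the exit status of fsck.ext4 (e2fsck) is a less ambiguous signal. Per the e2fsck man page the status is a sum of bit flags. The sketch below decodes it; the function name describe_fsck_status and the short labels are made up for illustration.

```sh
#!/bin/sh
# Translate an e2fsck exit status (a sum of bit flags) into short labels.
describe_fsck_status() {
    dfs_code=$1
    if [ "$dfs_code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    dfs_msg=""
    if [ $((dfs_code & 1)) -ne 0 ]; then dfs_msg="$dfs_msg errors-corrected"; fi
    if [ $((dfs_code & 2)) -ne 0 ]; then dfs_msg="$dfs_msg reboot-needed"; fi
    if [ $((dfs_code & 4)) -ne 0 ]; then dfs_msg="$dfs_msg errors-left-uncorrected"; fi
    if [ $((dfs_code & 8)) -ne 0 ]; then dfs_msg="$dfs_msg operational-error"; fi
    if [ $((dfs_code & 16)) -ne 0 ]; then dfs_msg="$dfs_msg usage-error"; fi
    if [ $((dfs_code & 32)) -ne 0 ]; then dfs_msg="$dfs_msg cancelled"; fi
    if [ $((dfs_code & 128)) -ne 0 ]; then dfs_msg="$dfs_msg shared-library-error"; fi
    echo "${dfs_msg# }"
}

# Example (read-only check of an unmounted image, path is a placeholder):
#   sudo fsck.ext4 -n -f /path/to/image.img.raw; describe_fsck_status $?
```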

I then backed up the original Jetson again to create a new image. I then flashed this image to the clone Jetson with the images marked as RO. The clone didn’t show the issue. I’m not hopeful that this is a solid fix, though. Even if it was it doesn’t make sense how the system.img would get modified but the dates not modified.

So far it seems like consecutive flashes don’t display the problem, or something odd of that nature. I don’t get how any of this is possible given that the image is the same. I wouldn’t treat that as stone cold fact, however, just a trend I noticed. Seems like after I’ve gotten the 160KB library on a clone once, successive clones all have that one. Perhaps I’ll try hashing the fresh backed up image again.

So yeah, definitely seems like there’s something going on with the file system, I’m not sure why there would be orphaned inodes…the configuration script copies the library to that location using a regular “mv” command, not really any trickery being used… I don’t get how they’re broken in the clone but not the original given that it should be byte-byte copy.

I don’t believe clone itself is capable of any file system editing/corruption. I say this because it is a bit-for-bit “dumb” copy, and for any bit to fail the entire device probably needs to go bad…you would have far worse problems showing up if the clone itself were an issue.

However, the timing of the clone and any non-ro mount of the clone can change things. If you clone from a system which was not clean, then the clone will not be clean. If you have your image on your host and forget to use the “-r” just once, then the rootfs subdirectory will be copied into the image (if it survives…more likely it gets truncated of all data and a new file system completely erases it…invalid assumptions about which loop device to use can get past this and cause editing instead of truncation).

There are times when shutdown will fail to cleanly umount…you might watch the serial console during shutdown and keep the log until you know if the clone shows as clean and with the correct files. You can partially fight this by running “sync” prior to shutdown, but this reduces the life of solid state memory…it’s ok as a test, but I wouldn’t do this after every possible disk write issue the way I would with a regular hard drive.

Once the image is on your host, if you have mounted it in any way other than read-only, then anything locking the file system when it is forced to umount (such as if you cd to that mount point but don’t cd out before umount) would cause this.

The thing about a journal is that there will be no orphan nodes so long as the journal is replayed. Orphaned nodes tend to be the terminology of losing some unknown file system content as an emergency measure to avoid complete file system corruption…if you can replay the journal it reverses only the most recent changes…should you mount the ext4 partition as ext2 (meaning no awareness of journal replay), then you would indeed get orphaned nodes.

Yeah I think you’re right. I think there’s corruption that happens each time on the master Jetson and it’s being imaged to the clones. I don’t think the image is getting modified on the host because the date doesn’t change. Currently after each master back-up I’m hashing the image, then checking it after cloning to verify that it hasn’t changed, but I don’t think it is.

I agree that it’s much more likely there’s something wrong on the FS prior to backup, which is bit-bit copied to the clone, and the clone is recovering it on first boot.

Seeing as it’s able to bring the old file back, that makes me think it’s a journal issue, right? If not I can’t imagine where it’d be getting the old file from. The other aspect I don’t get is how this could happen, the configuration script we run on the master downloads the library, and copies it to /usr/lib with a standard “mv” command.

The sync call is a good idea, perhaps we can try that at the end of the configuration script, though that feels like a workaround to the root cause, as I’m not understanding why that would happen.

Since we’re not supposed to really run FS check on the running system, is there any way to look for orphaned nodes on the system, as part of error detection, prior to backup?

This mv command is on the host PC? Or on the Jetson? Either way this is a candidate because caching goes on unless the disk is mounted “-o sync” (which is a terrible idea on solid state memory…the life would go down fast…it’s just a big performance hit on regular drives). Anything truncating the operation or stopping the disk prior to write of the cache (which might happen several seconds later) would cause this issue.

The file going to the lib directory…you may want to checksum this at both the source and the destination. In the case of the destination, if this is on a loopback image, umount and remount the image before looking at the checksum. You might even consider running fsck.ext4 on the loop device since it would still be covering the file (this of course assumes it is a mounted file via loopback…perhaps it isn’t). The sync command would avoid many of those possible problems so long as the actual power to the disk or device accessing the image does not blink or go away (there’s a wild possibility…ever have brownouts or surges?).
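A sketch of the copy, sync, and checksum idea for a configuration script. It uses cp rather than mv so that the source is still there to compare against. That is a change from the script described in this thread, and the name install_and_verify is hypothetical. A read straight after the write may be served from cache, which is why the umount and remount step above matters for loopback images.

```sh
#!/bin/sh
# Copy a file into place, flush pending writes with sync, then compare
# SHA-1 digests of source and destination.
# Returns 0 if they match, 1 if they differ, 2 if the copy failed.
install_and_verify() {
    iv_src=$1
    iv_dst=$2
    cp "$iv_src" "$iv_dst" || return 2
    sync
    iv_a=$(sha1sum < "$iv_src" | cut -d ' ' -f 1)
    iv_b=$(sha1sum < "$iv_dst" | cut -d ' ' -f 1)
    [ -n "$iv_a" ] && [ "$iv_a" = "$iv_b" ]
}

# Example (the source path is a placeholder):
#   install_and_verify ./libwhatever.so /usr/lib/aarch64-linux-gnu/libwhatever.so
```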

Running fsck.ext4 on the loop device file covering the image, with the image not mounted, will allow you to actually recover in the case of a journal replay. If nodes are truly orphaned they end up in the “lost+found/” subdirectory (been a long time since I’ve seen nodes there…journals pretty much ended the usefulness of that).

If you create an SD card you can boot to, then you can run fsck.ext4 on “/dev/mmcblk0p1” since it won’t be mounted.

On R28.1 (and many others) you’ll see the kernel command line in extlinux.conf has “rw” in it…this is the note to the kernel to mount the root partition read-write. This can be replaced with “ro”, but you won’t be able to write anything at all…including temp files, so there are many possibilities for seeing errors not related to anything but the read-only status (e.g., .Xauthority files are needed for GUI login…you can’t write those in ro). I have this in my extlinux.conf and can select it with serial console at boot time (the only change from the default entry is “ro” instead of “rw”):

LABEL read_only
      MENU LABEL read_only
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/mmcblk0p1 ro rootwait rootfstype=ext4

Evolution and gnome-sof were reported to have loaded this library. I have no idea why, it’s a video library, but whatever, it says they are using it, and it wasn’t immediately obvious that they’d be using a video library like this. Also it still doesn’t answer why the same thing didn’t happen to the configuration file in /etc/pulse/file.conf. Same thing, just a ‘mv’ command to overwrite it and pulse would have been using that, too (although the file may be simply closed at that point, it won’t remain loaded like a library, and they probably don’t keep the handle around).

I’m wondering if the act of trying to overwrite it while it’s loaded from those background processes is what’s causing the issue? I wouldn’t think so because I feel something like that at worst would just cause those processes to crash, though maybe there’s more to it.

Another interesting point is that not every time the library reverts to the original version do I see the error in the kernel about the FS damage. There was one Jetson which I booted for the first time which had the original library (100KB) but it did not have the FS recovery message I posted above.

So I tried this method 2x, it’s not to say it’s a definite thing, but I added a pause to the config script right before it copies the lib, then when it does I make sure those processes (evolution and gnome-sof) are not using it, and if they are, I’ll kill them. Then proceed to the copy and continue. In the 2 attempts (each attempt was fresh flash, backup, then restore new Jetson) neither one gave me the old file. Does this mean it’s resolved? I don’t know, maybe? The solution seems sloppy to me, but what the heck else am I supposed to do if it’s being used?

Does this sound feasible? Or does it sound like I’m off in the weeds here and this has nothing to do with the issue. I’ve never seen a file restored after shut down quite like this.

It seems like having two processes write the same file at the same time would have a safety mechanism, but if they were different threads of the same process, then perhaps this could occur even with mechanisms in place to protect it. It’s just conjecture, there really isn’t much of a way to know without having the drivers and the programs all in a debugger at the moment it occurs.

You can use the “fuser” command on files to see who is accessing them, and also to identify the process doing the access (see “man fuser”). You could even get tricky with it and have it log what it finds at each shutdown and call sync.
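Where fuser is not installed, a rough substitute on Linux is to look through each process’s memory map for the library path. The sketch below does that. It is not fuser, only an approximation; the name pids_mapping_file is made up, and the second argument exists only so the function can be pointed at a fake directory tree. Seeing other users’ processes needs root, and the match is a plain substring match on the path.

```sh
#!/bin/sh
# Print the PID of every process whose memory map mentions the given path.
# Arguments: <file path> [proc root, defaults to /proc]
pids_mapping_file() {
    pm_path=$1
    pm_proc=${2:-/proc}
    for pm_maps in "$pm_proc"/[0-9]*/maps; do
        [ -r "$pm_maps" ] || continue
        if grep -qF -- "$pm_path" "$pm_maps" 2>/dev/null; then
            pm_dir=${pm_maps%/maps}
            echo "${pm_dir##*/}"
        fi
    done
}

# Example, logging users of the library before replacing it:
#   sudo sh -c '. ./this_file.sh; pids_mapping_file /usr/lib/aarch64-linux-gnu/libwhatever.so'
```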

Historically, concurrent access issues have been some of the most difficult to solve. You might be right, but often these conditions only occur momentarily and at odd moments. The lack of ability to consistently cause the condition is a big part of the pain in figuring it out. One day some odd clue may make everything obvious. An old colleague had a technical term for that…“Well, duh!”. That’s the usual phrase when finally figuring out something extremely difficult and finding it was actually something quite trivial.