mksparse behaving incorrectly

Hello,
I’m having trouble flashing my TX2s from one workstation, yet flashing works from another. Both workstations run Ubuntu 16.04 with no obvious configuration differences. I’ve tracked the problem down to mksparse behaving differently between the two. On both workstations system.img.raw is created with an identical size, but on the non-working workstation the resulting sparse system.img is much smaller than on the working one. In both cases --fillpattern=0 is used. The non-working run shows no errors; the only difference I see is that far fewer blocks are written. Both workstations have over 450 GB of free space.

Results from non-working workstation:
-- Total: -----------------------------------------------------------
88 CHUNK 30064771072(7340032 blks) ==> 24233020(5916 blks)

done.
system.img built successfully.

Results from working workstation:
-- Total: -----------------------------------------------------------
2800 CHUNK 30064771072(7340032 blks) ==> 3376493404(824331 blks)

done.
system.img built successfully.

Does anybody have any ideas of what could be causing this difference? Thanks!

mksparse avoids storing the empty parts of a file system. Assuming the same fill pattern is used, I would have to conclude that the content of the “rootfs/” subdirectory differs.
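To put rough numbers on the two summaries in the question, here is a small arithmetic sketch. It uses only the figures printed by mksparse there; the variable names are mine, and it ignores the small header overhead inside the sparse file.

```shell
# Figures copied from the two "Total" lines in the question.
raw_bytes=30064771072
raw_blocks=7340032
bad_blocks=5916       # non-working workstation
good_blocks=824331    # working workstation

# Bytes per block, derived from the raw size and raw block count.
block_size=$(( raw_bytes / raw_blocks ))

# Data actually kept in each sparse image, in MiB (rounded down).
bad_mib=$(( bad_blocks * block_size / 1048576 ))
good_mib=$(( good_blocks * block_size / 1048576 ))

echo "block size: ${block_size} bytes"
echo "non-working image keeps about ${bad_mib} MiB of data"
echo "working image keeps about ${good_mib} MiB of data"
```

That works out to roughly 23 MiB of stored data versus roughly 3220 MiB, which is far too small for a full rootfs and points at missing content rather than at mksparse itself.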

During the creation of the loopback mountable file system an empty file is created of the size desired. This is independent of any content in the image and would remain constant so long as the host hasn’t run out of disk space.

Next the file is covered by loopback so that block device tools can be used on it. This is the first stage where failure is common.

The first reason loopback would fail is if the loopback coverage was not performed as user root; regular users have restrictions. Often the loopback device does not exist yet, and only root can create such a device.

The second common cause of loopback failure is a bug in the flash.sh script. If your system already has “/dev/loop0”, the script assumes it can use it; “losetup --find” is only used when “/dev/loop0” does not exist. If something else (such as an encrypted file system) is already using loop0, then covering by loopback will fail.
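A quick way to see whether loop0 is already taken before flashing is to look at the output of “losetup -a”. This is a minimal sketch, assuming the usual util-linux listing where each attached device starts a line as “/dev/loopN:”; the function name is only for illustration.

```shell
# Reads a "losetup -a" listing on stdin.
# Exit status 0 means /dev/loop0 is already attached to something.
loop0_in_use() {
    grep -q '^/dev/loop0:'
}
```

You would run it as “sudo losetup -a | loop0_in_use && echo busy”. Separately, “sudo losetup --find” prints the first unused loop device, which is what you would want to use when loop0 is occupied.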

If for any reason loopback does not cover the file, then the file will be blank.
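If you suspect this happened, you can check whether a raw image is entirely blank. A rough sketch, assuming GNU stat and cmp (as on Ubuntu); the function name is made up for this post.

```shell
# Exit status 0 means every byte of the file is zero.
# Compares the file against /dev/zero for exactly the file's length.
is_all_zero() {
    img_size=$(stat -c %s "$1")
    cmp -s -n "$img_size" "$1" /dev/zero
}
```

This reads the whole file, so expect it to take a while on a full-size system.img.raw.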

Then the loopback device is formatted as ext4. If one host has different default options to mkfs.ext4, then the two will differ. If both systems have the same exact version of Ubuntu, then it stands to reason this won’t be the issue. You could look at “/etc/mke2fs.conf” on each to see for sure if the ext4 options are identical. An example might be a different block size or 64-bit option causing a difference.
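One way to do that comparison is to copy the file from one host to the other and diff the two. A small sketch; the function name and the idea of copying the other host's file over first are my own assumptions.

```shell
# $1: local /etc/mke2fs.conf
# $2: copy of /etc/mke2fs.conf taken from the other host
# Prints "identical", or "different" followed by a unified diff.
compare_mke2fs_conf() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        diff -u "$1" "$2" || true   # diff exits non-zero when files differ
    fi
}
```

For example: “compare_mke2fs_conf /etc/mke2fs.conf /tmp/mke2fs.conf.otherhost”.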

Once loopback covers the file, flash.sh edits some boot files in the “rootfs/boot/” subdirectory (which ones depends on the flash.sh options) and recursively copies the “rootfs/” content into the loopback image. If the options to flash.sh differed, then the sparse image will differ but the raw image will be identical in size (sha1sum would differ since some “/boot” files would differ). Unless root did the recursive copy the permissions would differ, but this would not change file system size unless a permission denied issue caused a copy to fail completely.

The odds are that on the machine where flash failed there will be a difference in what is in the “rootfs/” subdirectory. One example is if the disk space ran out and the file system was truncated; you won’t get an error for this, the flash will happily work with whatever is there. Probably the first thing I’d do is look at the output of “df -H” on the working and non-working systems and see if there is plenty of spare disk space left.
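Besides “df -H”, a cheap way to compare the “rootfs/” content on the two hosts is to count the entries in it. A minimal sketch; the function name is hypothetical, and it counts every file, directory and symlink below the given directory.

```shell
# Prints the number of entries (files, directories, symlinks, ...)
# below the directory given as $1, not counting the directory itself.
count_entries() {
    echo $(( $(find "$1" -mindepth 1 | wc -l) ))
}
```

Run it with sudo on both hosts, e.g. “sudo sh -c '. ./count.sh; count_entries rootfs'” if you saved the function in a file called count.sh (that file name is just an example). A large gap between the two counts means the rootfs was not fully populated on one of them.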

Taking a checksum of a large file takes a long time, but it is actually reasonable to take an md5sum of a raw image. You could compare the md5sum of the raw images from both hosts. Be warned that if a raw image has ever been loopback mounted after its creation was completed, then metadata changes will produce a different md5sum even if the files themselves are exact matches. The process of copying the raw image to sparse, or of flashing the sparse image, does not loopback mount the image, so it is safe to say that the actual flash/transfer/copy of the image to the Jetson does not alter anything in any way.
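The checksum comparison might look like the sketch below. The helper name is mine; it just strips the file name from the md5sum output so the values from the two hosts can be compared directly.

```shell
# Prints only the MD5 hex digest of the file given as $1.
image_md5() {
    md5sum "$1" | cut -d' ' -f1
}
```

Run “image_md5 system.img.raw” on each host (adjust the path to wherever your raw image lives) and compare the two strings.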

As an “acid test”, consider that flashing with system.img in sparse format or in raw format is transparent; both work equally well, it’s just that a raw image takes much longer to transfer over a slow USB2 cable. If you use the “-r” option on flash.sh to “re-use” the “bootloader/system.img” file, then you could copy the system.img from a working machine to the failing machine and flash with “-r” to tell it not to overwrite system.img and just use what is there. You could also take the “system.img.raw” from a working host, copy it to the failing host as “system.img”, use “-r”, and be guaranteed an exact image which has bypassed mksparse. Should the image from the working system succeed on the failing host, then you are guaranteed that it is the content of the system.img.raw or the system.img which is at issue, which in turn implies rootfs differences need to be closely examined.
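The raw-image variant of that test could be laid out as below. This only prints the commands rather than running them. Note the assumptions: the function name is made up, the second argument is whatever directory holds your flash.sh, and the “jetson-tx2 mmcblk0p1” arguments are the ones commonly used for a TX2, not something taken from this thread, so substitute whatever you normally pass.

```shell
# Print (do not run) the steps for flashing a raw image from a working host.
# $1: system.img.raw copied over from the working host
# $2: directory containing flash.sh on the failing host
print_acid_test_steps() {
    echo "cp '$1' '$2/bootloader/system.img'"
    # "jetson-tx2" and "mmcblk0p1" are assumed arguments; use your usual ones.
    echo "cd '$2' && sudo ./flash.sh -r jetson-tx2 mmcblk0p1"
}
```

Printing first lets you double check the paths before anything overwrites your existing “bootloader/system.img”.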

Thanks linuxdev! That was a lot of good information. My problem ended up being rootfs as you thought. It was not getting fully extracted from my repository correctly due to permissions issues and the extraction was failing silently.