tegraflash_internal.py does not erase the partition before writing

I’m using Jetpack 4.5.1 with TX2.
I was trying to flash just the APP partition using
./flash.sh -r -k APP jetson-tx2 mmcblk0p1
and got a lot of ext4 errors while the kernel was booting.
So I started to investigate:

  1. Flash whole tx2: ./flash.sh jetson-tx2 mmcblk0p1
  2. Interrupt boot at U-boot
  3. Dump APP partition: ./flash.sh -r -k APP -G dump jetson-tx2 mmcblk0p1
  4. It matches with bootloader/system.img
  5. Flash just APP: ./flash.sh -r -k APP jetson-tx2 mmcblk0p1
  6. Interrupt at u-boot and dump. Still matches bootloader/system.img
  7. Boot linux and finish setup
  8. Re-flash just APP: ./flash.sh -r -k APP jetson-tx2 mmcblk0p1
  9. Dump APP - DOES NOT MATCH bootloader/system.img
  10. Add tegraflash_erase_partition(partition_name) before tegraflash_write_partition in bootloader/tegraflash_internal.py:522
  11. Re-flash just APP, dump and validate - Now it matches bootloader/system.img

It looks like the erase step is missing. Should it be fixed?

There is no “specific” erase step (so far as I know, flashing just overwrites), but the APP partition should still match after “-r -k APP” no matter how many times it is performed. I can’t confirm whether this is what happens, but it would be interesting to clone the differing APP partition and compare it to “bootloader/system.img.raw”. What I would compare:

  • Exact clone and system.img.raw sizes;
  • Extract the first 32 bytes or so at the front of the clone and system.img.raw and compare hexadecimal;
  • Extract the last 32 bytes or so at the end of the clone and system.img.raw and compare hexadecimal;
  • Loopback mount both and compare the sha1sum of all content in the “/boot” directory.
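The first three checks in the list above can be scripted. What follows is only a minimal sketch: the function name compare_edges and the file names in the usage note are illustrative, not part of the flash tools. It reads just the two ends of each file, so it stays fast even on multi-gigabyte raw images.

```python
import os


def compare_edges(path_a, path_b, nbytes=32):
    """Compare sizes and the first/last nbytes of two image files."""
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)

    def read_at(path, offset, length):
        # Read `length` bytes starting at `offset` without loading the file.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    head_a = read_at(path_a, 0, nbytes)
    head_b = read_at(path_b, 0, nbytes)
    tail_a = read_at(path_a, max(size_a - nbytes, 0), nbytes)
    tail_b = read_at(path_b, max(size_b - nbytes, 0), nbytes)

    return {
        "sizes": (size_a, size_b),
        "same_size": size_a == size_b,
        "head_hex": (head_a.hex(), head_b.hex()),
        "same_head": head_a == head_b,
        "tail_hex": (tail_a.hex(), tail_b.hex()),
        "same_tail": tail_a == tail_b,
    }
```

For example, compare_edges("my_backup.img.raw", "bootloader/system.img.raw") returns the sizes and hex strings side by side. The loopback-mount comparison of “/boot” is left out because it needs root privileges.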

Note that from the “Linux_for_Tegra/” directory you can clone via:
sudo ./flash.sh -r -k APP -G my_backup.img jetson-tx2 mmcblk0p1

A clone produces both a sparse image ("my_backup.img" in the above) which is useless for most comparison purposes, and also produces a raw image ("my_backup.img.raw" in the above). This raw image should be an exact bit-for-bit match to the “bootloader/system.img.raw” used in the APP partition flash…but only if boot has never reached the Linux kernel and thus first boot changes don’t exist.

PS: You need a lot of spare disk space on the host PC to clone.

I didn’t bother to compare the system.img.raw images, because the sparse images are equal. I checked them using sha256sum.
As I stated before, I dumped and compared the sparse images after each step, and once the kernel had booted and modified the partition, I could not get a match, even after reflashing APP.
I think this is due to how flashing sparse images works: cboot doesn’t flash the zero parts of a sparse image because it assumes they are already zero.

Also, when I added tegraflash_erase_partition(partition_name), the sparse images began to match the original again (using sha256sum), even after a kernel boot and reflash, so I have made myself a workaround =)

I don’t want to repeat all steps again to check if .raw image hashes are the same.
Let’s assume if hashes of sparse images are the same then hashes of .raw images will match too. I don’t think it’s possible to be otherwise, because sparse images are generated from “my_backup.img.raw”, which is read from TX2.
The other tests you proposed are also covered by the hash comparison of the sparse images, so I don’t think we would get any new information from them.

Yes. The sparse image does not program large portions/blocks of data that are in the erased state. If you program a sparse image of the application to the partition, it assumes the eMMC has been properly erased. I can confirm your findings, as I too have been burned by reprogramming just the application partition using the sparse image. Using the non-sparse image works (albeit terribly slowly), because it forces programming of the entire partition, including those parts of the image that contain the erased state values.
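A toy model may make the failure mode concrete. The sketch below is not the cboot or tegraflash code: sparse_write and erase are stand-in names, the “device” is just a bytearray, and the erased state is assumed to read back as zeros. It only illustrates why skipping zero blocks is safe on an erased device and unsafe on a used one.

```python
BLOCK = 4096


def sparse_write(device, image, block=BLOCK):
    """Copy only the non-zero blocks of `image` onto `device` (a bytearray).

    Zero blocks are skipped, on the assumption that the device already
    holds zeros there.
    """
    zero = bytes(block)
    for off in range(0, len(image), block):
        chunk = image[off:off + block]
        if chunk != zero[:len(chunk)]:
            device[off:off + len(chunk)] = chunk


def erase(device):
    """Reset every byte of the device to zero (modelled erased state)."""
    device[:] = bytes(len(device))


# Image with a zero block in the middle: [data][zeros][data]
image = b"\x11" * BLOCK + bytes(BLOCK) + b"\x22" * BLOCK

# Case 1: freshly erased device, sparse write reproduces the image.
fresh = bytearray(len(image))
sparse_write(fresh, image)
fresh_ok = bytes(fresh) == image

# Case 2: device holds stale data (e.g. after first boot), no erase.
used = bytearray(b"\xee" * len(image))
sparse_write(used, image)
stale_ok = bytes(used) == image

# Case 3: same used device, but erased before the sparse write.
erase(used)
sparse_write(used, image)
erased_ok = bytes(used) == image

print(fresh_ok, stale_ok, erased_ok)  # True False True
```

In case 2 the middle block keeps its stale 0xee bytes, which corresponds to the mismatch seen after booting Linux and reflashing only APP.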

The end content is equal for raw versus sparse, but the containing file is not. The reason I mention the raw file is that it allows comparing via a checksum (I use sha1sum, but it could be something faster, e.g., crc32). The clone would have to compress the same way as the host PC to guarantee a cloned sparse image is the same as a system.img sparse file…they could be the same, but I don’t know for sure, so I suggest comparing raw files instead. What we do know is that a partition which should be an exact match is not an exact match, so the task is to find where the actual content departs from what was expected. Comparison of the raw image is the acid test.

I do have to wonder why you would need to run tegraflash_erase_partition. It shouldn’t matter. I am interested because you are right that the flash procedures should always result in the partition being the same as the system.img (or system.img.raw when expanded). @JDSchroeder brings up a good point: sparse images do not rewrite the unused space, so I am thinking your comparison might also be looking at what was blank space. It is a typical theme that a filesystem won’t actually overwrite the data of deleted files, but will simply mark the node unused. That would be an example of how a raw file and a sparse file are not the same thing…when comparison of unused space matters.

That brings up another topic: How are you determining that the flashed content matches or does not match?

You’re giving mksparse too much credit and overestimating the complexity and smarts of the utility.

The NVIDIA mksparse does not operate on the file system structure. It simply looks for blocks in the erased (all-zero) state, 4096 bytes each by default, and eliminates them from the programming sequence. So if you have any files (and you most definitely will with a multi-gigabyte file system) that contain large runs of zeros, then mksparse will eliminate those blocks from programming. When your system boots, rather than having zeros in that section of the file, it will have whatever junk was previously stored there.

Here is a sequence of commands that shows inductively how mksparse works:

nvidia@desktop$ dd if=/dev/zero of=/tmp/test1.raw bs=4096 count=1000
nvidia@desktop$ dd if=/dev/urandom of=/tmp/test2.raw bs=4096 count=1000
nvidia@desktop$ dd if=/dev/urandom of=/tmp/test3.raw bs=4096 count=1000
nvidia@desktop$ dd if=/dev/zero of=/tmp/test3.raw bs=4096 seek=5 count=50 conv=notrunc
nvidia@desktop$ ./bootloader/mksparse /tmp/test1.raw /tmp/test1.img
nvidia@desktop$ ./bootloader/mksparse /tmp/test2.raw /tmp/test2.img
nvidia@desktop$ ./bootloader/mksparse /tmp/test3.raw /tmp/test3.img
nvidia@desktop$ ls -l /tmp/test*
-rw-r--r-- 1 nvidia nvidia 4096000 Apr 28 11:43 /tmp/test1.raw
-rwxr-xr-x 1 nvidia nvidia 40 Apr 28 11:45 /tmp/test1.img
-rw-r--r-- 1 nvidia nvidia 4096000 Apr 28 11:44 /tmp/test2.raw
-rwxr-xr-x 1 nvidia nvidia 4096040 Apr 28 11:45 /tmp/test2.img
-rw-r--r-- 1 nvidia nvidia 4096000 Apr 28 11:53 /tmp/test3.raw
-rwxr-xr-x 1 nvidia nvidia 3891264 Apr 28 11:53 /tmp/test3.img
nvidia@desktop$ xxd /tmp/test1.img
00000000: 3aff 26ed 0100 0000 1c00 0c00 0010 0000 :.&.............
00000010: e803 0000 0100 0000 0000 0000 c3ca 0000 ................
00000020: e803 0000 0c00 0000 ........

Notice that none of the files are file system images, yet mksparse gladly removes the zeroed data from the image file it outputs.
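The 40-byte test1.img dump above can be decoded by hand. The sketch below assumes the output follows the Android sparse image layout (a 28-byte file header followed by 12-byte chunk headers), which is my reading of the magic bytes 3aff 26ed and the header sizes in the dump rather than something documented in this thread. The field names and the parse_sparse helper are illustrative.

```python
import struct

SPARSE_MAGIC = 0xED26FF3A  # stored little-endian as 3a ff 26 ed
CHUNK_TYPES = {0xCAC1: "RAW", 0xCAC2: "FILL", 0xCAC3: "DONT_CARE", 0xCAC4: "CRC32"}


def parse_sparse(data):
    """Decode the file header and chunk headers of a sparse image in memory."""
    (magic, major, minor, file_hdr_sz, chunk_hdr_sz,
     blk_sz, total_blks, total_chunks, checksum) = struct.unpack_from("<IHHHHIIII", data, 0)
    if magic != SPARSE_MAGIC:
        raise ValueError("not a sparse image (bad magic)")

    chunks = []
    offset = file_hdr_sz
    for _ in range(total_chunks):
        ctype, _reserved, n_blocks, total_sz = struct.unpack_from("<HHII", data, offset)
        chunks.append((CHUNK_TYPES.get(ctype, hex(ctype)), n_blocks, total_sz))
        offset += total_sz  # total_sz covers the chunk header plus its payload

    return {
        "version": (major, minor),
        "block_size": blk_sz,
        "total_blocks": total_blks,
        "chunks": chunks,
    }


# The exact bytes from the xxd output of /tmp/test1.img
test1_img = bytes.fromhex(
    "3aff26ed 01000000 1c000c00 00100000"
    "e8030000 01000000 00000000 c3ca0000"
    "e8030000 0c000000"
)
info = parse_sparse(test1_img)
print(info)
# {'version': (1, 0), 'block_size': 4096, 'total_blocks': 1000,
#  'chunks': [('DONT_CARE', 1000, 12)]}
```

Under that reading, all 1000 zero blocks of test1.raw collapse into a single “don’t care” chunk with no payload, so a flasher that honors the chunk type has nothing to write for them. The same arithmetic explains test3.img: 950 data blocks of 4096 bytes, plus a 28-byte header and three 12-byte chunk headers, gives 3891264 bytes.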

If you are still unconvinced, the test is to create a file inside your file system image that has a large sequence of zeros in it and check the file when you boot up your system. You will see the file no longer has all of those zeros in it because they were never programmed, even though they are part of the valid file system contents.

  1. ./flash.sh -r -k APP -G dump jetson-tx2 mmcblk0p1
  2. sha256sum dump
  3. compare result of “sha256sum dump” to “sha256sum bootloader/system.img”

Since “dump” is a sparse image, generated from the RAW image as the last step of “./flash.sh -G” (which first reads the whole partition from the TX2 into “dump.raw”), I can safely assume that if the hash of dump matches the hash of bootloader/system.img, then the whole partition on the TX2, including zeroed blocks, matches bootloader/system.img.raw.
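The hash comparison in the steps above can also be done from a script. This is a minimal sketch with illustrative helper names (sha256_file, images_match). It reads in 1 MiB chunks so that a multi-gigabyte image never has to fit in memory, and it should produce the same digest as sha256sum.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_match(path_a, path_b):
    """True when both files hash to the same SHA-256 digest."""
    return sha256_file(path_a) == sha256_file(path_b)
```

For example, images_match("dump", "bootloader/system.img") mirrors step 3, and the same call works on the .raw files.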

And since adding the erase step worked as expected, I think an erase step should be added when writing sparse images to partitions.

You are correct, but my thought was more or less concentrating on whether or not the application which reads the partition considers unused content. The subtle point I am thinking about is this: dd does not care about the underlying filesystem structure, and has no idea what ext4 is…to compare if two images are equal, one really needs to know if the ext4 is equal. I consider it a weakness in the tool if flashing does not zero the “non-content” parts of ext4, but this was actually done for a reason (speed of flash). You are perfectly correct, I just am looking at the behavior for different reasons.

@sshmarov: See the above. Your flash may in fact be a perfectly valid clone if you consider only the ext4 aware content. The dd tool looks at everything, but the reality is that only the ext4 part of that partition matters for the APP/rootfs partition. The test to see if this is the issue would be to use the failing flash method, but to copy the “bootloader/system.img.raw” file in place of “bootloader/system.img”. The reason for doing so is that the raw file contains the zero content and not just the ext4 content. It takes longer to flash, but it is equivalent to manually flashing the zero content (to some extent, flashing the raw image is equivalent to flashing the sparse content plus erasing). I am going to guess that, since you are cloning and comparing sparse images, there is in fact a defect, but you won’t know unless you try working with raw images once without the erase. If raw images also fail, then the bug differs from the case where raw images work and only sparse images differ.
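One way to tell the two cases apart is to compare the raw clone with system.img.raw block by block and record where the differences fall. The sketch below is an illustration only: classify_mismatches is a hypothetical helper, and the 4096-byte block size is taken from the mksparse default mentioned earlier in the thread.

```python
def classify_mismatches(reference_path, clone_path, block=4096):
    """Count differing blocks, split by whether the reference block is all zeros."""
    counts = {
        "total_blocks": 0,
        "mismatch_in_zero_blocks": 0,   # reference is all zeros, clone is not
        "mismatch_in_data_blocks": 0,   # reference holds data, clone differs
    }
    with open(reference_path, "rb") as ref, open(clone_path, "rb") as clone:
        while True:
            a = ref.read(block)
            b = clone.read(block)
            if not a and not b:
                break
            counts["total_blocks"] += 1
            if a != b:
                if a and a.count(0) == len(a):
                    counts["mismatch_in_zero_blocks"] += 1
                else:
                    # Also covers the case where the reference is shorter.
                    counts["mismatch_in_data_blocks"] += 1
    return counts
```

If every mismatch lands in mismatch_in_zero_blocks, the result is consistent with the sparse flash skipping zero blocks on a partition that was not erased. Any count in mismatch_in_data_blocks would point to a different problem.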