Can we use Jetson AGX Xavier as host pc for flashing?

Thanks for the elaborate information.

Honestly, that is too much error to completely recover. If I were working on this, then what I’d do is clone in recovery mode, throw away the sparse clone, and then attempt to manually repair a loopback mounted raw clone copy on the host PC. You could save as much as possible from that and recreate it for install in any number of ways. The forum thread is getting longer, can you confirm what disk layout you have on this? For example, does it just boot to the internal eMMC, or is there any kind of external boot media involved?

This boots to the internal eMMC where RFS is mounted.

It also has an external SD card and an NVMe drive.

Note:
The unit is also at the customer site, where we don’t have access to it, and we want to fix it without reflashing.

Even setting up a host PC at the customer site is difficult.

Are the NVMe and SD card simply mounted somewhere as auxiliary disks? Assuming it is the eMMC which must be recovered, about the only hope of doing this without flash is if it can boot to at least command line. One could then rsync to the SD card or NVMe if you don’t have anything there you care about, and attempt to fix the eMMC. I doubt this is practical though, and once again, I think all you would do is to save some of the disk but still have to flash again.

Correct cloning requires the device in recovery mode, and this in turn means the Jetson becomes a custom USB device and won’t have any access other than through your host PC.

I know that this is a dilemma with no good outcomes whether you have to go there and flash or whether you get the unit sent to you (e.g., overnight express mail), but without a booted unit (text mode is ok if networking is present, or if you have an external disk) there really is no chance.

How can we fix this by doing an rsync between the external NVMe or SD card and the internal eMMC?
Please elaborate.

If we press alt+F2 when the boot stops partway, it enters console mode.
After this, what should be done to fix the booting issue?

I’ll try to answer this last question first. If alt+F2 works, then the system did boot. It is just the graphics which failed. I’m assuming it didn’t drop into a bash shell directly, and it required you to log in. Did you have to log in for access after the alt+F2? If so, then that is a good thing.

As far as rsync goes, this is not necessarily a fix. It is a mere chance, and the fix might have flaws in it. Furthermore, it might be that you can recover data like this, but not actually restore it (I’ll say more on that below). How much empty space is left on the external devices? Does one of them have as much empty space as the space used on the eMMC that you have issues with? Check the output of this for disk space information (this limits the listing to partitions with ext4 formatting, and this is mandatory: you cannot back up like this to a filesystem that is not a native Linux filesystem type):
df -H -T -t ext4
(please show the output of this)

Corruption in the partition is a flaw in the tree structure of the data nodes. There is a kind of linked list structure whereby a chunk of information is stored on the disk in one “node”; that node has addresses to parent and child nodes. By this method one can save a chunk, append to it by changing pointers, and so on. A file has a head node that is the directory. A file has other nodes for what is in it, for example, the text of a text file. The final node points to NULL as the tail of the chain of nodes. Somewhere your nodes are incorrect, so when a file or directory is changed it may point at something it should not be touching. The result of writing to anything might be further corruption whereby other unrelated content is destroyed. Thus the system won’t let you write to it, as protection against further corruption.

What you can do is read the data. There are two ways to read it; the first method is to read it as files and directories. You can use rsync to take what is there and copy it to a new partition, and since the partition has its own tree of nodes which are properly set up, then creating that old content onto the new tree will do so without improperly linked nodes. The down side: Some of the content will be wrong and might be from a cross link to a different file or directory. An example might be that a binary file has text in the middle of it from a text document. The new content would never corrupt further, but the old content which is already missing or incorrect would remain missing or incorrect. You could write that content onto a newly formatted eMMC and “hope” for whatever is missing or wrong to not matter. You might not find out for a long time the extent of the damage.

The second method is to copy the partition as binary data. This is the realm of dd instead of rsync. This is how I do disk recovery operations on a failing disk. It also won’t fix corrupted or missing content, but it gives you more power in recovering missing files and data content (there are special tools). Doing this kind of recovery directly off of a failing disk (failing hardware) is risky, and if you use dd first so that you can work on a copy, the results are much better. In your case you do not have a failing eMMC, it is just corrupt data. Still, if you ever want a copy of what is there that has the highest chance of recovery through more specialized tools, then this is pretty much mandatory. There is a minor possibility of recovering everything with a lot of work on a dd image of the partition (which has a bigger learning curve and takes more time). The advantage is that rsync can be run from the dd image just as it can from the actual eMMC, so if you have the dd image, you can still use the other methods. The rsync method has no further options; you get what you get.

I will add that the tools which can automatically attempt recovery work on the loopback mounted dd partition from a separate host PC. Everything can be done on this from another computer. You can obtain the dd over the network from the Jetson to your host PC at a remote location, or you can dd to the local NVMe or SD card. dd won’t care about the filesystem type on the SD card. rsync can go to the local SD card if and only if the filesystem type on the SD card is a Linux type (e.g., ext4 will work, but VFAT or NTFS will not).
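As a rough sketch of the dd route (not a tested recovery procedure), the helper below simply wraps a plain dd raw copy. The function name image_partition, the device /dev/mmcblk0p1, and the paths in the comments are assumptions for illustration only; check which device actually holds the rootfs before running anything like this.

```shell
#!/bin/sh
# image_partition SRC DST
# Raw, byte-for-byte copy of SRC (a partition or file) into the file DST.
# dd does not interpret the filesystem, so DST can live on any filesystem type.
image_partition() {
    dd if="$1" of="$2" bs=4M
}

# Hypothetical use on the Jetson, as root (device and path are assumptions):
#   image_partition /dev/mmcblk0p1 /mnt/nvme/emmc_rootfs.img
#
# Hypothetical follow-up on a separate host PC, working on a copy of the image:
#   cp emmc_rootfs.img emmc_copy.img
#   e2fsck -f -n emmc_copy.img            # check only, changes nothing
#   sudo mount -o loop emmc_copy.img /mnt/clone
```

Working on a copy of the image means the original dd result stays available if a repair attempt makes things worse.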

It takes a long time on any slow network to dd or rsync over to another computer. rsync is faster because it can compress. If you have the time though, dd is a better result IMHO.

I’m showing you a lot of pros and cons, and have not really answered your question. All of the above is to get the original data, and you cannot fix this without the original data. Once you have suitable data, you probably have to format the original eMMC partition as ext4 and then restore the data onto it. Writing to a partition that is actually in use has a high probability of failure. There are all kinds of things that can get in the way or go wrong. If you have the data ahead of time, then it won’t matter what goes wrong; you still have data to try again with.

How important is the data? How fast is the network between your systems? What space is consumed on the eMMC, and how much empty space is there on the NVMe? How large is the SD card, and is it formatted as ext4? There is a lot you can do from command line after a normal alt+F2, and a lot more if you have networking. It is faster to copy that data to a local NVMe, but then you might have to put the NVMe on your local computer to work on it. Networking lets you directly copy to your host PC. Describe what resources you have for networking and host PC (including local and/or remote).

Yes, I suppose so, going by the customer’s updates.

The data is on the eMMC, where the root filesystem is flashed and mounted. It also holds all the other software and JetPack components the customer needs.

The network option is ruled out, since the site is highly secured and we cannot get network access. The only option left is a local copy to the external SD card or the NVMe M.2 drive.

The eMMC is 64 GB, the SD card is 64 GB, and the NVMe is 500 GB.
Both the SD card and the NVMe are formatted as ext4, I suppose.

However, we are currently trying to fix this issue by physically removing the SD card and NVMe from the carrier board and then checking whether the unit boots correctly.

The next option is to go to the customer site with a host PC set up on a laptop and try reflashing. I will let you know in the coming days how the debugging and troubleshooting steps progress.

Thanks a lot for your precious thorough elaborative information.

The NVMe will be your destination in this case. If when complete the file created is small enough to fit on SD, then you could put it there as well. Networking to retrieve the file is an option if you get permission, but being there physically would be a lot faster.

I will state ahead of time that when a filesystem is corrupt, that copy via rsync may stop. It is a hit and miss test. The dd method cannot be used without bad things happening if the filesystem is in use, although it might be an option if you shut down any known running programs and operate only from the NVMe. Cloning from a recovery mode Jetson while being physically present is the superior method, and is the same result as a dd clone if the clone is of a read-only or unmounted filesystem. If things go badly, then we might be able to find a way to manually remount the rootfs read-only while keeping the NVMe read/write.

What follows is the easiest thing to do, and it does not require you to be there. ssh login is fine so long as you have sudo access. Test that out with something like “sudo ls”. To drop into a root shell you can run “sudo -s”. You’ll want to be in the root shell for what follows.

Go to the NVMe. Be certain it has more than 64 GB remaining space (“df -H -T .” shows that location’s filesystem, because “.” is an alias for “the current directory”). Find an empty directory. I suggest you create one like “clone1” for the first clone attempt. This will take significant time, so if you like coffee, I suggest having some handy.
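A minimal sketch of that free-space check, assuming the portable “df -P” output layout (available space in the fourth column). The helper names avail_kib and has_room are made up for this example.

```shell
#!/bin/sh
# avail_kib DIR: print the free space, in KiB, of the filesystem holding DIR.
avail_kib() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# has_room NEEDED_KIB DIR: succeed if DIR's filesystem has at least
# NEEDED_KIB KiB available.
has_room() {
    [ "$(avail_kib "$2")" -ge "$1" ]
}

# Example, run from the intended destination directory on the NVMe.
# 67108864 KiB is 64 GiB, a slightly conservative stand-in for the 64 GB eMMC:
#   if has_room 67108864 .; then echo "enough space"; else echo "too small"; fi
```

For an rsync copy only the used space on the eMMC is needed, so the full 64 GB figure is the safe upper bound rather than a strict requirement.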

We will call the destination directory for the clone on the NVMe “/clone1”. More likely it is some other subdirectory, but use your imagination. This will contain a mirror of the existing eMMC without tree corruption, but it will be affected by any errors in the original tree. You won’t need to worry about further corruption when you read from or write to this location. An image can be made from this. The flash content on a host PC can be updated to flash this content if you wish, or even to put edited parts of this onto the flash content of the host PC when flash time comes (meaning the flash will give you back a working system with your software running on it).

A very useful and important note about one rsync option that is a safety. You can use “--dry-run”, and you will see all operations as they would happen if this were a real run. Nothing will be done. If you do this and it looks right, then you can remove “--dry-run” and actually make this happen. You could discover things like filesystem errors getting in the way, but more importantly, you could find it is pulling or placing files in the wrong place, or running into permission issues. I might show a command twice, once with --dry-run, once without.
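If you want to see what --dry-run does before touching the real disk, here is a small sketch using scratch directories. The function names and the /tmp paths are hypothetical; only the rsync options -a, -v and --dry-run are taken as given.

```shell
#!/bin/sh
# preview_copy SRC DST: list what rsync would copy from SRC to DST.
# Nothing is written because of --dry-run.
preview_copy() {
    rsync --dry-run -av "$1/" "$2/"
}

# real_copy SRC DST: the same command with --dry-run removed.
real_copy() {
    rsync -av "$1/" "$2/"
}

# Example on throwaway directories:
#   mkdir -p /tmp/demo_src /tmp/demo_dst
#   echo hello > /tmp/demo_src/a.txt
#   preview_copy /tmp/demo_src /tmp/demo_dst   # a.txt is listed, dst stays empty
#   real_copy    /tmp/demo_src /tmp/demo_dst   # a.txt now exists in dst
```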

  • sudo -s
  • cd /clone1
  • Substitute any “/clone1” with your location.
  • Version 1, with --dry-run:
rsync --dry-run -avcrltxAP --info=progress2,stats2 --numeric-ids --exclude '.gvfs' --exclude 'lost+found' '/' /clone1

One important thing to know is that the “-x” option says to not cross filesystems. The mount point of the SD card will be ignored (the mount point would be recorded, the content not copied).

The reason we don’t include the “lost+found” is that this is part of any ext4 partition. However, because your destination location is a subdirectory, and not a mount point, this means your own system won’t have a lost+found/ subdirectory within your subdirectory. You could in fact copy lost+found/ as well. This location is reserved for content which filesystem repair has trimmed and removed…it is the destination of node “pruning”. I am going to go ahead in the next version of this and suggest you go ahead and run this without exclusion of lost+found/ since it could be useful if anything has already been pruned, and won’t harm anything since you are in a subdirectory. This version might allow you to better recover and know what was lost:

  • Version 2, with --dry-run:
rsync --dry-run -avcrltxAP --info=progress2,stats2 --numeric-ids --exclude '.gvfs'  '/' /clone1

It is important to know if there is a location that you do not want copied, then you can use “--exclude '/some/where'”, and then dry run to see if it is what you want. rsync is fairly reliable so I don’t expect any significant errors at this point. We exclude .gvfs because it is a pseudo-filesystem and not part of the disk, and although this would not be crossed (it is a filesystem boundary) it will give you a lot of errors since it is security based and won’t allow users to read it.

Here is the final suggested version. I will add logging to this so you have a record of all that happened and all that failed.

  • Final version, with logging and no --dry-run:
rsync -avcrltxAP --info=progress2,stats2 --numeric-ids --exclude '.gvfs'  '/' /clone1 2>&1 | tee log_rsync.txt

The log will be log_rsync.txt.
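To pull just the failures out of log_rsync.txt afterwards, something like this sketch may help. It assumes rsync’s own messages start with “rsync:”, “rsync error:” or “rsync warning:”, which is how they normally appear; the function name rsync_errors is made up here.

```shell
#!/bin/sh
# rsync_errors LOGFILE: print only the lines rsync itself reported as
# errors or warnings.
# The progress display rewrites its line using carriage returns, so those
# are turned into newlines first; otherwise a message could be hidden
# behind a progress update on the same line.
rsync_errors() {
    tr '\r' '\n' < "$1" | grep -E '^rsync( error| warning)?:'
}

# Example:
#   rsync_errors log_rsync.txt
```

An empty result only means rsync did not report a problem; content that was already cross-linked or damaged on the eMMC is copied as it is, without any error.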

If the devices themselves do not error, this should do the job. We can talk about how to use this to create a new system. It is quite difficult to do so on a running system, but there can be compromises (not necessarily acceptable, but maybe acceptable) to use rsync on a repair of that image.

Let us know if you have the files of the existing system on the NVMe, and also if you are able to access a host PC there for flash. Flash can use that image.

Just a reminder, if you are going to end up there in person, then a clone via recovery mode would be a superior method. Even so, it is good to have the rsync backup. Having both gives you an enormous amount of room to rescue whatever remains intact.