Thanks for the elaborate information.
Honestly, that is too much corruption to recover from completely. If I were working on this, I’d clone in recovery mode, throw away the sparse clone, and then attempt to manually repair a loopback-mounted copy of the raw clone on the host PC. You could save as much as possible from that and recreate it for install in any number of ways. Since this thread is getting long, can you confirm what disk layout you have on this? For example, does it just boot to the internal eMMC, or is there any kind of external boot media involved?
This boots to the internal eMMC, where the root filesystem (RFS) is mounted.
It has an external SD card and an NVMe drive as well.
Note: the unit is at the customer site, where we don’t have access to it, and we want to fix it without reflashing.
Even setting up a host PC at the customer site is difficult.
Are the NVMe and SD card simply mounted somewhere as auxiliary disks? Assuming it is the eMMC which must be recovered, about the only hope of doing this without flash is if it can boot to at least command line. One could then rsync to the SD card or NVMe if you don’t have anything there you care about, and attempt to fix the eMMC. I doubt this is practical though, and once again, I think all you would do is to save some of the disk but still have to flash again.
Correct cloning requires the device in recovery mode, and this in turn means the Jetson becomes a custom USB device and won’t have any access other than through your host PC.
I know that this is a dilemma with no good outcomes whether you have to go there and flash or whether you get the unit sent to you (e.g., overnight express mail), but without a booted unit (text mode is ok if networking is present, or if you have an external disk) there really is no chance.
How can we fix this by doing an rsync between the external NVMe or SD card and the internal eMMC? Please elaborate.
If we press alt+F2 when booting stops in the middle, it enters console mode.
After this, what should be done to fix the booting issue?
I’ll try to answer this last question first. If alt+F2 works, then the system did boot. It is just the graphics which failed. I’m assuming it didn’t drop into a bash shell directly, and it required you to log in. Did you have to log in for access after the alt+F2? If so, then that is a good thing.
As far as rsync goes, this is not necessarily a fix. It is a mere chance, and the fix might have flaws in it. Furthermore, it might be that you can recover data like this, but not actually restore (I’ll say more on that below). How much empty space is left on the external devices? Does one of them have as much empty space as that which is used in the eMMC that you have issues on? Check the output of this for disk space information (this limits to partitions with ext4 formatting, and this is mandatory…you cannot back up like this to other filesystem types if they are not a native Linux filesystem type):
df -H -T -t ext4
(please show the output of this)
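The space question above can be checked with a small script. This is only a sketch under assumptions: the function names are made up for illustration, and the “/mnt/nvme” path in the usage comment is a placeholder for wherever the NVMe is actually mounted. It compares the space used on the source filesystem against the space available on the destination, which is the comparison that matters for an rsync copy.

```shell
#!/bin/sh
# Sketch only: function names and example paths are assumptions.

# Print the used kilobytes of the filesystem holding path $1.
used_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $3 }'
}

# Print the available kilobytes of the filesystem holding path $1.
avail_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Return 0 if the filesystem holding $2 has more free space
# than the filesystem holding $1 currently uses.
fits() {
    [ "$(avail_kb "$2")" -gt "$(used_kb "$1")" ]
}

# Example (paths are placeholders):
#   fits / /mnt/nvme && echo "NVMe has room for an rsync copy of the rootfs"
```

For a dd image the destination needs room for the whole partition, not just the used part, so compare against the partition size instead in that case.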
Corruption in the partition is a flaw in the tree structure of the data nodes. There is a kind of linked list structure whereby a chunk of information is memorized on the disk in one “node”; that node has addresses to parent and child nodes. By this method one can save a chunk, append it by changing pointers, so on. A file has a head node that is the directory. A file has other nodes to what is in it, for example, the text of a text file. The final node points to NULL as the tail of the chain of nodes. Somewhere your nodes are incorrect, and files and directories may point to something to edit when changing them that they shouldn’t be looking at. The result of writing to anything might be further corruption whereby other unrelated content is destroyed. Thus the system won’t let you write to it due to protection against further corruption.
What you can do is read the data. There are two ways to read it; the first method is to read it as files and directories. You can use rsync to take what is there and copy it to a new partition, and since the partition has its own tree of nodes which are properly set up, then creating that old content onto the new tree will do so without improperly linked nodes. The down side: Some of the content will be wrong and might be from a cross link to a different file or directory. An example might be that a binary file has text in the middle of it from a text document. The new content would never corrupt further, but the old content which is already missing or incorrect would remain missing or incorrect. You could write that content onto a newly formatted eMMC and “hope” for whatever is missing or wrong to not matter. You might not find out for a long time the extent of the damage.
The second method is to copy the partition as binary data. This is the realm of dd instead of rsync. This is how I do disk recovery operations on a failing disk, and this also won’t fix corrupted or missing content, but it gives you more power in recovering missing files and data content (there are special tools). Doing this kind of recovery directly off of a failing disk (failing hardware) is risky, and if you use dd first so that you can work on a copy, the results are much much better. In your case you do not have a failing eMMC, it is just corrupt data. Still, if you ever want a copy of what is there that has the highest chance of recovery through more specialized tools, then this is pretty much mandatory. There is a minor possibility of recovering everything with a lot of work on a dd image of the partition (which takes a bigger learning curve and more time). The advantage is that you can run rsync against a dd image of the partition just as you would against the actual eMMC, so if you have the dd image, then you can still use other methods. The rsync method has no further options, you get what you get.
I will add that the tools which can automatically attempt recovery work on the loopback mounted dd partition from a separate host PC. Everything can be done on this from another computer. You can obtain the dd image over the network from the Jetson to your host PC at a remote location, or you can dd to the local NVMe or SD card. dd won’t care about the filesystem type on the SD card. rsync can go to the local SD card if and only if the filesystem type on the SD card is a Linux type (e.g., ext4 will work, but VFAT or NTFS will not).
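As a rough illustration of the dd route, here is a minimal sketch. The function name is made up, and the device name “/dev/mmcblk0p1” and the “/mnt/nvme” and “/mnt/recover” paths in the comments are assumptions; check the real names with lsblk or df before running anything. The source should be unmounted or read-only while it is copied, otherwise the image (and the cmp check) may not be consistent.

```shell
#!/bin/sh
# Sketch only: the function name and all example paths are assumptions.

# image_partition SRC DST
# Copy SRC (a partition device or a file) byte for byte into the file DST,
# then verify that the copy is identical to the source.
image_partition() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=1M && cmp "$src" "$dst"
}

# Example, run as root on the Jetson (device and paths are placeholders):
#   image_partition /dev/mmcblk0p1 /mnt/nvme/emmc_rootfs.img
#
# Later, on a host PC or on the Jetson itself, the image can be examined
# without touching the original eMMC:
#   e2fsck -fn /mnt/nvme/emmc_rootfs.img      # read-only check, changes nothing
#   mkdir -p /mnt/recover
#   mount -o loop,ro /mnt/nvme/emmc_rootfs.img /mnt/recover
```

Working on the image rather than the eMMC means any repair attempt that goes wrong can be thrown away by making a fresh copy of the image file.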
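It takes a long time on any slow network to dd or rsync over to another computer. rsync is usually faster because it copies only the files in use rather than every block of the partition, and it can compress the transfer (the -z option). If you have the time though, dd is a better result IMHO.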
I’m showing you a lot of pros and cons, and have not really answered your question. All of the above is to get the original data, and you cannot fix this without the original data. Once you have suitable data, then you probably have to format the original eMMC partition as ext4 and then restore data on it. This can have some really high probabilities of failure to write to a partition that is actually in use. There are all kinds of things that can get in the way or go wrong. If you have the data ahead of time, then it won’t matter what goes wrong, you still have data to try with again.
How important is the data? How fast is the network between your systems? What space is consumed on the eMMC, and how much empty space is there on the NVMe? How large is the SD card, and is it formatted as ext4? There is a lot you can do from command line after a normal alt+F2, and a lot more if you have networking. It is faster to copy that data to a local NVMe, but then you might have to put the NVMe on your local computer to work on it. Networking lets you directly copy to your host PC. Describe what resources you have for networking and host PC (including local and/or remote).
Yes. As per the customer’s updates, I suppose the data is on the eMMC, where the root filesystem is flashed and mounted. It also has the JetPack components and all the other software the customer needs installed.
The network option is ruled out: we cannot have network access because the site is very secure. The only option left is a local copy to the external SD card or NVMe M.2 drive.
The eMMC is 64 GB, the SD card is 64 GB, and the NVMe is 500 GB.
Both the SD card and the NVMe are formatted as ext4, I suppose.
However, we are currently trying to fix this issue by physically removing the SD card and NVMe from the carrier boards and checking whether the unit then boots correctly.
The next option is to go to the customer site with a host PC set up on a laptop and try reflashing. I will let you know how the debugging and troubleshooting steps progress in the coming days.
Thanks a lot for your thorough and detailed information.
The NVMe will be your destination in this case. If when complete the file created is small enough to fit on SD, then you could put it there as well. Networking to retrieve the file is an option if you get permission, but being there physically would be a lot faster.
I will state ahead of time that when a filesystem is corrupt, that copy via rsync may stop. It is a hit and miss test. The dd method cannot be used without bad things happening if the filesystem is in use, although it might be an option if you shut down any known running programs and operate only from the NVMe. Cloning from a recovery mode Jetson while being physically present is the superior method, and is the same result as a dd clone if the clone is of a read-only or unmounted filesystem. If things go badly, then we might be able to find a way to manually remount the rootfs read-only while keeping the NVMe read/write.
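For the read-only remount idea, a minimal sketch follows. The helper name is made up, and the remount command in the comment is the generic Linux one; whether it succeeds on a running Jetson depends on which programs still hold files open for writing, so treat that part as an assumption to test on a spare unit. The helper only reports whether a mount point is currently listed as read-only.

```shell
#!/bin/sh
# Sketch only: the helper name and example paths are assumptions.

# is_readonly MOUNTPOINT [MOUNTS_FILE]
# Return 0 if MOUNTPOINT is listed with the "ro" option in MOUNTS_FILE
# (default /proc/mounts), 1 if it is read/write or not listed at all.
is_readonly() {
    awk -v mp="$1" '
        $2 == mp { opts = $4; found = 1 }   # the last matching entry wins
        END {
            if (!found) exit 1
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }' "${2:-/proc/mounts}"
}

# Example, run as root (may fail with "mount point is busy"):
#   mount -o remount,ro /
#   is_readonly / && echo "rootfs is read-only, safer to image with dd"
#   is_readonly /mnt/nvme || echo "NVMe is still read/write"
```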
What follows is the easiest thing to do, and does not require you to be there. ssh login is fine so long as you have sudo access. Test that out with something like “sudo ls”. To drop into a root shell you can run “sudo -s”. You’ll want to be in the root shell for what follows.
Go to the NVMe. Be certain it has more than 64 GB remaining space (“df -H -T .” to see that location’s content because “.” is an alias for “the current directory”). Find an empty directory. I suggest you create one like “clone1” for the first clone attempt. This will take significant time, so if you like coffee, I suggest having some handy.
We will call the directory where the clone is destined on the NVMe “/clone1”. More likely it is some other subdirectory, but use your imagination. This is the directory of destination. This will contain a mirror of the existing eMMC without tree corruption, but it will be affected by any errors in the original tree. You won’t need to worry about access to this location having further corruption if you write to it or read. An image can be made from this. The flash content on a host PC can be updated to flash this content if you wish, or to even put edited parts of this onto the flash content of the host PC when flash time comes (meaning the flash will give you back a working system with your software running on it).
A very useful and important note about one rsync option that is a safety: you can use “--dry-run”, and you will see all operations as they would happen if this were a real run. Nothing will be done. If you do this and it looks right, then you can remove “--dry-run” and actually make this happen. You could discover things like filesystem errors getting in the way, but more importantly, you could find it is pulling or placing files in the wrong place, or running into permission issues. I might show a command twice, once with --dry-run, once without.
sudo -s
cd /clone1
- Substitute any “/clone1” with your location.
- Version 1, with --dry-run:
rsync --dry-run -avcrltxAP --info=progress2,stats2 --numeric-ids --exclude '.gvfs' --exclude 'lost+found' '/' /clone1
One important thing to know is that the “-x” option says to not cross filesystems. The mount point of the SD card will be ignored (the mount point would be recorded, the content not copied).
The reason we don’t include the “lost+found” is that this is part of any ext4 partition. However, because your destination location is a subdirectory, and not a mount point, this means your own system won’t have a lost+found/ subdirectory within your subdirectory. You could in fact copy lost+found/ as well. This location is reserved for content which filesystem repair has trimmed and removed…it is the destination of node “pruning”. In the next version I suggest you run this without exclusion of lost+found/ since it could be useful if anything has already been pruned, and won’t harm anything since you are in a subdirectory. This version might allow you to better recover and know what was lost:
- Version 2, with --dry-run:
rsync --dry-run -avcrltxAP --info=progress2,stats2 --numeric-ids --exclude '.gvfs' '/' /clone1
It is important to know that if there is a location that you do not want copied, then you can use “--exclude '/some/where'”, and then dry run to see if it is what you want. rsync is fairly reliable so I don’t expect any significant errors at this point. We exclude .gvfs because it is a pseudo-filesystem and not part of the disk, and although this would not be crossed (it is a filesystem boundary) it will give you a lot of errors since it is security based and won’t allow users to read it.
Here is the final suggested version. I will add logging to this so you have a record of all that happened and all that failed.
- Final version, with logging and no --dry-run:
rsync -avcrltxAP --info=progress2,stats2 --numeric-ids --exclude '.gvfs' '/' /clone1 2>&1 | tee log_rsync.txt
The log will be log_rsync.txt.
If the devices themselves do not error, this should do the job. We can talk about how to use this to create a new system. It is quite difficult to do so on a running system, but there can be compromises (not necessarily acceptable, but maybe acceptable) to use rsync on a repair of that image.
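Once log_rsync.txt exists, it helps to pull out just the failures. The sketch below is an assumption about what typical rsync error lines look like (lines beginning with “rsync: ”, “rsync error: ”, or “file has vanished: ”); the function name is made up. Because the progress display uses carriage returns, the log is split on those first so an error message fused onto a progress line is still found.

```shell
#!/bin/sh
# Sketch only: the function name and the error-line patterns are assumptions
# based on typical rsync output.

# list_rsync_errors LOGFILE
# Print the lines of LOGFILE that look like rsync errors or warnings.
list_rsync_errors() {
    tr '\r' '\n' < "$1" |
        grep -E '^(rsync: |rsync error: |file has vanished: )'
}

# Example:
#   list_rsync_errors log_rsync.txt | less
```

Note that piping through tee hides the exit status of rsync itself, so the log is the only record of whether the run was clean.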
Let us know if you have the files of the existing system on the NVMe, and also if you are able to access a host PC there for flash. Flash can use that image.
Just a reminder, if you are going to end up there in person, then a clone via recovery mode would be a superior method. Even so, it is good to have the rsync backup. Having both gives you an enormous amount of room to rescue whatever remains intact.
We received the unit, which was stopping in the middle of the boot due to a corrupted filesystem and partition table.
The customer had used the fdisk, mount, and umount commands without considering the repercussions, which resulted in corruption of the partition table and more.
We fixed the issue by reflashing from our host PC and installing all the software once again.
You might want to set up an rsync script for backup (which works on a running unit) in the future. The customer could save a lot of time that way. Once corrupted you have to clone and repair the clone. An rsync backup (if complete and preserving numeric IDs and permissions) can be used to create a flash image, or simply to update the sample image to your image.
This time we used the backup and restore script and took a backup of the complete eMMC memory where the RFS resides, so that the next time a unit goes bad we can use the restore command from the same script and bring the unit back in a short time.
Please provide the rsync script setup and the commands to run.
Also, if the unit stops booting in the middle, how do we restore the cloned image from console mode?
I don’t know the specifics for your case until it occurs. First I’ll suggest some information on rsync topics:
- https://forums.developer.nvidia.com/t/topic/285222/17
- https://forums.developer.nvidia.com/t/topic/274197/23
- https://forums.developer.nvidia.com/t/topic/276071/2
- https://forums.developer.nvidia.com/t/topic/275386/46
- https://forums.developer.nvidia.com/t/topic/236295/5
I actually have too many posts on the topic to zero in on one. You can skim those though and find which ones offer command examples. Concentrate first on clone or backup. Do realize that some of the articles are for different models of Jetsons.
Always keep in mind these rsync options:
- --dry-run: This allows you to see what a command would do without really doing it. Makes a lot of testing safe.
- --numeric-ids: This is mandatory for copy of content from one Linux system to another if you want to keep ownership the same. Users and groups are usually referred to by an alias, such as the user name, but the true identifier is the numeric ID. This option allows one to back up and restore the ID, not just an alias.
- --delete-before: If you don’t have a lot of backup space, or if you don’t care about special failure cases (such as loss of power during the backup), then this can reduce the peak amount of disk space during the actual backup. With this (it implies --delete), files at the destination which no longer exist at the source are erased before the copy starts, rather than during it. Note that a modified file is still written under a temp name and, when complete, renamed over the original; that behavior is changed by --inplace, not by --delete-before.
- Many options are just to preserve permissions. There is overlap and it usually doesn’t hurt.
- Jetsons don’t normally use ACLs (Access Control Lists) or extended security attributes (such as SELinux labels), but if you do have them, then the -A option preserves ACLs and the -X option preserves extended attributes. Might not matter if your filesystem is not set up for them.
- --info=progress2,stats2: This is just to see progress. Backups can take a long time, and it adds anxiety if you don’t know what is going on.
- Local or remote accounts over ssh are usually interchangeable. If you use -e ssh, then these examples offer destinations or sources:
  - /home/someone
  - someone@remotehost.com:/home/someone
- If you set up ssh keys, then command line ssh with rsync is trivial effort.
- Sometimes options require root authority, e.g., --numeric-ids at the receiving end, which writes files using a numeric ID. I usually unlock root login over ssh with keys only on Ubuntu since it is much much simpler than playing with the options to do so with sudo. In this case your root user would have a public/private key pair, and the only login is via sudo or via ssh using key pairs (password login is still prohibited and keys can be revoked).
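Putting the options above together, a repeatable backup might be sketched as below. This is not a tested Jetson procedure, only an illustration under assumptions: the function name is made up, the “/mnt/nvme/backup” path in the usage comment is a placeholder, and the exact option set should be checked with --dry-run on a unit you don’t depend on. It mirrors a source tree into a destination directory, keeps numeric IDs and ACLs, stays on one filesystem (-x, so a destination on the NVMe is not copied into itself), and writes a log next to the destination.

```shell
#!/bin/sh
# Sketch only: the function name and example paths are assumptions.

# backup_rootfs SRC DST [extra rsync options...]
# Mirror the contents of SRC into DST and log everything to DST.log.
backup_rootfs() {
    src=${1%/}/            # trailing slash: copy the contents of SRC
    dst=${2%/}
    shift 2
    mkdir -p "$dst" || return 1
    # The log lives beside DST, not inside it, so --delete-before
    # never removes it.
    rsync -aAx --numeric-ids --delete-before \
        --info=progress2,stats2 \
        --exclude '.gvfs' \
        "$@" "$src" "$dst/" > "$dst.log" 2>&1
}

# Example, run as root (the destination path is a placeholder):
#   backup_rootfs / /mnt/nvme/backup --dry-run   # check the log first
#   backup_rootfs / /mnt/nvme/backup             # then do it for real
```

Running it a second time only transfers what changed, which is what makes it practical as a scheduled backup. Add -X to the option list if extended attributes are in use.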
Restoring directions depend on how the backup was created. There isn’t one easy answer. If you mean console mode from a normal login, then that is usually trivial since the system is running (anything in boot stages immediately makes the problem much more difficult). If you mean a rescue mode, then this is not easy. An exception might exist if you’ve customized boot to allow a bash shell with networking and ssh. Otherwise you’re back to flashing instead of direct restore. When flashing there are a number of ways to use backed up content. This is an entire industry on its own, so you’d have to ask when specifics are known.
I highly recommend using a Jetson you don’t depend on to practice backup and restore. Having actually tested what you know is incredibly useful.