Image scraping from Xavier L4T 32.5.1 -- host USB differences

We are trying to copy the APP image from a Xavier to the host so that we can deploy it to production. Copying the APP image works fine with an HP Z440 tower as the host. With a new Dell Precision 5820, however, the copy fails with the message:
USB communication failed.Check if device is in recovery

Other than the host difference, there is no change in hardware or scripting. L4T 32.5.1. We are using a custom carrier board. lsusb shows the Xavier is in recovery mode when the failure occurs.
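
For anyone reproducing this, the lsusb check can be scripted. This is only a minimal sketch: it assumes NVIDIA's USB vendor ID is 0955, the function name is made up for illustration, and the product ID is deliberately not matched because it differs between modules. It only shows that an NVIDIA USB device is enumerated; it does not by itself distinguish recovery mode from a booted Jetson exposing a USB gadget.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if lsusb-style text on stdin lists a device
# with NVIDIA's USB vendor ID (0955). The product ID is not checked.
nvidia_usb_present() {
    grep -qiE 'ID 0955:[0-9a-f]{4}'
}

# Usage (commented out so sourcing this file has no side effects):
# if lsusb | nvidia_usb_present; then echo "NVIDIA USB device found"; fi
```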

grep indicates this failure message is thrown from an NVIDIA binary:

$ grep -EHnR --exclude-dir=rootfs "USB communication failed" ./
Binary file ./bootloader/tegrarcm_v2 matches

Any suggestions on how to address this failure such that we can use a modern host PC to copy a heavily customized APP image?

Shell command:
sudo ./flash.sh -r -k APP -G backup.img jetson-xavier mmcblk0p1

Here is the flash.sh console output:
flash_sh_dell_precision_5820_fail.txt (37.3 KB)

It might truly be a USB issue, and what @andrewturner8118 mentions would be the right place to start.

I do see something interesting in the log though at the end:

*** The [APP] has been read successfully. ***
	Converting RAW image to Sparse image... 

Was this from the failed clone? Beware that a clone uses a lot of disk space. Every clone produces two files: the raw clone (exactly the size of the rootfs partition) and the sparse clone (roughly the size of the content on the partition). As the partition fills up, the sparse size approaches the raw size, so you can need up to twice the partition size in free disk space on the host PC. That last message appears because the raw image (full partition size) is cloned first, and a sparse version is then created from the raw clone. I don’t see any message indicating that the sparse clone ever completed. Maybe there wouldn’t be another message; I’m not sure. Perhaps that log was from a successful case and not the failure case. If more messages should follow and the log ends right at that point because of a failure, though, the likely cause is the host PC running out of disk space.

You can find available space on all partitions via “df -H -T” (you don’t need the “-T”, but I use it anyway because sometimes the wrong filesystem type can be a problem). You can find the space at a particular location if you name that location; for example, to see the space available in the parent directory of the default clone locations:
df -H -T ~/nvidia/nvidia_sdk

If you did not run out of disk space, then it is indeed a USB issue. You need to verify disk space though before you move on to USB. USB issues can be a pain to track since it might be a signal quality problem on otherwise perfectly functioning hardware (signal quality changes with combinations of hardware and cables).
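
The "up to twice the partition size" rule of thumb above can be turned into a quick pre-flight check. This is a sketch only: the function name is hypothetical, it assumes GNU df (for --output and -B1), and the partition size has to be supplied by you.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the filesystem holding DIR has at least
# twice PARTITION_BYTES available (raw clone plus a worst-case sparse clone).
# Assumes GNU df, as shipped with Ubuntu.
has_room_for_clone() {
    local dir=$1 partition_bytes=$2 avail
    # Last line of df output is the available byte count for DIR's filesystem.
    avail=$(df --output=avail -B1 "$dir" 2>/dev/null | tail -n 1 | tr -d ' ')
    [ -n "$avail" ] || return 2   # df failed (for example, DIR does not exist)
    [ "$avail" -ge $((2 * partition_bytes)) ]
}

# Usage, with an illustrative 32 GiB partition size (not taken from this thread):
# has_room_for_clone ~/nvidia/nvidia_sdk $((32 * 1024**3)) || echo "not enough space"
```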

@linuxdev

Yes, the log is from the failed cloning attempt.

USB on both the DELL and HP machines works fine with numerous other USB devices. We have multiple brand-new DELL Precision 5820 towers running into this image scraping problem, so it isn’t isolated to a single DELL tower. We have multiple “rigs” set up for development, and each DELL Precision 5820 tower hits this issue.

Disk space isn’t a concern or an issue: these machines have TBs of free disk space, large swap space, and >= 64 GB of RAM. We understand the disk (and server) space needed to capture an image (and the length of time needed to deploy it).

We keep EVERYTHING the same, swap in the HP Z440, and image scraping works, every single time.

We keep EVERYTHING the same, swap in the DELL Precision 5820, and image scraping fails, every single time.

It’s not a cabling issue. It’s not a BIOS issue.

Both the DELL and HP are running the same Ubuntu distro and release – Ubuntu 20.04.

Our scripts are kept in git source control – we have deployed IDENTICAL scripts to both machines.

The USB failure is reported by an NVIDIA binary, which is closed source, so we cannot fix it ourselves. The trail leads to the NVIDIA binary “tegrarcm_v2”.

The log message about the APP being read successfully is a red herring.

We can put the Xaviers into recovery and flash them all day long with the DELL Precision 5820. This confirms USB (and cabling) is fine. The problem is only encountered when reading the APP partition back to generate an image, which we need for deployment.

Unfortunately, we don’t have a USB analyzer in the shop to get a trace when the failure happens.

If I find time, I’ll try and run “tegrarcm_v2” through ghidra to decompile it. Maybe…

It could be a USB autosuspend issue.

Make sure /sys/module/usbcore/parameters/autosuspend contains a -1 value on the host machine.

@Kangalow

Both machines are using 3.1 (blue) USB ports and have the same default kernel value for autosuspend:

$ cat /sys/module/usbcore/parameters/autosuspend 
2

As “lsusb” reports the Xavier in recovery immediately when the script fails, it seems very unlikely that this is a USB autosuspend (power save) issue. I can run an experiment to try this, but that won’t be until next week.

It is still possibly autosuspend. You can run “echo -1 > /sys/module/usbcore/parameters/autosuspend” as root to temporarily disable this.
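
A small sketch of checking and applying that setting from a script. The function name is hypothetical, and the optional path argument exists only so the check can be tried against a scratch file instead of sysfs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if USB autosuspend is disabled.
# The optional argument overrides the sysfs path (handy for testing).
autosuspend_disabled() {
    local param=${1:-/sys/module/usbcore/parameters/autosuspend} value
    value=$(cat "$param" 2>/dev/null) || return 2
    # The thread uses -1; a negative delay means "never autosuspend".
    [ "$value" -lt 0 ]
}

# Usage sketch (needs root; the setting is lost at reboot):
# autosuspend_disabled || echo -1 | sudo tee /sys/module/usbcore/parameters/autosuspend
```

As far as I know, this module parameter only sets the default for USB devices detected after the change, so it is probably safest to put the Jetson into recovery mode (or re-plug the cable) after writing it. To keep the setting across reboots, one option is the kernel command-line parameter usbcore.autosuspend=-1.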

As far as the raw and sparse image files go: after the failure, from the “Linux_for_Tegra/bootloader/” directory, what exact sizes do you see from:

ls -l system.img*

(there should be “system.img” and “system.img.raw”)

After the failure, there are no system image files. I have plenty of disk space on this DELL Precision tower.

x@5:~/xxx/32.5.1/Linux_for_Tegra$ ls -l system.img*
ls: cannot access ‘system.img*’: No such file or directory

x@5:~/xxx/32.5.1/Linux_for_Tegra$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p3 3.6T 2.5T 980G 72% /

@linuxdev

Many thanks. We have a BINGO. Disabling USB autosuspend is a fix. Below is the proof.

Maybe the newer DELL Precision USB hardware responds faster (or with more accuracy) to detect USB state changes?

I’ll look at a “smaller hammer” approach to see if I can implement a udev rule and deploy it to these machines. I should be able to toggle power control from “auto” to “on” when the NVIDIA device comes up in recovery mode.
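
One possible shape for such a udev rule, as an untested sketch rather than what was actually deployed. It assumes NVIDIA's USB vendor ID 0955 and matches every NVIDIA USB device, not only a Xavier in recovery mode. The function name and the rule file name are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a udev rule that sets power/control to "on"
# (no runtime autosuspend) for USB devices with NVIDIA's vendor ID 0955.
# The product ID is left unmatched on purpose, since it varies by module.
write_nvidia_power_rule() {
    local rule_file=$1
    cat > "$rule_file" <<'EOF'
ACTION=="add", SUBSYSTEM=="usb", ATTR{idVendor}=="0955", TEST=="power/control", ATTR{power/control}="on"
EOF
}

# Usage sketch (as root); the file name is illustrative:
# write_nvidia_power_rule /etc/udev/rules.d/99-nvidia-usb-power.rules
# udevadm control --reload-rules
```

Unlike the global autosuspend parameter, a rule like this leaves power saving enabled for every other USB device on the host.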

> x@:~/xxx/32.5.1/Linux_for_Tegra$ sudo su
> root@5:/xxx/32.5.1/Linux_for_Tegra# cat /sys/module/usbcore/parameters/autosuspend
> 2
> root@5:/xxx/32.5.1/Linux_for_Tegra# echo -1 > /sys/module/usbcore/parameters/autosuspend
> root@5:/xxx/32.5.1/Linux_for_Tegra# cat /sys/module/usbcore/parameters/autosuspend
> -1
> x@g5:~/xxx/32.5.1/Linux_for_Tegra$ sudo ./flash.sh -r -k APP -G backup.img jetson-xavier mmcblk0p1
> ###############################################################################
> # L4T BSP Information:
> # R32 , REVISION: 5.1
> ###############################################################################
> # Target Board Information:
> # Name: jetson-xavier, Board Family: t186ref, SoC: Tegra 194, 
> # OpMode: production, Boot Authentication: NS, 
> # Disk encryption: disabled ,
> ###############################################################################
> 
> // snip
> 
> [   9.2254 ] tegrarcm_v2 --boot recovery
> [   9.2275 ] Applet version 01.00.0000
> [   9.7908 ] 
> [  10.7945 ] tegrarcm_v2 --isapplet
> [  11.3540 ] 
> [  11.3562 ] tegrarcm_v2 --ismb2
> [  11.9181 ] 
> [  11.9200 ] tegradevflash_v2 --iscpubl
> [  11.9221 ] Bootloader version 01.00.0000
> [  12.2022 ] Bootloader version 01.00.0000
> [  12.2033 ] 
> [  13.2069 ] tegrarcm_v2 --isapplet
> [  13.7700 ] 
> [  13.7721 ] tegrarcm_v2 --ismb2
> [  14.3341 ] 
> [  14.3361 ] tegradevflash_v2 --iscpubl
> [  14.3382 ] Bootloader version 01.00.0000
> [  14.6182 ] Bootloader version 01.00.0000
> [  14.6193 ] 
> [  14.6194 ] Reading partition
> [  14.6215 ] tegradevflash_v2 --read APP /xxx/32.5.1/Linux_for_Tegra/backup.img
> [  14.6236 ] Bootloader version 01.00.0000
> [  14.8982 ] [................................................] 100%
> [ 944.1065 ] 
> *** The [APP] has been read successfully. ***
> 	Converting RAW image to Sparse image...
> x@5:~/xxx/32.5.1/Linux_for_Tegra$ ls -lth *img
> -rwxr-xr-x 1 root root 19G Nov 27 09:07 backup.img

It is possible for one system to respond to USB events faster than another, and the sleep timeout could certainly differ. I don’t know if that is the case here. USB handling is somewhat inconsistent across operating systems, and across hardware on any single OS. Glad it worked out so easily though!