Hello There,
I am currently trying to scale up the flashing process to collect more data about the compute modules. During the mass flash process I ran into flashing issues using the initrd massflash in JetPack 5.1.2.
Problem Isolation:
I am currently connecting 8-10 Jetson Xavier NX modules in recovery mode.
To narrow down the core issue, I made sure that…
- the massflash image is generated for at least 10 Jetson devices
- my power supplies can handle 10 Xavier NX modules booting simultaneously (I have tested this with 16 booting Xavier NX modules)
- each Jetson has its own USB host controller for the flashing process
usb3 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:03:00.0) hub
3-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb4 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:03:00.0) hub
usb5 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:04:00.0) hub
5-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb6 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:04:00.0) hub
usb7 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:05:00.0) hub
7-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb8 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:05:00.0) hub
usb9 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:06:00.0) hub
9-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb10 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:06:00.0) hub
usb11 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:09:00.0) hub
11-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb12 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:09:00.0) hub
usb13 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:0a:00.0) hub
13-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb14 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:0a:00.0) hub
usb15 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:0b:00.0) hub
15-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb16 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:0b:00.0) hub
usb17 1d6b:0002 09 1IF [USB 2.00, 480 Mbps, 0mA] (xhci-hcd 0000:0c:00.0) hub
17-1 0955:7e19 00 1IF [USB 2.00, 480 Mbps, 32mA] (NVIDIA Corp. APX)
usb18 1d6b:0003 09 1IF [USB 3.00, 5000 Mbps, 0mA] (xhci-hcd 0000:0c:00.0) hub
- my host system can handle that many USB devices (I only noticed a CPU load of 100% on each core for about 1-2 seconds at the start and end of the massflash, due to USB mount and unmount operations)
- I start the process with the highest I/O priority, as stated in “README_initrd_flash.txt”
→ sudo ionice -c 1 -n 0 ./tools/kernel_flash/l4t_initrd_flash.sh --flash-only --network usb0 --massflash 10
- my issue also appears on a different host system (48-core workstation with about 64 GB of RAM), tested on Ubuntu 18.04 & 20.04
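For anyone trying to reproduce this, a pre-flight check along the following lines can confirm that all modules are actually enumerated in recovery mode before the flash starts. This is only a rough sketch: the function name `count_recovery_devices` and the `expected` variable are made up for illustration, and the ID `0955:7e19` is simply the one shown in my USB listing above.

```shell
#!/usr/bin/env bash
# Count Jetson modules in recovery mode from lsusb-style text on stdin.
# 0955:7e19 is the vendor:product ID seen in the USB listing above;
# other module variants may report a different product ID.
count_recovery_devices() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when nothing matches (grep would return 1).
    grep -c -i '0955:7e19' || true
}

# Hypothetical usage (commented out so the file can be sourced safely):
# expected=10   # should match the number passed to --massflash
# found=$(lsusb | count_recovery_devices)
# if [ "$found" -ne "$expected" ]; then
#     echo "only $found of $expected devices are in recovery mode" >&2
# fi
```

If the count is already short at this point, the problem is enumeration rather than the flash itself.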
In order to get more information than “Flashing Failed”, I modified the initrd massflash script to collect more data. The issues below are also reproducible with the unmodified initrd massflash script.
The errors seem to be related to storage devices becoming unavailable. My current assumption is that one of the flashing processes is unmounting a Jetson that is being handled by another flashing process.
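One way to check this assumption would be to record when block device nodes disappear on the host while the massflash is running, and compare those timestamps with the failing flash logs. The snippet below is a sketch of that idea, not something taken from the NVIDIA scripts: `filter_block_removals` is a made-up helper, and it assumes the usual `udevadm monitor` line layout of `KERNEL[timestamp] action devpath (subsystem)`.

```shell
#!/usr/bin/env bash
# Keep only "remove" events for block devices from udevadm monitor output
# read on stdin, printing the event timestamp tag and the device path.
# Assumed input layout: KERNEL[12345.678] remove /devices/... (block)
filter_block_removals() {
    awk '$2 == "remove" && $NF == "(block)" { print $1, $3 }'
}

# Hypothetical usage in a second terminal while the massflash is running
# (commented out so the file can be sourced safely):
# sudo udevadm monitor --kernel --subsystem-match=block | filter_block_removals
```

If a partition such as sdc12 is removed shortly before its flashing process reports "cannot open", that would support the idea that something else detached the device.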
Failed flashing logs:
The following logs were collected on an Ubuntu 18.04 system (4-core CPU & 8 GB of RAM)
336+0 records in
336+0 records out
336 bytes copied, 0,0125908 s, 26,7 kB/s
Writing bpmp-fw-dtb_b partition done
writing item=55, 1:3:kernel_b, 15142551552, 67108864, boot.img, 43569152, fixed-<reserved>-12, a84d1802f9dc10564ecff0d3fcb06082c815142e
Writing kernel_b partition with boot.img
Get size of partition through connection.
blockdev: cannot open /dev/sdc12: No such file or directory
[ 1825]: l4t_flash_from_kernel: Get size of partition failed
[ 1825]: l4t_flash_from_kernel: Error flashing emmc
Error flashing non-qspi storage
Cleaning up...
452+0 records in
452+0 records out
452 bytes copied, 0,00270544 s, 167 kB/s
Writing recovery-dtb partition done
writing item=60, 1:3:RECROOTFS, 15328608256, 104857600, , , fixed-<reserved>-17,
[ 703]: l4t_flash_from_kernel: Warning: skip writing RECROOTFS partition as no image is specified
writing item=61, 1:3:esp, 15433465856, 67108864, esp.img, 67108864, fixed-<reserved>-18, 81add5846db4c52f28a11ba16df00871a06b70c2
Writing esp partition with esp.img
Get size of partition through connection.
blockdev: cannot open /dev/sdx18: No such file or directory
[ 703]: l4t_flash_from_kernel: Get size of partition failed
[ 703]: l4t_flash_from_kernel: Error flashing emmc
Error flashing non-qspi storage
Cleaning up...
336+0 records in
336+0 records out
336 bytes copied, 0,00240753 s, 140 kB/s
Writing bpmp-fw-dtb_b partition done
writing item=55, 1:3:kernel_b, 15142551552, 67108864, boot.img, 43569152, fixed-<reserved>-12, a84d1802f9dc10564ecff0d3fcb06082c815142e
Writing kernel_b partition with boot.img
Get size of partition through connection.
blockdev: cannot open /dev/sdb12: No such file or directory
[ 855]: l4t_flash_from_kernel: Get size of partition failed
[ 855]: l4t_flash_from_kernel: Error flashing emmc
Error flashing non-qspi storage
Cleaning up...
Formatting APP parition done
Formatting APP partition /dev/sdf1 ...
tar --xattrs -xpf /opt/tobias/framework/assets/initrd_massflash/current_image/tools/kernel_flash/images/internal/system.img --checkpoint=10000 --warning=no-timestamp --numeric-owner -C /tmp/ci-bmRAhaKgLy
tar: Read checkpoint 10000
tar: Read checkpoint 20000
tar: Read checkpoint 30000
tar: Read checkpoint 40000
tar: ./usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc: Cannot write: Read-only file system
tar: ./usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc: Cannot utime: Read-only file system
tar: ./usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc: Cannot change ownership to uid 0, gid 0: Read-only
[a lot of errors of the same kind...]
tar: ./usr/share/perl/5.30.0/CPAN/Meta/YAML.pm: Cannot open: Input/output error
tar: ./usr/share/perl/5.30.0/CPAN/Meta/History: Cannot mkdir: Input/output error
tar: ./usr/share/perl/5.30.0/CPAN/Meta/History/Meta_1_3.pod: Cannot open: Input/output error
tar: ./usr/share/perl/5.30.0/CPAN/Meta/History/Meta_1_2.pod: Cannot open: Input/output error
tar: ./usr/share/perl/5.30.0/CPAN/Meta/History/Meta_1_0.pod: Cannot open: Input/output error
Cleaning up...
The failure rate is about 15-20% of the devices I connect. Which USB port is affected also seems rather random.
My actual questions:
- Has anyone experienced similar issues with the initrd massflash, or is this even a known issue with the massflash script? If yes, how did you solve it?
- How many devices have been verified to work reliably (I assumed the limit was 10, because the default is limited to 10)?
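To put numbers on this across several runs, the per-device logs can be scanned for the "blockdev: cannot open" message seen in the logs above and the missing device nodes tallied. This is a minimal sketch: `summarize_missing_nodes` is a made-up helper name, and the log path in the usage comment is a placeholder for wherever the per-device logs are kept.

```shell
#!/usr/bin/env bash
# Tally the block device nodes that blockdev failed to open.
# Reads the given log files, or stdin when no files are passed, and
# prints one "<device node> <count>" line per distinct node.
summarize_missing_nodes() {
    grep -h -o 'blockdev: cannot open /dev/[^:]*' "$@" \
        | sed 's/^blockdev: cannot open //' \
        | sort | uniq -c \
        | awk '{ print $2 " " $1 }'
}

# Hypothetical usage (placeholder path, commented out so the file can be sourced):
# summarize_missing_nodes /path/to/flash_logs/*.log
```

Because host device names like sdb or sdc are assigned in enumeration order, the tally is mostly useful for seeing which partition numbers fail, not which physical port was involved.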