Well, after some 12 hours of troubleshooting, experiments, and tests, I have a functioning workaround, but not quite a solution, to this issue.
For others who might encounter the same issue: the SSD in question is a 512 GB Acer FA100 (PCIe Gen3 x4, NVMe 1.4). The Orin Nano here is a -003 variant (no SD card slot) mounted on a Waveshare carrier board, but as I mentioned, the same issue also showed up on an “official” Orin Nano Super Dev Kit a few days ago. I believe the root problem is the SSD itself not playing nice.
In short, some kind of NVMe checks and tests fail during startup, and by default each of these checks has a timeout of 20 seconds. As a result, it can take up to 300 seconds (5 minutes) after power-on before the system successfully loads /dev/nvme0n1p1.
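To see which timeouts the running kernel is actually using, the nvme_core module parameters can be read from sysfs. This is a minimal sketch, assuming nvme_core is loaded or built in so that /sys/module/nvme_core/parameters exists; the function name show_nvme_params and its optional directory argument are my own and exist only for illustration.

```shell
#!/bin/sh
# Print the nvme_core timeout parameters the running kernel is using.
# The optional argument overrides the sysfs directory (an illustrative
# convenience, handy for trying the function against a scratch directory).
show_nvme_params() {
    dir="${1:-/sys/module/nvme_core/parameters}"
    for name in io_timeout admin_timeout default_ps_max_latency_us; do
        if [ -r "${dir}/${name}" ]; then
            printf '%s=%s\n' "${name}" "$(cat "${dir}/${name}")"
        else
            # Parameter file missing: module not loaded or kernel too old.
            printf '%s=unavailable\n' "${name}"
        fi
    done
}
```

Calling show_nvme_params with no argument on the Orin, followed by something like dmesg | grep -i nvme, should show both the effective timeouts and the failing checks in the kernel log.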
The problem is that the built-in initrd → init startup script has a hard-coded retry loop: 50 retries at 200 ms intervals = 10 seconds. After that, the system automatically bails out and reboots.
These numbers are not configurable via extlinux.conf, and from testing I observed that the init script on the Orin also ignores parameters such as rootwait, rootdelay, and panic=0.
That being said, the complex workaround requires two steps:
- Add the following APPEND arguments to /boot/extlinux/extlinux.conf:
nvme_core.io_timeout=2 nvme_core.admin_timeout=2 nvme_core.default_ps_max_latency_us=0
- Modify the hard-coded values in /boot/initrd → init (extract the initrd, edit it off-device, then reflash):
    _mount_root () {
        local dev="${1}"
        local mnt="${2}"
        local retry="${3}"
        local readonly="${4}"
        local mounted=0
        local count=0
        # Retry count raised from the stock 50 to 60.
        while [ ${count} -lt 60 ]; do
            # Retry interval raised from the stock 0.2 s to 0.5 s.
            sleep 0.5;
            count="$(expr ${count} + 1)"
            if [ "${readonly}" -eq 1 ]; then
                mount -r "${dev}" "${mnt}"
            else
                mount "${dev}" "${mnt}"
            fi
            # ... (rest of the function unchanged)
Note: for some reason the initrd → init script cannot be modified on-box; in my experience it has to be edited externally and reflashed.
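For anyone wanting to script the two steps, here is a rough sketch of how they could be done on a host machine against copies of the files. It assumes the initrd is a gzip-compressed cpio archive and that GNU or BSD sed, cpio, and gzip are available; the helper names (add_append_args, unpack_initrd, repack_initrd) are hypothetical and not part of any NVIDIA tooling. The edit to init itself is left as a manual step between unpacking and repacking.

```shell
#!/bin/sh
# Hypothetical helpers; run them on the host against copies of the files.

NVME_ARGS="nvme_core.io_timeout=2 nvme_core.admin_timeout=2 nvme_core.default_ps_max_latency_us=0"

# Append the NVMe arguments to every APPEND line of an extlinux.conf copy.
# Lines that already contain nvme_core.io_timeout= are left alone, so the
# function can be run more than once. A .bak backup is kept next to the file.
add_append_args() {
    conf="$1"
    sed -i.bak -E "/nvme_core\.io_timeout=/! s|^([[:space:]]*APPEND[[:space:]].*)$|\1 ${NVME_ARGS}|" "$conf"
}

# Unpack an initrd (assumed: gzip-compressed cpio) into a working directory.
unpack_initrd() {
    image="$1"; workdir="$2"
    mkdir -p "$workdir"
    gunzip -c "$image" | ( cd "$workdir" && cpio -idm )
}

# Repack the working directory into a new initrd image.
# Run as root if file ownership and device nodes need to be preserved.
repack_initrd() {
    workdir="$1"; image="$2"
    ( cd "$workdir" && find . | cpio -H newc -o | gzip -9 -n ) > "$image"
}
```

A typical sequence would be: unpack_initrd on a copy of the initrd, edit the init file in the working directory by hand, then repack_initrd and reflash. How the repacked image gets onto the device depends on the flashing setup, so that part is not covered here.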
The first change shortens the failing startup checks from 20 seconds to 2, so that the NVMe disk is mounted within ~15 seconds; the second raises the boot retry loop to 60 attempts at 500 ms intervals = 30 seconds.
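The retry arithmetic is easy to sanity-check before reflashing. A small sketch (the function name retry_window_ms is made up for illustration):

```shell
#!/bin/sh
# Total sleep time, in milliseconds, of a fixed-interval retry loop.
retry_window_ms() {
    retries="$1"
    interval_ms="$2"
    echo $(( retries * interval_ms ))
}

echo "stock:   $(retry_window_ms 50 200) ms"   # 50 retries x 200 ms = 10000 ms
echo "patched: $(retry_window_ms 60 500) ms"   # 60 retries x 500 ms = 30000 ms
```

This only counts the sleep time; each mount attempt adds a little on top, so the real window is slightly longer. With the disk showing up after ~15 seconds, the 30-second window leaves roughly a 2x margin.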
In this way, the boot finally succeeds. It is definitely a hack, and I know there may be unintended side-effects of changing the NVMe timeout settings.
Outstanding questions:
- What is the official stance here? Are there certain brands of SSDs which are “supported” and known to work, and is Acer not on that list? What is the root cause of these timeout errors? How should users be picking their peripherals, and are there certain specs we should be looking for?
- Why does the Orin have hard-coded, non-configurable startup timeouts? After three failures, what is supposed to happen? From observation, the devices effectively brick themselves in this case (unless a keyboard and monitor are attached to reset the BIOS).