Dear Nvidia Team
We are facing problems with NVMe SSDs on our custom carrier boards. The M.2 Key M interface is identical to the reference design. On some systems we get the following two errors:
[ 1.983689] nvme nvme0: pci function 0000:01:00.0
[ 1.983890] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 62.167901] nvme nvme0: I/O 15 QID 0 timeout, disable controller
[ 62.816003] nvme nvme0: Identify Controller failed (-4)
[ 62.816037] nvme nvme0: Removing after probe failure status: -5
[ 199.557910] nvme nvme0: pci function 0000:01:00.0
[ 327.640232] nvme nvme0: Device not ready; aborting initialisation
[ 327.640272] nvme nvme0: Removing after probe failure status: -19
We already tried adding the kernel parameters "pcie_aspm=off" and "pci=nomsi", without success: the SSD is still sometimes not detected after a reboot or a power-off. Changing the boot device order in the bootloader did not fully help either. Sometimes the following workaround brings the NVMe back:
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
sleep 1
echo 1 > /sys/bus/pci/rescan
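For reference, the remove/rescan workaround above could be wrapped in a small retry helper along these lines. This is only a minimal sketch, not something we have validated: the names SYSFS_ROOT, BDF, MAX_TRIES, SETTLE_SECS and the three functions are hypothetical, and it assumes the SSD is the only NVMe device in the system, so that any nvme* entry under /sys/block means the drive came up.

```shell
#!/bin/sh
# Sketch only: retry the PCI remove/rescan workaround until an NVMe
# block device appears. All variable and function names are hypothetical.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"      # overridable, e.g. for dry runs
BDF="${BDF:-0000:01:00.0}"            # PCI address taken from the log above
MAX_TRIES="${MAX_TRIES:-3}"           # number of remove/rescan attempts
SETTLE_SECS="${SETTLE_SECS:-5}"       # wait after each rescan

# Succeeds if any NVMe namespace block device is registered.
# Assumption: the SSD in question is the only NVMe drive.
nvme_present() {
    for d in "$SYSFS_ROOT"/block/nvme*; do
        [ -e "$d" ] && return 0
    done
    return 1
}

# One pass of the workaround: remove the PCI function, then rescan the bus.
rescan_once() {
    dev="$SYSFS_ROOT/bus/pci/devices/$BDF"
    # The device entry may be missing entirely, so only remove it if present.
    if [ -e "$dev/remove" ]; then
        echo 1 > "$dev/remove"
    fi
    sleep 1
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}

# Returns 0 as soon as the drive is visible, non-zero if it is still
# missing after MAX_TRIES attempts.
recover_nvme() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        nvme_present && return 0
        rescan_once
        sleep "$SETTLE_SECS"
        tries=$((tries + 1))
    done
    nvme_present
}
```

The functions would have to be sourced and `recover_nvme` called as root, for example from a boot-time script. The return code could then be logged to see how often the drive stays missing.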
On one system we moved the entire rootfs to the NVMe drive, and after that it was always detected.
So far we have not seen this behavior on the devkit, and it does not occur on every custom carrier board either. The SSD is the following model from Apacer:
https://industrial.apacer.com/upfiles/ADUpload/allshare/PV310-M280_EDM_20210108.pdf
Do you have any other ideas about what we could try? As there are similar topics in the forum, it seems to be a problem with certain NVMe SSDs. We would also like to mention that we see similar errors with the Xavier NX on a custom carrier.
Thank you for your help.
Best regards