We have a problem with the TX2 NX module and PCIe devices.
An NVMe SSD (M.2 Key M) is connected to the x2 port of the TX2 NX. On the other, x1 PCIe port, three PCIe switches are cascaded, with additional PCIe devices connected to them. We now sometimes see after a reboot of the system that the NVMe drive is not initialized correctly and the device is lost with the following messages:
nvme nvme0: Minimum device page size 1324217728 too large for host (4096)
nvme nvme0: Removing after probe failure status: -19
Additionally, we see an error message from the pcie-controller:
When we use the exact same setup with a Xavier NX module, we never see this error, so we think it is something specific to the TX2 NX module.
Is there a limit on the number of devices that can be attached to the PCIe ports? When we reduce the number of devices attached to the x1 port (connect only the first PCIe switch), the NVMe is always initialized without an error.
Let us know if you need any additional information.
Thank you.
I can’t answer, but I will add that power delivery is often an issue. If you have a way to power any or all devices externally, then try that and see if anything changes. Another common issue is signal quality and/or signal timing. Take a look at any power delivery changes you might be able to make which (at least for testing) might improve power isolation and regulation among the devices.
Thank you for your answer.
We do not think power is an issue, as all connected devices are powered externally. We think it has to do with the memory allocation for the PCIe devices, similar to the following topic:
I believe the PCIe link is not stable; we are seeing an outrageously high page size request, maybe due to reading an invalid (all 1s) value from the NVMe BAR.
Please share complete UART/dmesg logs.
Please provide “sudo lspci -vvv” output for both the working and the non-working case.
In the non-working case, read the BAR address using busybox devmem and check whether you are getting valid data.
Note: Since the NVMe driver is bailing out, you might need to explicitly set bus master & memory space in the EP config space using setpci before accessing the BAR.
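A minimal sketch of that check, assuming the NVMe endpoint enumerates as 0001:01:00.0 and BAR0 sits at 0x40000000 (both are placeholders; take the real values from your own lspci output):
# Placeholder endpoint address; use the bus:dev.fn shown by lspci on your system.
EP=0001:01:00.0
# Set memory space enable + bus master (bits 1 and 2) in the endpoint command register.
sudo setpci -s "$EP" COMMAND=0x06
# Note the "Region 0" (BAR0) base address.
sudo lspci -s "$EP" -vv | grep -i region
# Read the first dword of BAR0; 0xFFFFFFFF means the BAR is returning all 1s.
sudo busybox devmem 0x40000000 32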
The weird thing is that when we program the EEPROM of the PCIe switch PI7C9X2G404SV with its configuration, the NVMe is no longer recognized and we get the error message from our first post. When we erase the EEPROM (PCIe switch default configuration), the NVMe is always recognized. As the NVMe is not behind the PCIe switch and the two are connected to different PCIe ports of the module, we would expect them to have no influence on each other. Do you have an idea why there is a correlation?
How can we check why the big page size is read from the NVMe BAR?
Apply the attached patch. If the NVMe still doesn’t come up, execute the below command in a root shell.
echo 1 > /sys/bus/pci/rescan
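If the rescan works, the device should re-enumerate and probe again; a quick way to verify (no board-specific values here):
dmesg | tail -n 30                 # look for fresh nvme probe messages
lspci | grep -i non-volatile       # NVMe controllers show up with this class string
ls /dev/nvme*                      # the block device reappears only if the probe succeeded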
Maybe some delays are causing the issue? Try these commands in a root shell and see whether the full hierarchy comes up or not. List the devices bound to the tegra-pcie driver with ls, then echo that controller name into unbind and bind:
cd /sys/bus/platform/drivers/tegra-pcie
ls
echo <controller-name> > unbind
echo <controller-name> > bind
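For reference, a sketch of the full sequence; the controller name 10003000.pcie is only an assumed example, use whatever name the ls above actually prints on your unit:
cd /sys/bus/platform/drivers/tegra-pcie
ls                              # prints the bound controller name, e.g. 10003000.pcie (assumed)
echo 10003000.pcie > unbind     # detach the tegra-pcie driver
echo 10003000.pcie > bind       # re-attach and re-enumerate the whole PCIe hierarchy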
How can we check why the big page size is read from the NVMe BAR?
You have to go through the NVMe spec for these details.
Attachment: rel-32_t210_t186.diff (762 Bytes)
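For orientation: the value in that error message is derived from the MPSMIN field (bits 51:48) of the controller's CAP register at offset 0x0 of BAR0, as minimum page size = 2^(12 + MPSMIN), so a corrupted or all-1s BAR read inflates it. A rough sketch of decoding MPSMIN with devmem, assuming a BAR0 base of 0x40000000 (placeholder, take yours from lspci):
# Placeholder BAR0 base; use the "Region 0" address from lspci -vv.
BAR0=0x40000000
# CAP is a 64-bit register at BAR0 offset 0x0; MPSMIN is bits 19:16 of the upper dword.
CAP_HI=$(sudo busybox devmem $((BAR0 + 0x4)) 32)
MPSMIN=$(( (CAP_HI >> 16) & 0xF ))
echo "MPSMIN=$MPSMIN -> minimum page size $(( 1 << (12 + MPSMIN) )) bytes"
# CAP_HI of 0xFFFFFFFF means the BAR is returning all 1s, i.e. the read itself is invalid.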