TX2 NX - NVMe drive sometimes fails when PCIe devices are connected over a PCIe switch

Dear Nvidia

We have a problem with the TX2 NX module and PCIe devices.
An NVMe SSD is connected to the x2 port of the TX2 NX (M.2 Key M). On the other PCIe port (x1), three PCIe switches are daisy-chained, with additional PCIe devices connected to them. We now sometimes see, after a reboot of the system, that the NVMe drive is not correctly initialized and the device is lost with the following messages:

nvme nvme0: Minimum device page size 1324217728 too large for host (4096)
nvme nvme0: Removing after probe failure status: -19

Additionally, we see an error message from the PCIe controller:

tegra-pcie 10003000.pcie-controller: PCIE: Response decoding error, signature: 11120005

When we use the exact same setup with a Xavier NX module, we never see this error; we therefore think it is something specific to the TX2 NX module.

Is there a limit to the number of devices that can be attached to the PCIe ports? When we reduce the number of devices attached to the x1 port (connecting only the first PCIe switch), the NVMe is always initialized without an error.
Let us know if you need any additional information.
Thank you.

Best regards

I can’t answer, but I will add that power delivery is often an issue. If you have a way to power any or all devices externally, try that and see if anything changes. Another common issue is signal quality and/or signal timing. Look at any power delivery changes you might be able to make that (at least for testing) might improve power isolation and regulation among the devices.

Hi linuxdev

Thank you for your answer.
We do not think power is an issue, as all connected devices are powered externally. We think it has to do with the memory allocation for the PCIe devices, similar to the following topic:

Thank you.

Hi Nvidia Team

Can you help us here? Can we increase the allocated memory for PCIe devices?
Thank you.

Kind regards

Hi Nvidia Team

We have not heard anything from you yet and need a solution as soon as possible. Could you please support us?
Thank you.

Hi,

I believe the PCIe link is not stable. The outrageously high page size request is likely the result of reading an invalid (all 1s) value from the NVMe BAR.
Please share complete UART/dmesg logs.
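
To illustrate the suspicion: the driver derives the minimum device page size as 2^(12 + CAP.MPSMIN), where CAP is the 64-bit capability register at offset 0x0 of the NVMe BAR and MPSMIN is bits 51:48. A rough shell sketch of what an all-1s read would produce:

    # If CAP reads back as all 1s (dead link), MPSMIN comes out as 0xf
    mpsmin=$(( (0xFFFFFFFFFFFFFFFF >> 48) & 0xf ))
    # Minimum device page size = 2^(12 + MPSMIN) = 134217728 bytes,
    # far above the 4096 bytes the host supports
    echo $(( 1 << (12 + mpsmin) ))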

Thanks,
Manikanta

Hi Manikanta

Thank you for your answer.
Here are the requested log files:
dmesg_nvme_nok.txt (74.8 KB)
uart_nvme_nok.txt (21.9 KB)

Let us know if you need further information.
Thank you.

Hi,

Please provide “sudo lspci -vvv” output for both the working and the non-working case.
In the non-working case, read the BAR address using busybox devmem and check whether you get valid data.

Note: since the NVMe driver is bailing out, you might need to explicitly set the bus master and memory space bits in the EP config space using setpci before accessing the BAR.

Thanks,
Manikanta

Hi

Here is the output of lspci -vvv:
nvme_not_working.txt (75.9 KB)
nvme_working.txt (75.9 KB)

Can you explain in more detail how to read the BAR address and how to set the bus master/memory space bits with setpci?
Thank you.

Hi,

I don’t see anything suspicious in the lspci output.

To read the BAR, follow the steps below:

  1. Read config offset 0x4 (the Command register)
    setpci -s 01:00.0 0x4.l
  2. Set the lowest 3 bits (I/O space, memory space, and bus master enable)
    setpci -s 01:00.0 0x4.l=<val|0x7>
  3. Read the first dword of the BAR
    busybox devmem 0x40100000
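
For example, assuming the NVMe endpoint is at 01:00.0 and its BAR0 is at 0x40100000 (take the real values from lspci), the whole sequence in a root shell would be:

    val=$(setpci -s 01:00.0 0x4.l)    # read the Command/Status dword
    # set bits 0-2: I/O space enable, memory space enable, bus master enable
    setpci -s 01:00.0 0x4.l=$(printf '%08x' $(( 0x$val | 0x7 )))
    busybox devmem 0x40100000         # a healthy device returns something other than all 1s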

Also, can you tell me whether you see this issue only during warm reboot, or during cold boot as well?

Thanks,
Manikanta

Hi

The command “busybox devmem 0x40100000” gives “0xF0013FFF”.
We see the issue on both warm reboot and cold boot.

Best regards

Hi,

The BAR access looks fine, so the link is stable. I don’t see any issue on the Tegra side. Please check why such a big page size is read from the NVMe BAR.

Thanks,
Manikanta

Hi

The weird thing is that when we program the EEPROM of the PCIe switch (PI7C9X2G404SV) with its configuration, the NVMe is no longer recognized and we get the error message from our first post. When we erase the EEPROM (PCIe switch default configuration), the NVMe is always recognized. As the NVMe is not behind the PCIe switch and the two are connected to different PCIe ports of the module, we would expect them to have no influence on each other. Do you have an idea why there is a correlation?

How can we check why the big page size is read from the NVMe BAR?

Thank you.

  1. Apply the attached patch. If the NVMe doesn’t come up, execute the command below in a root shell.

echo 1 > /sys/bus/pci/rescan

  2. Maybe some delays are causing the issue? Try these commands in a root shell and see whether the full PCIe hierarchy comes up or not.
    cd /sys/bus/platform/drivers/tegra-pcie
    ls                                  # shows the controller instance, e.g. 10003000.pcie-controller
    echo 10003000.pcie-controller > unbind
    echo 10003000.pcie-controller > bind

“How can we check why the big page size is read from the NVMe BAR?”

You have to go through the NVMe spec for these details.
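
As a starting point, you can read CAP.MPSMIN directly over the BAR. A quick check in a root shell, reusing the BAR address from above:

    lo=$(busybox devmem 0x40100000)    # CAP[31:0]
    hi=$(busybox devmem 0x40100004)    # CAP[63:32]
    # CAP.MPSMIN is bits 51:48 of CAP, i.e. bits 19:16 of the high dword
    echo $(( 1 << (12 + (($hi >> 16) & 0xf)) ))    # minimum device page size in bytes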
rel-32_t210_t186.diff (762 Bytes)

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.