TX2 NX - NVMe drive sometimes fails when PCIe devices are connected over a PCIe switch

Dear Nvidia

We have a problem with the TX2 NX module and PCIe devices.
An NVMe SSD is connected to the x2 port of the TX2 NX (M.2 Key M). On the other PCIe port (x1), three PCIe switches are daisy-chained, with additional PCIe devices connected to them. We now sometimes see, after a reboot of the system, that the NVMe drive is not correctly initialized and the device is lost with the following messages:

nvme nvme0: Minimum device page size 1324217728 too large for host (4096)
nvme nvme0: Removing after probe failure status: -19

Additionally, we see an error message from the PCIe controller:

tegra-pcie 10003000.pcie-controller: PCIE: Response decoding error, signature: 11120005

When we use the exact same setup with a Xavier NX module, we never see this error; we therefore think it is something specific to the TX2 NX module.

Is there a limit to the number of devices that can be attached to the PCIe ports? When we reduce the number of devices attached to the x1 port (connecting only the first PCIe switch), the NVMe is always initialized without an error.
Let us know if you need any additional information.
Thank you.

Best regards

I can’t answer, but I will add that power delivery is often an issue. If you have a way to power any or all devices externally, try that and see if anything changes. Another common issue is signal quality and/or signal timing. Look at any power delivery changes you might be able to make that (at least for testing) might improve power isolation and regulation among the devices.

Hi linuxdev

Thank you for your answer.
We do not think power is an issue, as all connected devices are powered externally. We think it has to do with the memory allocation for the PCIe devices, similar to the following topic:

Thank you.

Hi Nvidia Team

Can you help us here? Can we increase the allocated memory for PCIe devices?
Thank you.

Kind regards

Hi Nvidia Team

We have not heard anything from you yet and need a solution as soon as possible. Could you please support us?
Thank you.

Hi,

I believe the PCIe link is not stable. The outrageously high page size request is likely the result of reading an invalid (all 1s) value from the NVMe BAR.
Please share complete UART/dmesg logs.
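
To illustrate the suspicion: the driver derives the minimum device page size as 2^(12 + CAP.MPSMIN), where CAP is the 64-bit capability register at offset 0x0 of the NVMe BAR and MPSMIN is bits 51:48. A rough shell sketch of what an all-1s read would produce:

    # If CAP reads back as all 1s (dead link), MPSMIN comes out as 0xf
    mpsmin=$(( (0xFFFFFFFFFFFFFFFF >> 48) & 0xf ))
    # Minimum device page size = 2^(12 + MPSMIN) = 134217728 bytes,
    # far above the 4096 bytes the host supports
    echo $(( 1 << (12 + mpsmin) ))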

Thanks,
Manikanta

Hi Manikanta

Thank you for your answer.
Here are the requested log files:
dmesg_nvme_nok.txt (74.8 KB)
uart_nvme_nok.txt (21.9 KB)

Let us know if you need further information.
Thank you.

Hi,

Please provide “sudo lspci -vvv” output for both the working and the non-working case.
In the non-working case, read the BAR address using busybox devmem and check whether you get valid data.

Note: since the NVMe driver is bailing out, you might need to explicitly set the bus master and memory space bits in the EP config space using setpci before accessing the BAR.

Thanks,
Manikanta

Hi

Here is the output of lspci -vvv:
nvme_not_working.txt (75.9 KB)
nvme_working.txt (75.9 KB)

Can you explain in more detail how to read the BAR address and how to set the bus master/memory space bits with setpci?
Thank you.

Hi,

I don’t see anything suspicious in the lspci output.

To read the BAR, follow the steps below:

  1. Read config offset 0x4 (the Command register)
    setpci -s 01:00.0 0x4.l
  2. Set the lowest 3 bits (I/O space, memory space, and bus master enable)
    setpci -s 01:00.0 0x4.l=<val|0x7>
  3. Read the first dword of the BAR
    busybox devmem 0x40100000
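
For example, assuming the NVMe endpoint is at 01:00.0 and its BAR0 is at 0x40100000 (take the real values from lspci), the whole sequence in a root shell would be:

    val=$(setpci -s 01:00.0 0x4.l)    # read the Command/Status dword
    # set bits 0-2: I/O space enable, memory space enable, bus master enable
    setpci -s 01:00.0 0x4.l=$(printf '%08x' $(( 0x$val | 0x7 )))
    busybox devmem 0x40100000         # a healthy device returns something other than all 1s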

Also, can you tell me whether you see this issue only during warm reboot, or during cold boot as well?

Thanks,
Manikanta

Hi

The command “busybox devmem 0x40100000” gives “0xF0013FFF”.
We see the issue on both warm reboot and cold boot.

Best regards

Hi,

The BAR access looks fine, so the link is stable. I don’t see any issue on the Tegra side. Please check why such a big page size is read from the NVMe BAR.

Thanks,
Manikanta

Hi

The weird thing is that when we program the EEPROM of the PCIe switch (PI7C9X2G404SV) with its configuration, the NVMe is no longer recognized and we get the error message from our first post. When we erase the EEPROM (PCIe switch default configuration), the NVMe is always recognized. As the NVMe is not behind the PCIe switch and the two are connected to different PCIe ports of the module, we would expect them to have no influence on each other. Do you have an idea why there is a correlation?

How can we check why the big page size is read from the NVMe BAR?

Thank you.

  1. Apply the attached patch. If the NVMe doesn’t come up, execute the command below in a root shell.

echo 1 > /sys/bus/pci/rescan

  2. Maybe some delays are causing the issue? Try these commands in a root shell and see whether the full PCIe hierarchy comes up or not.
    cd /sys/bus/platform/drivers/tegra-pcie
    ls                                  # shows the controller instance, e.g. 10003000.pcie-controller
    echo 10003000.pcie-controller > unbind
    echo 10003000.pcie-controller > bind

“How can we check why the big page size is read from the NVMe BAR?”

You have to go through the NVMe spec for these details.
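
As a starting point, you can read CAP.MPSMIN directly over the BAR. A quick check in a root shell, reusing the BAR address from above:

    lo=$(busybox devmem 0x40100000)    # CAP[31:0]
    hi=$(busybox devmem 0x40100004)    # CAP[63:32]
    # CAP.MPSMIN is bits 51:48 of CAP, i.e. bits 19:16 of the high dword
    echo $(( 1 << (12 + (($hi >> 16) & 0xf)) ))    # minimum device page size in bytes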
rel-32_t210_t186.diff (762 Bytes)

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.