Jetson TX2 NVMe Hotplug/Hotswap PCIe Switch

@vidyas

Hello,
I have another question that expands on the NVMe hotplug / hotswap problem you helped me with back in February.

We now need to hotplug / hotswap an NVMe device (a CFX card, to be precise) that is connected to the Xavier NX via a PCIe switch (Pericom PI7C9X2G).

If the CFX card is plugged in before the system is booted, it enumerates as 0004:05:00.0 Non-Volatile memory controller and works as expected.

factory@localhost:~$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:03:00.0 RAM memory: Xilinx Corporation Device d021
0004:05:00.0 Non-Volatile memory controller: Device 1987:5013 (rev 01)
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

$ ls /sys/class/nvme/
nvme0 nvme1

If the CFX card is plugged in after the system has booted and I run the hot_plug debugfs command for that PCIe controller, nothing happens.

$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:03:00.0 RAM memory: Xilinx Corporation Device d021
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

** plug in the CFX card

$ sudo cat /sys/kernel/debug/pcie-4/hot_plug

$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:03:00.0 RAM memory: Xilinx Corporation Device d021
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

With a rescan, the CFX card is enumerated, but no nvme device is created.

$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:03:00.0 RAM memory: Xilinx Corporation Device d021
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

** plug in the CFX card

sudo sh -c "echo 1 > /sys/bus/pci/rescan"

factory@localhost:~$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:03:00.0 RAM memory: Xilinx Corporation Device d021
0004:05:00.0 Non-Volatile memory controller: Device 1987:5013 (rev 01)
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

$ ls /sys/class/nvme/
nvme0

Is there any way to hotplug an NVMe device that is connected to the root port via a PCIe switch?

Thanks

Your observations are in line with expectations. hot_plug is meant for devices that are connected directly behind Tegra’s root port, so it doesn’t work for devices connected behind a PCIe switch.
BTW, I do expect ‘rescan’ to work in your case, and I see that it does enumerate the NVMe device. What is the issue in this case? Is the nvme driver not getting bound to the device? What does dmesg tell us? Are any errors being reported?
Have you tried force-binding the driver to the device using the following command?
echo "0004:05:00.0" > /sys/bus/pci/drivers/nvme/bind
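A minimal sketch of that check-then-bind step, assuming the standard sysfs layout under /sys/bus/pci. The function names and the SYSFS_PCI override are hypothetical and only there so the logic can be tried off-target; on the device the writes need root.

```shell
#!/bin/sh
# SYSFS_PCI is a hypothetical override; it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Print the name of the driver bound to a PCI function, or "none".
# $1 = PCI address, e.g. 0004:05:00.0
bound_driver() {
    link="$SYSFS_PCI/devices/$1/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink to the driver's directory.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Ask a driver to bind to a PCI function, but only if nothing is bound yet.
# $1 = PCI address, $2 = driver name (e.g. nvme)
bind_if_unbound() {
    if [ "$(bound_driver "$1")" = "none" ]; then
        echo "$1" > "$SYSFS_PCI/drivers/$2/bind"
    fi
}
```

For example, `bound_driver 0004:05:00.0` followed by `bind_if_unbound 0004:05:00.0 nvme`. If the driver's probe fails, the write to bind returns an error and dmesg should show why.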

Thanks very much,

It works as you suggested, by removing the root port and rescanning. After this the CFX card is mountable. Hotswapping works as well: remove the root port before detaching the CFX card, then rescan.

*** boot system without CFX card attached 

$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:03:00.0 RAM memory: Xilinx Corporation Device d021
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

*** attach CFX card

$ sudo sh -c "echo 1 > /sys/bus/pci/devices/0004\:00\:00.0/remove"

$ sudo lspci
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

$ sudo sh -c "echo 1 > /sys/bus/pci/rescan"

$ sudo lspci
0004:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad1 (rev a1)
0004:01:00.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:01.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:02.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:02:03.0 PCI bridge: Pericom Semiconductor Device 2404 (rev 05)
0004:05:00.0 Non-Volatile memory controller: Device 1987:5013 (rev 01)
0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0005:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

$ ls /sys/class/nvme/
nvme0  nvme1
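The remove-then-rescan sequence above could be wrapped roughly as follows. This is only a sketch: reenumerate_below and the SYSFS_PCI override are hypothetical names, it must run as root on the target, and removing the root port takes down every device behind it, so anything mounted from those devices should be unmounted first.

```shell
#!/bin/sh
# SYSFS_PCI is a hypothetical override; it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Remove a port (and everything below it), then rescan the PCI bus.
# $1 = PCI address of the port to remove, e.g. 0004:00:00.0
reenumerate_below() {
    dev="$SYSFS_PCI/devices/$1"
    # Skip the remove step if the port is already gone.
    if [ -e "$dev/remove" ]; then
        echo 1 > "$dev/remove"
    fi
    # A bus-wide rescan re-enumerates the port and whatever is now behind it.
    echo 1 > "$SYSFS_PCI/rescan"
}
```

With the addresses from this thread the call would be `reenumerate_below 0004:00:00.0`, followed by `ls /sys/class/nvme/` to see whether the second controller appeared.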

@vidyas

Sorry for bothering you again.

However, we need to be able to hotplug / hotswap the CFX card connected to the PCIe switch without removing the whole switch and interrupting the connection to the other devices on it.

If I only do a rescan without first executing remove, the following dmesg messages are logged. The same thing happens if I first remove only the switch port the CFX card is connected to.

echo 1 > /sys/bus/pci/devices/0004:02:03.0/remove
[ 1391.242126] pci 0004:02:03.0: [12d8:2404] type 01 class 0x060400
[ 1391.243070] pci 0004:02:03.0: supports D1 D2
[ 1391.243077] pci 0004:02:03.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 1391.243562] iommu: Adding device 0004:02:03.0 to group 63
[ 1391.244463] pci 0004:05:00.0: [1987:5013] type 00 class 0x010802
[ 1391.244667] pci 0004:05:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 1391.245862] iommu: Adding device 0004:05:00.0 to group 67
[ 1391.246407] pci 0004:02:03.0: BAR 14: no space for [mem size 0x00100000]
[ 1391.246413] pci 0004:02:03.0: BAR 14: failed to assign [mem size 0x00100000]
[ 1391.246420] pci 0004:05:00.0: BAR 0: no space for [mem size 0x00004000 64bit]
[ 1391.246424] pci 0004:05:00.0: BAR 0: failed to assign [mem size 0x00004000 64bit]
[ 1391.246431] pci 0004:02:03.0: PCI bridge to [bus 05]
[ 1391.248518] nvme nvme1: pci function 0004:05:00.0
[ 1391.249567] nvme nvme1: Minimum device page size 134217728 too large for host (4096)
[ 1391.251670] nvme nvme1: Removing after probe failure status: -19
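In the log above, the bridge window (BAR 14 on 0004:02:03.0) and BAR 0 of the NVMe function both fail to get address space before the nvme probe fails. The page-size message is consistent with the driver reading all-ones from an unmapped BAR, although that is an inference from the log and not something confirmed in this thread. A small filter for spotting such failures, where bar_failures is a hypothetical helper name:

```shell
#!/bin/sh
# Read kernel log text on stdin and print "<pci-address> BAR <n>"
# for every "failed to assign" line.
bar_failures() {
    sed -n 's/.*pci \([0-9a-f:.]*\): BAR \([0-9]*\): failed to assign.*/\1 BAR \2/p'
}
```

Typical use after a rescan would be `dmesg | bar_failures`; an empty result means no assignment failure was logged.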

Since Tegra doesn’t natively support hot-plugging devices behind a PCIe switch, I’m afraid the above may not be possible.

@vidyas
Would it be possible on a Xavier NX device?