Does PCIe in the M.2 need some kind of activation?

I want to try a dual gigabit mPCIe card with Nano devkit to add some more Gigabit Ethernet connectivity.

I use mPCIe->M.2 key A+E converter: https://www.delock.de/produkte/S_62848/merkmale.html?setLanguage=en .

The Ethernet NIC is this mPCIe card: https://www.digitus.info/de/produkte/aktive-netzwerkkomponenten/netzwerkkarten-und-adapter/dn-10134/?PL=en .

This setup has been successfully tested on Xavier NX devkit, so the components work as required.

However, I can’t get it to work on the Nano devkit (the earlier version, I think A02). The mPCIe->M.2 converter has an LED that indicates it is powered. On Xavier NX, the LED is on from the moment I plug in the power. On Nano, it just blinks for a tenth of a second and then stays off. The Ethernet card doesn’t show up in lspci, and dmesg doesn’t show any attempts to bring up the card.

Because of the power LED on the converter, it seems to me this could be a power-related problem. I found in the Xavier NX docs that the M.2 A+E key slot should provide 0.8 A @ 3V3. The Nano docs don’t state that per connector, but the whole 3V3 line should have 1.5 A, and I don’t have any other devices connected, so there should be enough juice for the card (given that it works with the 0.8 A on Xavier NX).

Is there something I have to set in software to get it working?

I tried editing extlinux.conf, appending pci=noaer pcie_aspm=off, but it didn’t help. Anyways, I think putting it into kernel parameters is late, because the converter board’s LED blinks right after I plug the jack in, and there’s definitely not enough time to reach kernel. And I’m not sure whether the PCIe lane can be restarted with a different config later.
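
For reference, the change described above amounts to extending the APPEND line of the boot entry in extlinux.conf. Below is a minimal Python sketch of that edit, assuming the usual layout where each boot entry has an indented APPEND line. The helper name add_kernel_params and the sample entry are made up for illustration and are not taken from a real device.

```python
def add_kernel_params(conf_text, params):
    """Return extlinux.conf text with params appended to every APPEND line.

    Parameters already present on a line are not added a second time.
    """
    out = []
    for line in conf_text.splitlines():
        if line.strip().upper().startswith("APPEND"):
            existing = line.split()
            missing = [p for p in params if p not in existing]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical sample entry, not copied from a real device.
    sample = (
        "LABEL primary\n"
        "      LINUX /boot/Image\n"
        "      APPEND ${cbootargs} quiet\n"
    )
    print(add_kernel_params(sample, ["pci=noaer", "pcie_aspm=off"]))
```

The function works on text only, so the result can be inspected before anything is written back to the real file.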

I use the system flashed on an SD card. I’ve stumbled upon the thread “Where can I change default cmdline/cbootargs and other questions regarding jetson nano boot process”, where the last comment says every developer should use the flash.sh script, because it flashes not only the SD card but also the QSPI. Now that’s interesting. Should I consider that? What’s QSPI? I thought it was a nickname for the eMMC on the module-only version of Nano.

One last thought - could this have some connection to rfkill? Is the system disabling power to the M.2 slot because it has never been configured with a wifi card?

Thank you for helping me solve this issue.

Can you please give info about the SW you are using? (i.e. the L4T release, 32.4.x?, and the JetPack version)

I installed the OS on Nano using this tutorial https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit#write, flashing the OS image to a microSD card. I’m using the latest available version (as of last week when I did the experiments).

I did a few more tests with different kinds of hardware.

I tried USB->mPCIe->M.2 A+E on Nano, and it did not work (https://www.delock.de/produkte/G_95234/merkmale.html + https://www.delock.de/produkte/G_62848/merkmale.html). The best I got was that I was able to connect a flash drive to the usb (the card showed in lspci and the flash drive showed in lsusb) and start a copy test, and I got like 15% of the transfer speed it can do, and the whole system froze after copying about a gigabyte (consistently). When I connected this exact set of peripherals to Xavier NX, it worked flawlessly. This USB card has separate 5V pins for power, so the problem should not be that Nano can’t give enough power - in this case, it should only power the USB-PCIe chip, but not the USB devices. I took the 5V power from Nano’s pins, and the Nano itself is powered via a barrel jack with 4A power. I also tried powering the USBs with a completely separate power source, but that somehow did not work. I’d probably have to connect the grounds of Nano and the power source, which I did not do during the test.

Throughout all these tests, I saw lots of dmesg errors about pcie Advanced Error Reporting (AER) fixing some issues. These do not show on Xavier NX, or just one or two. On nano, it’s a constant stream of them. I tried booting the kernel with pci=noaer, but it didn’t help (functionally; the errors disappeared, of course). I also tried pcie_aspm=off, but again no help.

What is the substantial difference between the Nano’s M.2 A+E port and the Xavier NX’s one? Looking into the datasheet, they should be pretty much the same, just Xavier has Gen 3 and Nano Gen 2 PCIe (but the hardware I tried shouldn’t require Gen 3).

Any news or thoughts about this?

Nano and NX have different Tegra chips but as far as the PCIe interface is concerned, it should work fine in both cases. I think the reason why the LED glows for a very short time and then goes off is that the PCIe link didn’t come up.
Could you please try the below patch?

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 63c0e343d388..52c46b68981e 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -2194,7 +2194,7 @@ static bool tegra_pcie_port_check_link(struct tegra_pcie_port *port)
 		tegra_pcie_port_reset(port);
 	} while (--retries);
 
-	return false;
+	return true;
 }
 
 static void tegra_pcie_apply_sw_war(struct tegra_pcie_port *port,

This would stop the controller from getting powered down. Once the system boots to console, can you please check ‘lspci -vv’ output and see if the corresponding root port’s DLActive has “+” instead of “-”?
I’m just wondering if the endpoint, in this case, needs more time to establish the link with the root port. If so, the above patch should solve that issue.
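
The DLActive check can also be scripted. Here is a rough Python sketch, assuming lspci -vv prints one unindented header line per device followed by indented capability lines, and that the DLActive flag appears among the link status flags of each PCIe device. The function name is illustrative only. Note that lspci generally has to run as root to show the capability details.

```python
import re
import subprocess


def dl_active_by_device(lspci_text):
    """Map each device header line to True (DLActive+) or False (DLActive-)."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            current = line.strip()  # an unindented line starts a new device
        elif current is not None and current not in result:
            m = re.search(r"DLActive([+-])", line)
            if m:
                result[current] = (m.group(1) == "+")
    return result


if __name__ == "__main__":
    # Run as root so that the capability block is readable.
    text = subprocess.run(["lspci", "-vv"], capture_output=True, text=True).stdout
    for device, active in dl_active_by_device(text).items():
        print("DLActive+" if active else "DLActive-", device)
```

Devices without a DLActive flag in their listing are simply left out of the result.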

Hmm, I can’t replicate the not-lighting-LED problem anymore, the M.2->mPCIe card seems to work correctly now regardless of using a patched kernel or not. I’m not sure what changed, but now it starts up every time.

However, the mPCIe cards connected to it still misbehave. I tried both the patched and the unpatched kernel, and it seems to me the patch did not help (it usually made things even worse, i.e. none of the devices worked after reboot). The Ethernet card more or less worked with the unpatched kernel, but there were sporadic errors like CPU freezes and watchdog-induced reboots. After such a reboot, the Ethernet controller endpoints would not be found and appeared only after a power off/on (the PCIe switch on the mPCIe card did appear, all three devices; only the Ethernet endpoints were missing).

The USB controller card was even worse now - it started correctly on a cold boot, but warm boots almost always ended up in stuck CPUs. Even after a cold boot, as soon as I inserted a flash drive, something happened on the PCI bus and the controller disappeared (I tested this both with the patch and without it). This is what appeared on the UART console:

[   35.915792] usb 2-2: new SuperSpeed USB device number 2 using xhci_hcd
[   35.941499] usb 2-2: New USB device found, idVendor=125f, idProduct=de7a
[   35.941565] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   35.941610] usb 2-2: Product: ADATA USB Flash Drive
[   35.941651] usb 2-2: Manufacturer: ADATA
[   35.941693] usb 2-2: SerialNumber: *****
[   35.948270] usb-storage 2-2:1.0: USB Mass Storage device detected
[   35.952584] scsi host0: usb-storage 2-2:1.0
[   36.982727] scsi 0:0:0:0: Direct-Access     ADATA    USB Flash Drive  1.00 PQ: 0 ANSI: 6
[   36.991944] sd 0:0:0:0: [sda] 30720000 512-byte logical blocks: (15.7 GB/14.6 GiB)
[   37.000049] sd 0:0:0:0: [sda] Write Protect is off
[   37.007972] sd 0:0:0:0: [sda] Mode Sense: 23 00 00 00
[   37.008579] sd 0:0:0:0: [sda] Write cache: disabled, read cache: disabled, doesn't support DPO or FUA
[   37.034516]  sda: sda1 sda2 sda4
[   37.038818] sd 0:0:0:0: [sda] Attached SCSI removable disk
[   38.408487] tegra-pcie 1003000.pcie: unexpected MSI
[   48.498396] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.
[   48.506285] xhci_hcd 0000:01:00.0: Assuming host is dying, halting host.
[   48.517527] xhci_hcd 0000:01:00.0: HC died; cleaning up
[   48.527338] usb 2-2: USB disconnect, device number 2
[   48.594215] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
[   48.594227] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 00 03 00 00 01 00 00
[   48.594234] blk_update_request: I/O error, dev sda, sector 768

and another try

[   17.604597] CPU0: SError detected, daif=1c0, spsr=0x200000c5, mpidr=80000000, esr=bf000002
[   28.139736] CPU2: SError detected, daif=1c0, spsr=0x200000c5, mpidr=80000002, esr=bf000002
[   28.139739] CPU1: SError detected, daif=1c0, spsr=0x600000c5, mpidr=80000001, esr=bf000002
[   28.139758] tegra-xusb 70090000.xusb: controller firmware hang
[   28.166652] tegra-xusb 70090000.xusb: WARN: xHC CMD_RUN timeout
[   38.688936] CPU1: SError detected, daif=1c0, spsr=0x600000c5, mpidr=80000001, esr=bf000002
[   38.688961] tegra-xusb 70090000.xusb: xhci_suspend() failed -110
[   49.215816] CPU2: SError detected, daif=1c0, spsr=0x600000c5, mpidr=80000002, esr=bf000002
[   49.215826] INFO: rcu_preempt detected stalls on CPUs/tasks:
[   49.215827] INFO: rcu_preempt self-detected stall on CPU
[   49.215835]  0-...: (1 GPs behind) idle=a47/140000000000001/0 softirq=7697/7702 fqs=0
[   49.215840]  1-...: (1 GPs behind) idle=723/140000000000001/0 softirq=7437/7438 fqs=0
[   49.215843]  1-...: (1 GPs behind) idle=723/140000000000001/0 softirq=7437/7438 fqs=0
[   49.215846]
[   49.215847]  2-...: (2 GPs behind) idle=073/1/0 softirq=8665/8666 fqs=0
[   49.215852]
[   49.215853] rcu_preempt kthread starved for 5269 jiffies! g342 c341 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
[   49.216082] rcu_preempt kthread starved for 5269 jiffies! g342 c341 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0

The tegra-pcie 1003000.pcie: unexpected MSI error appeared both with the ethernet controller and the usb, and once it appeared, the card became basically unusable.
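
To see how long the card survives after the unexpected MSI message, the timestamps can be pulled out of a saved console log. This is only a small Python sketch, assuming the standard "[ seconds.micros] message" kernel log prefix. The helper name first_timestamp is made up, and the two sample lines are the ones from the first UART log above.

```python
import re

# Matches a kernel log line such as "[   38.408487] message text".
TS_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def first_timestamp(log_text, needle):
    """Return the timestamp in seconds of the first line containing needle, or None."""
    for line in log_text.splitlines():
        m = TS_LINE.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


if __name__ == "__main__":
    log = (
        "[   38.408487] tegra-pcie 1003000.pcie: unexpected MSI\n"
        "[   48.498396] xhci_hcd 0000:01:00.0: xHCI host not responding to stop endpoint command.\n"
    )
    msi = first_timestamp(log, "unexpected MSI")
    dead = first_timestamp(log, "xHCI host not responding")
    print(f"host stopped responding {dead - msi:.1f} s after the unexpected MSI")
```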

Now, this is weird behavior. Do you start observing this only after you start exercising the Ethernet functionality? To me, this looks more like an issue with the power source now.

It doesn’t happen when I do an iperf test of the on-board Ethernet controller. I power the Nano with a 5V/4A barrel jack power source.

It’s definitely not the PSU. I did a stress test - nbody CUDA sample, stress program, iperf test and copying from a USB-SATA magnetic drive (powered from the USB). It’s been running fine for more than half an hour with the PSU being just mildly warm.

powertop:

          Usage     Device name
         90,0%        CPU use
        24795 pkts/s  Network interface: eth0 (r8168)
        100,0%        runtime-70030000.hda
        100,0%        USB device: 4-Port USB 3.1 Hub (Generic)
        100,0%        runtime-1003000.pcie
        100,0%        runtime-60020000.dma
        100,0%        runtime-70110000.clock
        100,0%        runtime-50000000.host1x
        100,0%        runtime-70090000.xusb
        100,0%        runtime-57000000.gpu
        100,0%        PCI Device: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
        100,0%        PCI Device: NVIDIA Corporation Device 0faf
        100,0%        USB device: USB 3.0 Device (USB 3.0 Device)
        100,0%        USB device: xHCI Host Controller

tegrastats:

RAM 1189/3964MB (lfb 1x2MB) SWAP 32/1982MB (cached 0MB) CPU [100%@1479,100%@1479,100%@1479,100%@1479] EMC_FREQ 0% GR3D_FREQ 42% PLL@55C CPU@59.5C PMIC@100C GPU@55C AO@67.5C thermal@57C POM_5V_IN 5209/5209 POM_5V_GPU 235/235 POM_5V_CPU 3015/3015
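
As a rough cross-check of the power argument, the POM_5V_* fields of the tegrastats line can be converted into an approximate input current. The Python sketch below assumes those fields are milliwatts in current/average form, as the tegrastats documentation describes, and assumes a nominal 5 V supply. The function names are made up for illustration. It only covers what tegrastats reports, which may not include every peripheral on the carrier board.

```python
import re


def parse_power_rails(tegrastats_line):
    """Return {rail: (current_mw, average_mw)} for the POM_* fields of one line."""
    return {
        name: (int(cur), int(avg))
        for name, cur, avg in re.findall(r"(POM_\w+) (\d+)/(\d+)", tegrastats_line)
    }


def input_current_amps(rails, volts=5.0):
    """Estimate the input current from POM_5V_IN, assuming a nominal supply voltage."""
    current_mw, _average_mw = rails["POM_5V_IN"]
    return current_mw / 1000.0 / volts


if __name__ == "__main__":
    line = "POM_5V_IN 5209/5209 POM_5V_GPU 235/235 POM_5V_CPU 3015/3015"
    rails = parse_power_rails(line)
    print(rails)
    print(f"approx. {input_current_amps(rails):.2f} A of the 4 A the PSU can supply")
```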

I also tried taking the wifi card out of the Xavier NX and putting it in my Nano. Even this card doesn’t fully work. It appears in the system and responds to most control commands, but it won’t start scanning networks - it fails with “operation not permitted” and the like.

This appears in dmesg as soon as the kernel loads the driver, and then again from time to time:

rtl88x2ce enabling device (0000 -> 0003)
mc-err: (0) csr_afir: EMEM address decode error

Okay, the AzureWave wifi card was a separate issue; it required a DTB patch, described in the thread “Using Xavier NX Dev Kit's Azurewave AW-CB375NF M.2 Module on Jetson Nano”. After this patch, the wifi card works correctly. It is the first device that actually works in my Nano’s M.2 slot, which at least shows that the slot isn’t damaged.

The DTB patch did not help with the other two devices I’m testing, though. I’ve rebuilt the kernel with PCIE_DEBUG=y to get some more information. Here are some dmesg logs:

dmesg.rj45.cable_plugging (124.9 KB)
dmesg.rj45.no_cable (80.1 KB)
dmesg.usb.1 (70.4 KB)
dmesg.usb.2 (102.0 KB)

Basically, as soon as I connect one of these devices, the Nano doesn’t even finish booting without the pci=noaer kernel parameter. With it, it boots, and until I start using the cards everything looks fine, but as soon as I plug a cable into either of them, bad things start happening and it usually ends with the Nano rebooting itself.