How to enable the function of PCIe NVMe SSD on TX1?

Dear Sir,

Our customer wants to use the PCIe NVMe SSD on TX1 to develop their software.

After checking the related topics, I still cannot get it to work.
(https://devtalk.nvidia.com/search/more/sitecommentsearch/NVMe%20SSD/?boards=164&order=date-desc)

Can you give me any advice?

  1. My Environment
    Board: Jetson TX1
    SSD: ASUS HYPER M.2X4 MINI
    Intel SSD 760p 128GB
    L4T: 24.2.1

  2. I know I need to enable the kernel config “CONFIG_BLK_DEV_NVME=y”.
    After recompiling the kernel, the option is enabled:

zcat /proc/config.gz | grep -i nvme

CONFIG_BLK_DEV_NVME=y

  1. When I boot with “CONFIG_BLK_DEV_NVME=y”, the boot stalls for thirty-some seconds, shows these messages, and then continues to the Ubuntu desktop. No SSD interface appears.

    [ 2.854423] PCI: enabling device 0000:01:00.0 (0140 -> 0142)
    [ 3.058024] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010
    [ 3.066142] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0008(Requester ID)
    [ 3.066142] pcieport 0000:00:01.0: device [10de:0fae] error status/mask=00004000/00000000
    [ 3.078021] pcieport 0000:00:01.0: [14] Completion Timeout (First)

  2. If I follow this topic and disable ASPM (“CONFIG_PCIEASPM=n”), I get a kernel crash (Watchdog detected hard LOCKUP on cpu 0).
    (https://devtalk.nvidia.com/default/topic/973034/jetson-tx1/-solved-problem-with-intel-600p-nvme-ssd/1)

  3. lspci command
    00:01.0 PCI bridge: NVIDIA Corporation Device 0fae (rev a1)
    01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03)

Thanks.
Yours Sincerely,
S.K.

AER is a PCIe bus issue, and not actually an NVMe issue (though I suppose it could end up causing NVMe errors). Since the kernel was changed, there are a couple of steps to verify prior to actually looking at PCI.

The “CONFIG_LOCALVERSION” in a kernel config changes the suffix reported by the command “uname -r”. The actual module search location is “/lib/modules/$(uname -r)/”. What is your new “uname -r”, and are all of your kernel modules somewhere in “/lib/modules/$(uname -r)/”?
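Both questions can be answered from a shell. This is only a minimal sketch; the helper name count_kernel_modules is made up for illustration and is not part of L4T:

```shell
#!/bin/sh
# Count kernel module files (*.ko) anywhere below a directory.
# count_kernel_modules is a hypothetical helper name, not an L4T tool.
count_kernel_modules() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        # A missing directory is reported as zero modules.
        echo 0
        return 0
    fi
    find "$dir" -type f -name '*.ko' | wc -l | tr -d ' '
}

# Usage on the Jetson:
#   uname -r
#   count_kernel_modules "/lib/modules/$(uname -r)"
```

A count of 0 would suggest the modules were installed under a different release string than the one the running kernel reports.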

What was your kernel’s starting configuration? For example, did you use the “/proc/config.gz”, or “make tegra_defconfig”?

If it turns out the kernel setup is valid, then you’ll want to post a verbose lspci. Since your device shows as PCI slot “01:00.0”, the command for this, and to log the result, is:

sudo lspci -vvv -s 01:00.0 | tee log.txt

If you hover your mouse over the quote icon in the upper right corner of one of your existing posts, then other icons will show up. The paper clip icon is for attaching files. Or you can use the “code” icon (looks like “</>” while editing a post) and paste into that (code icons preserve white space, add scrollbars, and make it easier to read).

Hi linuxdev,

  1. What is your new “uname -r”, and are all of your kernel modules somewhere in “/lib/modules/$(uname -r)/”?
ubuntu@tegra-ubuntu:~$ uname -r
3.10.96-tegra+
ubuntu@tegra-ubuntu:~$

Please see the attached file “3.10.96-tegra+.tar.gz”.

2. What was your kernel’s starting configuration? For example, did you use the “/proc/config.gz”, or “make tegra_defconfig”?

make tegra21_defconfig

Please see the attached file “config.gz”.

3. sudo lspci -s 01:00.0 | tee log.txt
Please see the attached file “log.txt”.

Thanks.

Yours Sincerely,
S.K.
config.gz (26.3 KB)
log.txt (79 Bytes)

Do you have all modules somewhere in “/lib/modules/3.10.96-tegra+/”? That trailing “+” is typically something people don’t realize is being added in (there is a config script which sometimes adds this, and I’ll recommend removing it). Although your “CONFIG_LOCALVERSION” is just “-tegra” (as recommended), this extra script with the “+” being added might mean some of your modules are no longer in the correct directory.

There are external sources sometimes used for the Linux kernel, and for whatever reason these will assume you want to alter the “uname -r” to be unique. If this is causing modules to be placed in the wrong location, then you will need to rebuild the Image after doing this in the source code of the kernel (there may be other ways of doing this, but I edit like this):

  1. Find "scripts/setlocalversion" in the kernel source.
  2. Add a return statement so "scm_version()" does not append the "+".
  3. Put the "return" right after the variable declarations in that function. It will look like this:
    scm_version()
    {
            local short
            short=false
            return
    

If all of your modules were already in the correct place, then no need to worry about the above.

Note that even if modules were not found, the non-module (built-in) features would still work. I mention modules because although your tar file showed the metadata I don’t know if this file was set up to recursively include subdirectories. The actual modules are in the subdirectories, and I saw none, e.g., in “/lib/modules/3.10.96-tegra+/kernel/…”. Good debugging isn’t really possible until you know the modules are all present.

As a side comment keep in mind that this kernel is rather old. For quite some time the TX1 did not have newer releases, but due to the Nano release you can now get some newer R32.x releases for use with the TX1. It is possible that you might benefit from a newer release (this is definitely not a guarantee, but it is probably worth your time to investigate a newer release).

Hi Linuxdev,

I tried your steps, and the situation is the same.

I configured the PCIe subsystem and the NVMe driver (CONFIG_BLK_DEV_NVME=y) as built-in kernel drivers to avoid the module issue.
As I understand it, if any module driver were missing, the kernel would show error messages.

Next action: I will try the R28.2 environment.

Thanks!

Yours Sincerely,
S.K.

At which point during boot are you actually requiring the NVMe drive?

If you are mounting this after the kernel and modules load, then the requirements differ from using the drive during boot. For example, it should be simple to mount an NVMe partition on “/usr/local/”, but mounting the NVMe partition on “/” would be much more complicated.

Prior to the kernel loading the boot stages would require both PCIe and NVMe drivers. If loading after boot stages, then only the Linux kernel would be involved.
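For the simpler after-boot case, an /etc/fstab entry is enough once the drive enumerates. A rough sketch, assuming the partition shows up as /dev/nvme0n1p1 and is formatted ext4 (both are assumptions, so check the real device name first); fstab_line is a hypothetical helper name:

```shell
#!/bin/sh
# Print an /etc/fstab line for a data partition mounted after boot.
# "nofail" lets the boot continue when the drive is absent.
# Assumptions: ext4 filesystem; device and mount point passed by the caller.
fstab_line() {
    printf '%s\t%s\text4\tdefaults,nofail\t0\t2\n' "$1" "$2"
}

# /dev/nvme0n1p1 is an assumed device name, for illustration only.
fstab_line /dev/nvme0n1p1 /usr/local
```

Mounting over a non-empty directory hides whatever was there, so copy the existing contents to the new partition first.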

NOTE: PCIe bus signal quality issues would always matter.

Hi Linuxdev,

  1. It’s not for boot. (just storage)

  2. When I use the R28.2 environment, TX1 does not recognize the NVMe SSD.
    (The result is identical whether the driver is a module (the default setting) or built in.)

    If I run the command “lspci”, it does not list any device.

    [my expectation]
    NVMe devices should show up under /dev/nvme*.

Thanks!

Yours Sincerely,
S.K.

What you described is not actually an NVMe issue. The NVMe is never getting a chance to run, and I believe the options you have chosen in kernel config for NVMe are valid for mounting such a drive in Linux (after boot).

This log line is more or less “smoking gun” evidence:

[ 3.058024] pcieport 0000:00:01.0: AER: Uncorrected (Non-Fatal) error received: id=0010

I don’t know about the “Non-Fatal”, but it is “Uncorrected”. Before saying more on this, you should know some details about PCIe on the Jetsons.

During boot PCIe will enumerate devices. The devices won’t necessarily have a driver, but the bus is powered up and devices are identified. After boot completes, then a regular desktop PC would keep power to the bus regardless of any devices…“lspci” would for example show the PCIe bridges. On a Jetson, after boot completes, if there are no devices on the PCIe bus (other than the bridge), then the bus is powered down to reduce power consumption. This can be a problem for people with PCIe devices which need to boot prior to producing a valid PCIe enumeration response, e.g., someone with an FPGA may need to patch to do late enumeration or to keep bus power up even if no device is initially detected.

In your case there appears to be some sort of bus error, and the actual card connected in the slot fails to enumerate, and power goes off. Is this an NVMe which is actually on a PCIe card? Or is it some sort of adapter which the NVMe then plugs into?

An adapter could be, e.g., a SATA card on PCIe which an SSD connects to, or a PCIe carrier for an NVMe which wouldn’t normally go directly into a PCIe slot. If there is any kind of adapter between the NVMe and the PCIe slot, then the adapter is suspect. If the NVMe natively uses PCIe (if the NVMe is itself a PCIe slot card), then we know the NVMe itself has a PCIe error.

Because the “/dev/nvme*” only exists after the NVMe driver loads it is expected that there will be no such device if the PCIe is failing…the driver would never get a chance to load. Regardless of how many NVMe kernel configuration changes you make the device will never work if the PCIe issues are not resolved.
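The order of checks follows from that: look at the bus first, and only then at the device nodes. A minimal sketch, relying on the class name lspci printed earlier in this thread (“Non-Volatile memory controller”); nvme_bus_state is a hypothetical helper name:

```shell
#!/bin/sh
# Read lspci-style text on stdin and report whether an NVMe controller
# was enumerated on the PCIe bus.
# nvme_bus_state is a hypothetical helper name.
nvme_bus_state() {
    if grep -qi 'Non-Volatile memory controller'; then
        echo "enumerated"
    else
        echo "not enumerated"
    fi
}

# Usage on the Jetson:
#   lspci | nvme_bus_state
#   ls /dev/nvme* 2>/dev/null || echo "no NVMe device nodes"
```

If the first check reports “not enumerated”, the second one cannot succeed, whatever the NVMe kernel config is.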

Can you provide details about the specific device, and any kind of adapter board which might be used on this device?

Hi Linuxdev,

Board: Jetson TX1
SSD: ASUS HYPER M.2X4 MINI + Intel SSD 760p 128GB
https://www.asus.com/Motherboard-Accessories/HYPER_M2_X4_MINI_CARD/
https://ark.intel.com/content/www/us/en/ark/products/134577/intel-ssd-760p-series-128gb-m-2-80mm-pcie-3-0-x4-3d2-tlc.html
That’s all.

Thanks!
Yours Sincerely,
S.K.

Talking about signal quality is difficult since the PCIe signal depends drastically upon the shape and length of wiring. This includes every circuit all the way up until the end point is reached. Typically if signal is “partially” degraded a revision 3 speed will revert back to revision 2, and a revision 2 can revert back to revision 1 speeds. The TX2 carrier only supports up to revision 2, and so if revision 2 fails it will drop the signal back to revision 1 speeds. If revision 1 cannot succeed, then no device is considered to be attached.
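When a device does enumerate, the fallback is visible by comparing the LnkCap (what the device can do) and LnkSta (what was negotiated) lines of a verbose lspci. A minimal sketch, assuming the usual “Speed ...GT/s” wording of lspci -vvv output; link_speed and pcie_rev are hypothetical helper names:

```shell
#!/bin/sh
# Extract the speed from the first LnkCap or LnkSta line of
# "lspci -vvv" text read from stdin. $1 is LnkCap or LnkSta.
link_speed() {
    sed -n "s|.*$1:.*Speed \([0-9.]*GT/s\).*|\1|p" | head -n 1
}

# Map a per-lane transfer rate to the PCIe revision that introduced it.
pcie_rev() {
    case "$1" in
        2.5GT/s) echo 1 ;;
        5GT/s)   echo 2 ;;
        8GT/s)   echo 3 ;;
        *)       echo unknown ;;
    esac
}

# Usage on the Jetson (slot address taken from the earlier lspci):
#   sudo lspci -vvv -s 01:00.0 > log.txt
#   pcie_rev "$(link_speed LnkCap < log.txt)"
#   pcie_rev "$(link_speed LnkSta < log.txt)"
```

A LnkSta revision lower than the LnkCap revision means the link trained down, which is what partial signal degradation would look like.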

In your case we are unable to get the verbose version of lspci since you get no lspci at all. Something is noted in the logs, so we know the device was seen, but it never tells us what the “uncorrected error” is (whether or not this is considered fatal could depend on other software). In some way the signal quality seems to have forced degrading below the rev. 1 speed. Often each connector is a weak point, and more connectors (or just longer traces) will make things worse. I suspect the hardware is perfectly good, but signal quality just won’t allow this combination. Someone else may have a way to dig deeper into this by forcing the power rails to stay up after the device is marked as failed, but I doubt it will make it work even if we could prove it is a signal issue.
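The boot log at the start of this thread does carry one more hint: the status word in “error status/mask=00004000/00000000” and the line “[14] Completion Timeout”. A minimal sketch of how that status word maps to bit names, assuming the standard PCIe AER Uncorrectable Error Status bit layout; decode_aer_uncorrectable is a hypothetical helper name:

```shell
#!/bin/sh
# Name the bits set in a PCIe AER Uncorrectable Error Status value.
# $1 is the status as hex digits without a 0x prefix, e.g. 00004000.
# decode_aer_uncorrectable is a hypothetical helper name.
decode_aer_uncorrectable() {
    status=$((0x$1))
    for entry in \
        "4:Data Link Protocol Error" \
        "5:Surprise Down" \
        "12:Poisoned TLP" \
        "13:Flow Control Protocol Error" \
        "14:Completion Timeout" \
        "15:Completer Abort" \
        "16:Unexpected Completion" \
        "17:Receiver Overflow" \
        "18:Malformed TLP" \
        "19:ECRC Error" \
        "20:Unsupported Request"
    do
        bit=${entry%%:*}    # text before the first colon
        name=${entry#*:}    # text after the first colon
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "[$bit] $name"
        fi
    done
}

decode_aer_uncorrectable 00004000
# prints: [14] Completion Timeout
```

A completion timeout means a request sent toward the device got no reply in time. That fits a link or signal problem, but on its own it does not prove one.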

You should at minimum use this device in another computer’s PCIe and see if the disk can be partitioned and formatted, which would in turn probably make signal quality “almost” a guaranteed cause of the issue (other software could still be a cause, but I doubt it is a software issue unless there is some sort of firmware update for the NVMe).

Another consideration might be as simple as making sure the drive is correctly inserted in the slot. Without a proper mechanical chassis to hold the card in place this may not be as simple as it sounds since very minor changes in the connector could change signal quality (e.g., the grounding wire shapes).

Another test might be a different brand of carrier (the NVMe to PCIe adapter).

If someone else has an idea of how to force the PCIe to remain on and provide a verbose lspci to debug with it might help.

Can you please use the latest release and try shorting the CLKREQ signal of the slot to ground, i.e., B-12 (CLKREQ) to any of A-12/A-4/B-4/B-7 (all are grounds)?

Hi sk1977.huang,

We haven’t heard back from you in a couple of weeks, so we are marking this issue resolved by “accepting” this comment as the answer.
Please open a new forum issue when you are ready and we’ll pick it up there.