PCIe Hot-Plug not working

Please provide the following info (tick the boxes after creating this topic):
Software Version
DRIVE OS 6.0.4 SDK
other

Target Operating System
Linux
QNX
other

Hardware Platform
DRIVE AGX Orin Developer Kit (940-63710-0010-D00)
DRIVE AGX Orin Developer Kit (940-63710-0010-C00)
DRIVE AGX Orin Developer Kit (not sure its number)
other

SDK Manager Version
1.8.2.10409
other

Host Machine Version
native Ubuntu Linux 20.04 Host installed with SDK Manager
native Ubuntu Linux 20.04 Host installed with DRIVE OS Docker Containers
native Ubuntu Linux 18.04 Host installed with DRIVE OS Docker Containers
other

I want to use a miniSAS cable to communicate between two NVIDIA DRIVE AGX Orin Devkits (hereafter abbreviated as DDPO).
So I connected two DDPOs with miniSAS cables and followed the instructions at the URL below.

Unfortunately, the DDPO configured as the PCIe Root Port did not recognize the DDPO configured as the PCIe Endpoint as a PCIe device: running the lspci command on the Root Port does not show the Device ID of the Endpoint.

Please answer the following questions.

Question1
Has there been a track record of communication between two DDPO using miniSAS cables?

Question2
Do I need to update PCIe Retimer FW?

Question3
Do I need to change the Device Tree?
Also, do I need to update AURIX FW?

Question4
If there is any additional work that needs to be done, could you tell me specifically?

Dear @shibata-a,
We will check internally and get back to you on this.

Question1
Has there been a track record of communication between two DDPO using miniSAS cables?
[NV] You should be able to see the related messages in kernel log while you execute the following hotplug commands:
echo 0 > controllers/141c0000.pcie_ep/start
echo 1 > controllers/141c0000.pcie_ep/start

**Question2**
Do I need to update PCIe Retimer FW?
**[NV]** Could you please help to provide your current FW version?

**Question3**
Do I need to change the Device Tree?
Also, do I need to update AURIX FW?
**[NV]** No, no change needed in device trees.
Moreover, I don't think this is related to AURIX, but please paste your AURIX version here. Thanks.

**Question4**
If there is any additional work that needs to be done, could you tell me specifically
[NV] Development guide provides all the steps you need. There shouldn't be any other extra steps.
Could you please list all your steps/commands in both EP and RP?
We can review if there was anything missing.

Dear @BruceYangNV ,

Question1

[NV]
You should be able to see the related messages in kernel log while you execute the following hotplug commands:
echo 0 > controllers/141c0000.pcie_ep/start
echo 1 > controllers/141c0000.pcie_ep/start

Reply =>
Unfortunately, neither EP nor RP produced any relevant Linux kernel logs (dmesg).
(no change in dmesg)

Question2

[NV]
Could you please help to provide your current FW version?

=> Unfortunately, I couldn’t confirm it. How do I check the PCIe Retimer FW version currently in use at DDPO?

Question3

[NV]
please paste your Aurix version here. Thanks.

Reply =>
1.46.15

Question4

[NV]
Development guide provides all the steps you need. There shouldn’t be any other extra steps.
Could you please list all your steps/commands in both EP and RP?
We can review if there was anything missing.

Reply =>
I followed the instructions in the following URL.

■RP
$ sudo modprobe nvscic2c-pcie-epc

■EP
$ sudo modprobe nvscic2c-pcie-epf
$ sudo -s
$ cd /sys/kernel/config/pci_ep/
$ mkdir functions/nvscic2c_epf_22CC/func
$ echo 0x10DE > functions/nvscic2c_epf_22CC/func/vendorid
$ echo 0x22CC > functions/nvscic2c_epf_22CC/func/deviceid
$ ln -s functions/nvscic2c_epf_22CC/func controllers/141c0000.pcie_ep
$ echo 0 > controllers/141c0000.pcie_ep/start
$ echo 1 > controllers/141c0000.pcie_ep/start

Finally, I typed the lspci command (a Linux command) on the RP’s console, but the EP’s DeviceID did not appear in the PCIe device list.
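
For repeatability, the EP steps above could be wrapped in a small function. This is only a sketch: `configure_ep` is a hypothetical helper name, the paths and IDs are copied from the commands above, and it assumes (following the generic Linux PCI endpoint configfs layout) that functions/nvscic2c_epf_22CC only appears once the EPF module is loaded. It has to run as root on the target.

```shell
# Sketch only: configure and start the endpoint function.
# $1: configfs pci_ep directory (defaults to the path used in this thread).
configure_ep() {
    root="${1:-/sys/kernel/config/pci_ep}"
    func="functions/nvscic2c_epf_22CC/func"
    ctrl="controllers/141c0000.pcie_ep"

    # Assumption: this directory exists only after the EPF module is loaded.
    [ -d "$root/functions/nvscic2c_epf_22CC" ] || {
        echo "missing $root/functions/nvscic2c_epf_22CC (is the EPF module loaded?)" >&2
        return 1
    }
    [ -d "$root/$ctrl" ] || {
        echo "missing $root/$ctrl" >&2
        return 1
    }

    (
        cd "$root" || exit 1
        [ -d "$func" ] || mkdir "$func" || exit 1
        echo 0x10DE > "$func/vendorid" || exit 1
        echo 0x22CC > "$func/deviceid" || exit 1
        # Bind the function to the controller only if it is not bound yet.
        [ -L "$ctrl/func" ] || ln -s "$func" "$ctrl" || exit 1
        # Toggle the controller: stop, then start.
        echo 0 > "$ctrl/start" || exit 1
        echo 1 > "$ctrl/start" || exit 1
    )
}

# Usage on the EP, as root:
#   configure_ep && echo "EP configured"
```

The early directory checks are there so that a missing module or controller is reported explicitly instead of surfacing as a silent failure later.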

Thanks,

Q1. You should see this kind of output in dmesg after writing 0 and then 1 to controllers/141c0000.pcie_ep/start:
------dmesg----
[ 421.843994] tegra194-pcie 141a0000.pcie: host bridge /pcie@141a0000 ranges:
[ 421.844023] tegra194-pcie 141a0000.pcie: IO 0x003a100000..0x003a1fffff -> 0x003a100000
[ 421.844030] tegra194-pcie 141a0000.pcie: MEM 0x2b28000000..0x2b2fffffff -> 0x0040000000
[ 421.844033] tegra194-pcie 141a0000.pcie: MEM 0x2740000000..0x2b27ffffff -> 0x2740000000
[ 421.951432] tegra194-pcie 141a0000.pcie: Link up
[ 421.951550] tegra194-pcie 141a0000.pcie: PCI host bridge to bus 0005:00
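
To check for these messages without reading the whole log, a small filter along the following lines may help. It is a sketch, not an official tool: `pcie_link_up` is a hypothetical helper name, and the controller name 141a0000.pcie is taken from the sample log above and may differ on other setups.

```shell
# Return 0 if the kernel log on stdin contains a "Link up" line for the
# given controller (for example 141a0000.pcie), and 1 otherwise.
pcie_link_up() {
    controller="$1"
    grep -F -- "$controller" | grep -q "Link up"
}

# Usage on the RP (reading dmesg may require root):
#   sudo dmesg | pcie_link_up 141a0000.pcie && echo "link is up"
```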

Q2. Actually, we have no way to read out the current FW version.
But you can follow the PCIe Retimer page
to flash it to the latest version.

Q4. It looks correct. Did you hit any error messages when you executed these commands?

Dear @BruceYangNV ,

About Q1
I checked dmesg and there were no messages about 141c0000.pcie (such as initialization or Link up messages).

About Q4
No error message was printed.

If there is anything else to check, please instruct me.

thanks,

Q1. If you cannot get those kernel messages, it might mean the two P3710s are not connected correctly. Could you describe how you connected your P3710s? What cable are you using? Where did you get the miniSAS cable? Thanks.

Q2. If you are sure that your connection is correct, then please try updating the retimer FW to see if that helps.

Q4. Please make sure the two P3710s are connected correctly first, then run those commands again.

Dear @BruceYangNV ,

About Q1
The following mini SAS cables are used to connect the P3710 to each other.

P/N 716195-B21

Can you give me information on the mini SAS cables that your company has used?

About Q2 and Q3
I updated the Retimer FW on both P3710s and nothing changed.

thanks,

Hi Shibata-a,
Could you please confirm that your two P3710s are connected as below? Thanks.
(Orin1) miniSAS port-A <-> (Orin2) miniSAS port-B

Dear @BruceYangNV

I tried the following configuration as instructed. Unfortunately, it did not help.
I also tried other port combinations, but they did not help either.

(Orin1) miniSAS port-A <-> (Orin2) miniSAS port-B

If there is anything else to check, please instruct me.

thanks,

Hi Shibata-a,
We have not hit this issue before.
According to your reply, the kernel log showed nothing when the EP was connected to the RP, which means the two P3710s are not connected correctly at the hardware level.
Please check the following items:

  1. Could you please help to check your bind command?
    ( The board name should be based on your devices)
    For Orin1:
    cd /drive/drive-foundation/make
    ./bind_partitions -b p3710-10-a01-f1 linux -s 1
    For Orin2:
    cd /drive/drive-foundation/make
    ./bind_partitions -b p3710-10-a01-f1 linux -s 2

  2. Please help to check your mini-SAS cable:
    If you are sure that the two P3710s are connected correctly through the mini-SAS cable, you may need to check whether the cable is functional and in good condition. (We have no way to check your cable, so you will need to do it on your own.)

  3. One thing I can think of is to check your P3710s one by one. If you can confirm your cable is good, then plug the cable from port-A to port-B on the same P3710.
    3-a, rebind (drop the option “-s 1” or “-s 2” when binding) and flash the image.
    3-b, insert both kernel modules on the same P3710:
    sudo modprobe nvscic2c-pcie-epc
    sudo modprobe nvscic2c-pcie-epf
    3-c, execute the rest of the commands from your earlier test, as below, to check it again.
    cd /sys/kernel/config/pci_ep/
    mkdir functions/nvscic2c_epf_22CC/func
    echo 0x10DE > functions/nvscic2c_epf_22CC/func/vendorid
    echo 0x22CC > functions/nvscic2c_epf_22CC/func/deviceid
    ln -s functions/nvscic2c_epf_22CC/func controllers/141c0000.pcie_ep

    echo 0 > controllers/141c0000.pcie_ep/start
    echo 1 > controllers/141c0000.pcie_ep/start
    3-d, if this P3710 is not working, then check the other P3710. By doing this, we should be able to tell whether one of them is broken. Both P3710s being broken at the same time would not make sense to me.
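
Before step 3-c it may be worth confirming that both modules from step 3-b actually loaded. The following is a sketch under two assumptions: `module_loaded` is a hypothetical helper name, and the modules are listed in /proc/modules under the names passed to modprobe with hyphens replaced by underscores (which is how the kernel normally reports module names).

```shell
# Return 0 if the named module appears in /proc/modules-style text on stdin.
# Hyphens are normalized to underscores before comparing names.
module_loaded() {
    name=$(printf '%s' "$1" | sed 's/-/_/g')
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on the P3710:
#   module_loaded nvscic2c-pcie-epf < /proc/modules || echo "EPF module not loaded"
#   module_loaded nvscic2c-pcie-epc < /proc/modules || echo "EPC module not loaded"
```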

Thanks.

Dear @BruceYangNV ,

I’d like to confirm something.

About Q1, Q3-a
Regarding the command you gave
(cd /drive/drive-foundation/make), should it be executed on the P3710 (Orin1)?

I tried to execute it on P3710(Orin1) as instructed by you and unfortunately, the result is as follows.

bash: cd: /drive/drive-foundation/make: No such file or directory

What should I do?
By any chance, do I need to run the bind command on the host PC??

Also, why do I need to re-flash?
Can I use SDK Manager for re-flashing?

thanks,

Hi Shibata-a,
Do you have the PDK 6.0.4.0?
If yes, then try going to the following folder (the previous command needs to be corrected as below)
cd drive-foundation/make
(If you use SDK manager to download the image, the path should be DRIVE_OS_6.0.4_SDK_Linux_DRIVE_AGX_ORIN_DEVKITS/DRIVEOS/drive-foundation/make)

By any chance, do I need to run the bind command on the host PC??

I think the answer is YES.

Also, why do I need to re-flash?
Can I use SDK Manager for re-flashing?

I think the default image on the P3710 does not set the correct soc_id.
Furthermore, if I remember correctly, SDK Manager doesn’t support the “-s” option.
Thanks.

Dear @BruceYangNV ,

According to your instructions, item 1 uses the “-s” option, but item 3-a says to drop the “-s” option.
Which is right?

Hi Shibata-a,
Item 1 applies when the two Orins are connected to each other. If that still fails, then go to Item 3, which checks each P3710 on its own.

One thing I can think of is to check your P3710 each by each

Dear @BruceYangNV ,

As per your instructions, I connected the mini SAS cable between ports A and B of the same P3710 and checked, but unfortunately no kernel messages (dmesg) were output.
I checked the other P3710 as well, and the results were the same on both.

I also prepared and checked another brand new mini SAS cable and unfortunately the results were unchanged.

Please provide information on the mini SAS cable you used.
If there is anything else I should check, please let me know.

Thanks,

@shibata-a

Please double-check by following the steps below:

  1. Connect miniSAS Port-A of NVIDIA DRIVE AGX Orin Devkit 1 (as RP) to miniSAS Port-B of NVIDIA DRIVE AGX Orin Devkit 2 (as EP) with a PCIe miniSAS cable.

  2. Boot DRIVE AGX Orin Devkit 2 (acting as EP) first, then:

  • Load EPF kernel module

sudo modprobe nvscic2c-pcie-epf

  • PCIe EPF hot plug

sudo -s
cd /sys/kernel/config/pci_ep/
mkdir functions/nvscic2c_epf_22CC/func
echo 0x10DE > functions/nvscic2c_epf_22CC/func/vendorid
echo 0x22CC > functions/nvscic2c_epf_22CC/func/deviceid
ln -s functions/nvscic2c_epf_22CC/func controllers/141c0000.pcie_ep
echo 0 > controllers/141c0000.pcie_ep/start
echo 1 > controllers/141c0000.pcie_ep/start

  3. Next, boot DRIVE AGX Orin Devkit 1 (acting as RP). After booting completes:
  • Verify that the PCIe link of Orin Devkit 2 (as EP) is detected:

sudo lspci | grep NVIDIA

c3:00.0 Serial controller: NVIDIA Corporation Device 22cc

  • If the PCIe link is detected (the lspci command reports “c3:00.0 Serial controller”), load the EPC kernel module:

sudo modprobe nvscic2c-pcie-epc

  • If the PCIe link is not detected, try reflashing the retimer firmware and check the miniSAS cable (this has happened to us: two miniSAS cables were broken in our setup).
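
The detection check in step 3 can also be scripted by matching the numeric vendor:device pair instead of the vendor name. This is a sketch: `has_pci_device` is a hypothetical helper name, and the ID 10de:22cc comes from the vendorid and deviceid values written on the EP in step 2.

```shell
# Return 0 if "lspci -n" style output on stdin lists the vendor:device
# pair given as $1 (for example 10de:22cc); the match is case-insensitive.
has_pci_device() {
    id=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    tr 'A-Z' 'a-z' | grep -q -w -F -- "$id"
}

# Usage on the RP:
#   sudo lspci -n | has_pci_device 10de:22cc && sudo modprobe nvscic2c-pcie-epc
# lspci can also filter by ID directly:
#   sudo lspci -d 10de:22cc
```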

Please also provide the output of the following command. Thanks.

$ cat /proc/device-tree/chosen/nvidia,sku_version
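
Device-tree string properties are NUL-terminated, so a plain cat can leave a stray NUL byte in pasted output. A small wrapper such as the following strips it; `read_dt_string` is a hypothetical helper name, not part of DRIVE OS.

```shell
# Print a device-tree string property without its NUL terminator,
# followed by a newline.
read_dt_string() {
    tr -d '\0' < "$1" && echo
}

# Usage on each P3710:
#   read_dt_string /proc/device-tree/chosen/nvidia,sku_version
```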

Dear @VickNV ,

I tried with the instructions you gave, but the RP did not recognize the EP as a PCIe device.

I checked the “nvidia,sku_version” of both P3710s.
The results are as follows.

1st P3710:  D00
2nd P3710:  TS5

↑The values are different. Is that a problem?

Also, if there is anything else you need me to check, please let me know.

thanks,

@shibata-a, I would like to double-check that you also ran the bind commands specifying the SoC IDs. Thanks.