Integration between the Jetson AGX Orin and ConnectX-6 Dx 100GbE card

This is the dmesg output after the unbind & bind:
dmesg output - after bind and unbind pcie@141a0000 (75.8 KB)

What are you looking at?

Hi,

Could you summarize how many of the ConnectX-6 Dx cards on your side have this problem?

Also, could you check whether the problem follows the card or a specific Jetson?

Sure.

I have 20 ConnectX-6 Dx cards. 6-7 of them get detected on most, if not all, Jetson machines.
The rest get detected on only a few Jetson machines or not at all.

The problem does not seem to be with the cards, because I connected all of them to an Intel-based desktop that I have and they were all detected right away.
So it really looks like a power issue.

Hi,

Let me check with the internal team whether this card requires a specific GPIO to be enabled to make it work.

A missing GPIO is a common reason for a card not being detected.

In the meantime, I noticed that your previous test (at the beginning of this thread) was based on an earlier JetPack version. I am not sure whether upgrading makes the situation better or worse. Could you also test more Jetsons to clarify that?

One more question which I forgot to ask.

Are all the Jetson boards here NV devkits, or are some of them custom boards?

Also, here are some tricks from our documentation for debugging this kind of issue.

https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/HR/JetsonModuleAdaptationAndBringUp/JetsonAgxOrinSeries.html?highlight=pcie#debug-pcie-link-up-failure

The PCIE_RP_APPL_DEBUG_0 register address is 0x141a00d0 in your case.

Reducing the link speed to Gen-1 means adding the “max-link-speed = <1>;” property to your PCIe controller node, the same way you added nvidia,disable-power-down.
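
For anyone following along, here is a minimal Python sketch of reading that register and pulling out the link state field. The address 0x141a00d0 comes from this thread; the bit layout (LTSSM state in bits [8:3]) is an assumption that should be checked against the linked debug guide, and the helper names are made up for illustration. Reading /dev/mem needs root, and touching the register while the controller is powered down may fault, which is one reason to keep nvidia,disable-power-down set while debugging.

```python
import mmap
import os
import struct

APPL_DEBUG_ADDR = 0x141A00D0  # PCIE_RP_APPL_DEBUG_0 for pcie@141a0000 (from this thread)
LTSSM_SHIFT = 3               # assumption: the LTSSM state field starts at bit 3
LTSSM_MASK = 0x3F             # assumption: the field is six bits wide (bits [8:3])


def ltssm_state(reg_value: int) -> int:
    """Extract the assumed LTSSM state field from a raw register value."""
    return (reg_value >> LTSSM_SHIFT) & LTSSM_MASK


def read_reg32(phys_addr: int, dev: str = "/dev/mem") -> int:
    """Read one 32-bit little-endian word at a physical address (needs root)."""
    page = mmap.PAGESIZE
    base = phys_addr & ~(page - 1)  # mmap offsets must be page aligned
    fd = os.open(dev, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, mmap.MAP_SHARED, mmap.PROT_READ, offset=base) as mem:
            return struct.unpack_from("<I", mem, phys_addr - base)[0]
    finally:
        os.close(fd)


# Usage on the target, as root:
#   value = read_reg32(APPL_DEBUG_ADDR)
#   print(f"raw=0x{value:08x} ltssm=0x{ltssm_state(value):02x}")
```

Comparing the state value across a run where the card is detected and one where it is not shows how far link training got in each case.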

By testing more Jetsons, do you mean with or without the “nvidia,disable-power-down” flag?

All the Jetson boards are 64GB NV devkits.

I will try to reduce the link speed to Gen-1.

  1. My point is that you have multiple Jetson devices and NIC cards on your side.
    Some of your Jetsons seem to be on an earlier JetPack version, while others are already on JetPack 5.1.2.
    I would like to get the software versions aligned first to make sure they are not the problem.

    Whether you add nvidia,disable-power-down does not matter. You could add it to give the card a better chance of being detected.

  2. “All the Jetson boards are 64GB NV devkits.”

I don’t think what you said is true. The device tree in some of your dmesg logs shows that they are 32GB modules…

I’m almost certain that both dmesg logs I shared here are from the same machine and it’s a Jetson AGX Orin 64GB devkit.

Adding the “nvidia,max-link-speed = <1>;” flag to pcie@141a0000 hasn’t changed anything (the NIC is still not detected).

Hi,

Integration between the Jetson AGX Orin and ConnectX-6 Dx 100GbE card - #26 by dekelram96

This one is from Orin 64GB.

These two are not Orin 64GB because the device tree is for SKU0. SKU0 has 32GB of DRAM.

If you are sure you are using a 64GB module even on these two, please check your board with the command below.

sudo tegrastats

The purpose here is to pin down which module triggers this issue: only 64GB, only 32GB, or both kinds.
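
To make that check less error-prone across many boards, a small Python sketch like the one below could pull the total RAM out of one line of tegrastats output. It assumes the line contains a field of the form “RAM used/totalMB”; the sample line in the comment is illustrative, not taken from this thread, and the 40000 MB cutoff and function names are arbitrary choices for the example.

```python
import re


def total_ram_mb(tegrastats_line: str) -> int:
    """Return the total RAM in MB from one line of tegrastats output."""
    match = re.search(r"RAM (\d+)/(\d+)MB", tegrastats_line)
    if match is None:
        raise ValueError("no 'RAM used/totalMB' field found")
    return int(match.group(2))


def guess_module(total_mb: int) -> str:
    """Rough guess: usable RAM is a bit below the nominal module size.

    The 40000 MB threshold is an assumption chosen to sit between the two sizes.
    """
    return "64GB" if total_mb > 40000 else "32GB"


# Illustrative line (made up for this example):
#   line = "RAM 3021/30536MB (lfb 6152x4MB) SWAP 0/15268MB"
#   print(guess_module(total_ram_mb(line)))
```

Running this on every board and noting the result next to each dmesg log would remove any doubt about which log came from which module.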

I will check, but is there any difference between the 32GB and the 64GB modules that affects the detection of the NICs or the power during the power-up phase?

Ideally, they should behave the same. But they use different software, which may lead to different behavior.

As in my previous comment, I would like to clarify whether this issue happens on only one kind of module or on every kind.

This issue happens on every module I have (so every kind).
After adding the two flags to pcie@141a0000:

nvidia,disable-power-down;
nvidia,max-link-speed = <1>;

I got the same result: no detection.
On the second Jetson we grounded the NIC (with a wire connected to the metal part of the NIC and to the metal part of the back of my PC), and the NIC was detected 2 out of 7 times.

Any ideas?

“Let me check with internal team if this card requires some specific gpio to be enabled to make it work.”

Also, any new leads on that?

Hi,

It is not “nvidia,max-link-speed”. Use only “max-link-speed = <1>;”.

Not every property needs the “nvidia” prefix.
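
A misnamed property is silently ignored, so it can help to confirm what actually ended up in the live device tree after flashing. The Python sketch below is one way to do that under a few assumptions: the live tree is exposed at /sys/firmware/devicetree/base, the node directory is named pcie@141a0000 somewhere under it, and the function names are hypothetical. Device tree integer cells are stored big-endian, hence the ">I" format.

```python
import os
import struct

DT_ROOT = "/sys/firmware/devicetree/base"  # assumption: live device tree location
NODE_NAME = "pcie@141a0000"                # controller node discussed in this thread


def find_nodes(root: str = DT_ROOT, name: str = NODE_NAME) -> list:
    """Return every directory called `name` below `root`."""
    return [os.path.join(d, name) for d, subdirs, _ in os.walk(root) if name in subdirs]


def check_node(node: str) -> dict:
    """Report the link-speed and power-down properties found on one node."""
    def has(prop: str) -> bool:
        return os.path.isfile(os.path.join(node, prop))

    speed = None
    if has("max-link-speed"):
        with open(os.path.join(node, "max-link-speed"), "rb") as f:
            data = f.read()
        if len(data) >= 4:
            speed = struct.unpack(">I", data[:4])[0]  # one big-endian 32-bit cell
    return {
        "max_link_speed": speed,
        "disable_power_down": has("nvidia,disable-power-down"),
        "misnamed_speed": has("nvidia,max-link-speed"),  # the mistaken spelling
    }


# Usage on the target:
#   for node in find_nodes():
#       print(node, check_node(node))
```

If misnamed_speed comes back true or max_link_speed is None, the Gen-1 limit is not in effect and the test results with it should be discarded.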

After removing the “nvidia” prefix from the “max-link-speed = <1>” property, the card was detected after the first reboot. I then tried 3 more times, but the card was not detected.
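
Since detection is intermittent, it may be worth logging the result of every boot rather than checking by hand. A minimal Python sketch, assuming the standard Linux sysfs layout under /sys/bus/pci/devices and using the Mellanox PCI vendor ID 0x15b3; the function name and the idea of running it at boot are illustrative only.

```python
import os

MELLANOX_VENDOR_ID = 0x15B3  # PCI vendor ID used by ConnectX cards


def find_devices(vendor_id: int = MELLANOX_VENDOR_ID,
                 sysfs_root: str = "/sys/bus/pci/devices") -> list:
    """Return the PCI addresses of all enumerated devices from one vendor."""
    found = []
    if not os.path.isdir(sysfs_root):
        return found  # no PCI bus enumerated at all
    for address in sorted(os.listdir(sysfs_root)):
        try:
            with open(os.path.join(sysfs_root, address, "vendor")) as f:
                vendor = int(f.read().strip(), 16)  # file content looks like "0x15b3"
        except (OSError, ValueError):
            continue
        if vendor == vendor_id:
            found.append(address)
    return found


# Usage on the target:
#   nics = find_devices()
#   print("detected:" if nics else "not detected", nics)
```

Appending the output to a file on each boot gives a detection rate per Jetson and per card, which is easier to compare than counts from memory.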

Could you share the lspci -vvv output from a boot where the card is detected? It does not matter whether the max speed is set.

Sure.
lspci -vvv output in detection (3.6 KB)

Since it looks like you are not familiar with this, sorry for the omission in my previous command.

You need to run lspci with sudo. If you read the file you shared, you will see why.

Sorry about that, I missed the “Capabilities” field.
This one was captured with sudo privileges - lspci -vvv output in detection.txt (22.2 KB)

I don’t know which bit belongs to which error, but you have some uncorrectable errors (bit 4, since the First Error Pointer is 04, but I don’t know whether that counts from the most significant or the least significant bit):

AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-

There were other errors as well. Just search for “First Error Pointer”, and ignore it if the pointer is 00. The errors were uncorrectable. I can’t tell you if those were logic or checksum errors.
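
As a rough aid for reading those lines, the Python sketch below extracts every “First Error Pointer” value from saved lspci -vvv text and maps it to a bit name. It assumes the pointer is printed as two hex digits, as in the line quoted above, and that it indexes bits of the AER Uncorrectable Error Status register counted from the least significant bit (bit 0). The name table is written from memory of the PCI Express specification and should be verified against it; the pointer is also only meaningful when the matching status bit is actually set.

```python
import re

# Assumed bit names for the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def first_error_pointers(lspci_text: str) -> list:
    """Return every First Error Pointer value found in lspci -vvv output."""
    matches = re.findall(r"First Error Pointer:\s*([0-9a-fA-F]{2})", lspci_text)
    return [int(value, 16) for value in matches]


def describe(pointer: int) -> str:
    """Map a pointer value to a bit name, if the table knows it."""
    return UNCORRECTABLE_BITS.get(pointer, f"bit {pointer} (not in this table)")


# Usage with a saved capture:
#   with open("lspci.txt") as f:
#       for p in first_error_pointers(f.read()):
#           print(p, describe(p))
```

Under those assumptions, a pointer of 04 would correspond to a Data Link Protocol Error, which is consistent with a marginal link rather than a software fault.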

You might find this of interest:
http://trac.gateworks.com/wiki/PCI