This is the dmesg output after the unbind & bind:
dmesg output - after bind and unbind pcie@141a0000 (75.8 KB)
What are you looking at?
Hi,
Could you summarize how many of the ConnectX-6 Dx cards on your side have this problem?
Also, could you check whether the problem follows the card or a specific Jetson?
Sure.
I have 20 ConnectX-6 Dx cards; 6-7 of them get detected on most if not all Jetson machines.
The rest get detected on only a few Jetson machines or not at all.
The problem does not seem to be with the cards, because I connected all of them to an Intel-based desktop that I have and they all got detected right away.
So it really looks like a power issue.
Hi,
Let me check with the internal team whether this card requires a specific GPIO to be enabled to make it work.
This is a common reason for a card not getting detected.
In the meantime, I notice your previous test (at the beginning of this thread) was based on an earlier JetPack version. I am not sure whether upgrading makes the situation worse or better. Could you also test more Jetsons to clarify that?
And one question which I forgot to ask.
Are all the jetson boards here NV devkit or some custom board?
Also, here are some tricks mentioned in our documentation for debugging this kind of issue.
PCIE_RP_APPL_DEBUG_0 register address is 0x141a00d0 for your case.
Reducing the link speed to Gen-1 means adding the "max-link-speed = <1>;" property to your PCIe controller node, the same way you added nvidia,disable-power-down.
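As a rough aid for the register check above, the sketch below decodes a raw 32-bit value read from 0x141a00d0. The field layout (LTSSM state in bits 8:3, with 0x11 meaning L0) is an assumption taken from the upstream Linux pcie-tegra194 driver, not something stated in this thread, and the function names are hypothetical.

```python
# Field layout assumed from the upstream Linux pcie-tegra194 driver
# (APPL_DEBUG at offset 0xD0). Verify against your BSP before relying on it.
LTSSM_STATE_SHIFT = 3
LTSSM_STATE_MASK = 0x3F << LTSSM_STATE_SHIFT  # bits 8:3
LTSSM_STATE_L0 = 0x11  # assumed encoding for "link up, L0"


def ltssm_state(appl_debug: int) -> int:
    """Extract the LTSSM state field from a raw 32-bit register value."""
    return (appl_debug & LTSSM_STATE_MASK) >> LTSSM_STATE_SHIFT


def link_in_l0(appl_debug: int) -> bool:
    """Return True if the decoded LTSSM state equals the assumed L0 value."""
    return ltssm_state(appl_debug) == LTSSM_STATE_L0
```

If the decoded state never reaches L0 while the card is plugged in, that would suggest the link is stuck somewhere in detection or training rather than failing later during enumeration.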
By testing more Jetsons, do you mean with or without adding the "nvidia,disable-power-down" flag?
All the Jetson boards are 64GB NV devkits.
I will try to reduce the link speed to Gen-1.
Whether you add disable-power-down does not matter. You could add it to give the card a better chance of being detected.
Regarding "All the Jetson boards are 64GB NV devkit": I don't think that is true. The device tree in some of your dmesg logs shows they are 32GB modules...
I'm almost certain that both dmesg logs I shared here are from the same machine, and it's a Jetson AGX Orin 64GB devkit.
Adding the "nvidia,max-link-speed = <1>;" flag to pcie@141a0000 hasn't changed anything (the NIC is still not detected).
Hi,
Integration between the Jetson AGX Orin and ConneectX-6 DX 100GbE card - #26 by dekelram96
This one is from Orin 64GB.
These two are not Orin 64GB because the device tree is for sku0. Sku0 is 32GB DRAM.
If you are sure you are using a 64GB module even on these two, please check your board with the command below.
sudo tegrastats
The purpose here is to align on which module triggers this issue: only 64GB, only 32GB, or both kinds.
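A minimal sketch for telling the two module kinds apart from one line of tegrastats output. It assumes the line contains a field of the form "RAM used/totalMB"; the 40000 MB threshold and the function names are illustrative assumptions, chosen only because usable RAM is somewhat below the nominal size.

```python
import re

# Matches the "RAM <used>/<total>MB" field assumed to be in tegrastats output.
RAM_FIELD = re.compile(r"RAM (\d+)/(\d+)MB")


def total_ram_mb(tegrastats_line: str) -> int:
    """Return the total RAM in MB reported in one tegrastats line."""
    match = RAM_FIELD.search(tegrastats_line)
    if match is None:
        raise ValueError("no 'RAM used/totalMB' field found")
    return int(match.group(2))


def module_kind(total_mb: int) -> str:
    """Classify the module by total RAM (threshold is an assumption)."""
    return "64GB" if total_mb > 40000 else "32GB"
```

Running this against one captured line per board would give a quick table of which boards are 32GB and which are 64GB, to set against the detection results.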
I will check, but is there any difference between the 32GB and the 64GB modules that affects the detection of the NICs or the power during the power-up phase?
Ideally, they should behave the same. But they use different software, which may lead to different behavior.
As in my previous comment, I would like to clarify whether this issue happens with only one kind of module or with every kind.
This issue happened on every module I have (so every kind).
After adding these two flags to pcie@141a0000:
nvidia,disable-power-down;
nvidia,max-link-speed = <1>;
I got the same result - no detection.
On the second Jetson we grounded the NIC (by adding a wire connected to the metal part of the NIC and to the metal part of the back of my PC). After that, the NIC got detected 2 out of 7 times.
Any ideas?
Plus, any new leads from the internal team on whether this card requires a specific GPIO to be enabled?
Hi,
Not "nvidia,max-link-speed". Only max-link-speed = <1>;.
Not every property needs "nvidia" as a prefix.
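Since the prefixed and unprefixed names are easy to mix up, here is a minimal sketch for checking which of the two actually ended up in the live device tree. It assumes the controller node is exposed as a directory of property files (typically somewhere under /proc/device-tree; the exact node path depends on the release and is not given here), and that the property is a single big-endian 32-bit cell. The function names are hypothetical.

```python
import struct
from pathlib import Path

CANDIDATE_NAMES = ("max-link-speed", "nvidia,max-link-speed")


def decode_u32_cell(raw: bytes) -> int:
    """Decode one device tree cell (4 bytes, big-endian) into an int."""
    if len(raw) != 4:
        raise ValueError(f"expected 4 bytes, got {len(raw)}")
    return struct.unpack(">I", raw)[0]


def find_link_speed_props(node: Path) -> dict:
    """Return the candidate properties present in the node, with values."""
    found = {}
    for name in CANDIDATE_NAMES:
        prop = node / name
        if prop.is_file():
            found[name] = decode_u32_cell(prop.read_bytes())
    return found
```

If the result only contains the "nvidia," variant, the setting was most likely written under a name the driver does not read, which would be consistent with the earlier test showing no change.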
After removing the "nvidia" prefix from the "max-link-speed = <1>" flag, the card got detected after the first reboot. I then tried 3 more times, but the card was not detected.
Could you share the lspci -vvv result when the card got detected? With or without setting max speed does not matter.
Since it looks like you are not familiar with this, sorry for what I left out of the previous command.
You need to run lspci with sudo. You can read the file you shared and you will know why.
Sorry about that, I missed the "Capabilities" field.
This one is with the sudo privileges - lspci -vvv output in detection.txt (22.2 KB)
I don't know which bit belongs to which error, but you have some uncorrected errors (bit 4, since the First Error Pointer is 04, though I don't know whether that counts from the most or the least significant bit):
AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
There were other errors as well. Just search for "First Error Pointer", and ignore it if the pointer is 00. The errors were uncorrectable. I can't tell you if those were logic or checksum errors.
You might find this of interest:
http://trac.gateworks.com/wiki/PCI
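For anyone decoding these later: a minimal sketch that pulls every "First Error Pointer" value out of saved `lspci -vvv` text and maps it to a bit in the AER Uncorrectable Error Status register. The bit names follow the PCIe specification's layout of that register as I understand it, and lspci prints the pointer in hex. One caveat: the pointer is only meaningful while the matching status bit is actually set, so treat the result as a hint rather than a diagnosis. The function names are hypothetical.

```python
import re

# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}

# lspci prints the pointer as two hex digits.
FEP_FIELD = re.compile(r"First Error Pointer:\s*([0-9a-fA-F]{2})")


def first_error_pointers(lspci_text: str) -> list:
    """Return every First Error Pointer value found in the text."""
    return [int(m.group(1), 16) for m in FEP_FIELD.finditer(lspci_text)]


def describe(pointer: int) -> str:
    """Map a pointer value to an error name, if the table has one."""
    return UNCORRECTABLE_BITS.get(pointer, f"bit {pointer} (not in this table)")
```

With the AERCap line quoted above, the pointer value 04 maps to a Data Link Protocol Error in this table.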