PCIe and FAN issue

Hi,
We are having an issue with one of our Xavier boards where some PCIe cards do not work. I have tried three PCIe cards: an Intel I350 and an Intel XL710 both work OK, but an Intel X520 does not.
All these cards work ok on our other Xavier.

Looking through the forum, this seems very similar to the issue in https://devtalk.nvidia.com/default/topic/1048223/jetson-agx-xavier/-3-3v-missing-on-jetson-xavier-pcie-slot/
where nvidia,plat-gpios is not configured in the device tree, which could explain why the 3.3 V rail is not enabled.

I collected some info from our system, and it looks similar to the other thread.

root@xavier:/sys/kernel/debug# ls /proc/device-tree/chosen/plugin-manager/ids
2888-0001-400 name XXXX-XXXX-XXX

I don’t see any "I> node /plugin-manager/fragment-pcie-p2822-B00 matches" line in the cboot log.

Comparing a working Xavier with this board, I see that gpio@2200000/pcie-reg-enable is set to disabled on this one.

cat /proc/device-tree/gpio@2200000/pcie-reg-enable/status
disabled
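
For anyone who wants to script this check, here is a minimal Python sketch. It assumes the node path shown above exists on the board and relies on the usual device-tree convention that string properties are NUL-terminated and that a missing status property means the node is enabled. The helper names are made up for illustration.

```python
from pathlib import Path

def decode_dt_string(raw: bytes) -> str:
    """Device-tree string properties carry a trailing NUL; strip it."""
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")

def status_is_enabled(raw) -> bool:
    """Treat a missing status property (None) as enabled, per DT convention."""
    if raw is None:
        return True
    return decode_dt_string(raw) in ("okay", "ok")

if __name__ == "__main__":
    # Path taken from the post above; it only exists on the target board.
    prop = Path("/proc/device-tree/gpio@2200000/pcie-reg-enable/status")
    if prop.parent.exists():
        raw = prop.read_bytes() if prop.exists() else None
        print("pcie-reg-enable enabled:", status_is_enabled(raw))
    else:
        print("node not found:", prop.parent)
```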

The other fault on the same board is that the FAN_PWM signal appears to be inverted: the board starts up with the fan at max rpm, and when the board gets hotter the fan turns off.

I made the following change in the device tree, which seems to work as a workaround, but I suspect there is some other underlying reason for the issue.

--- a/kernel-dts/t19x-common-platforms/tegra194-pwm-fan.dtsi
+++ b/kernel-dts/t19x-common-platforms/tegra194-pwm-fan.dtsi
@@ -25,7 +25,7 @@
                state_cap_lookup = <2 2 2 2 3 3 3 4 4 4>;
                pwm_period = <45334>;
                pwm_id = <4>;
-               pwm_polarity = <PWM_POLARITY_INVERTED>;
+               pwm_polarity = <PWM_POLARITY_NORMAL>;
                suspend_state = <1>;
                step_time = <100>; /* mesecs */
                state_cap = <7>;
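
To illustrate why a wrong polarity gives exactly this symptom, here is a small Python sketch of the relationship between the requested duty cycle and the time the line is physically high. It assumes pwm_period is in nanoseconds and that the fan runs faster the longer the line is high; neither is confirmed in this thread, and the function name is hypothetical.

```python
PWM_PERIOD_NS = 45334  # pwm_period from the dtsi above (unit assumed to be ns)

def high_time_ns(requested_duty_ns, inverted, period_ns=PWM_PERIOD_NS):
    """Return how long the output is physically high during one period."""
    if not 0 <= requested_duty_ns <= period_ns:
        raise ValueError("duty must lie within the period")
    # With inverted polarity the high and low phases swap places.
    return period_ns - requested_duty_ns if inverted else requested_duty_ns

if __name__ == "__main__":
    for duty in (0, PWM_PERIOD_NS // 2, PWM_PERIOD_NS):
        print(duty, high_time_ns(duty, inverted=False), high_time_ns(duty, inverted=True))
```

With the wrong polarity, a request for 0 % (cool board) drives the line high for the whole period and a request for 100 % (hot board) drives it low, which would match a fan that starts at full speed and stops as the board heats up.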

Any suggestions? Currently I suspect there is a hardware issue with this board, possibly in the power distribution.

Br,
S

I checked the device tree again. It looks like the fan polarity is set to inverted because the match for ids = “>=2822-0000-400” is missing.
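
As a rough illustration of what such an ids rule might be checking, here is a Python sketch. The matching semantics are an assumption (same board number and SKU, with the last field compared numerically); the real logic lives in the bootloader's plugin-manager and may differ. The function names are hypothetical.

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def id_matches(pattern, board_id):
    """Assumed semantics: optional comparison prefix applies to the last field."""
    m = re.match(r"(>=|<=|>|<)?(.+)$", pattern)
    op, wanted = m.group(1), m.group(2)
    if op is None:
        return board_id == wanted
    w, b = wanted.split("-"), board_id.split("-")
    if len(w) != 3 or len(b) != 3 or w[:2] != b[:2]:
        return False  # different board number or SKU
    try:
        return _OPS[op](int(b[2]), int(w[2]))
    except ValueError:
        return False  # non-numeric last field; not handled in this sketch

def any_id_matches(pattern, ids):
    return any(id_matches(pattern, i) for i in ids)

if __name__ == "__main__":
    # ids as listed under /proc/device-tree/chosen/plugin-manager/ids on the faulty board
    print(any_id_matches(">=2822-0000-400", ["2888-0001-400", "XXXX-XXXX-XXX"]))
```

Under these assumptions, the ids reported by the faulty board contain no 2822 entry at all, so the rule would not fire.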

Are you using your own custom carrier board?

This explains the fan issue. Any ideas how to troubleshoot the missing 2822-0000-400?

This is not a custom board. I used the included carrier board together with JetPack 4.1.1 rootfs / kernel.

Hi,

Thanks for your reply. Could you dump the EEPROM contents of your carrier board using:

sudo i2cdump -y 0 0x56

Hi,

Looks empty.

root@xavier:~# sudo i2cdump -y 0 0x56
No size specified (using byte-data access)
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
40: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
50: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
60: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
70: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
90: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
b0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
c0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
d0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
e0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
f0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
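
If you need to check several boards, a dump like this can be tested for "blank" automatically. Below is a minimal Python sketch that parses the text output of i2cdump and reports whether every byte is 0xff. It assumes the byte-mode layout shown above (a two-digit row address, a colon, then 16 hex bytes); the function names are made up.

```python
import re

# One data row: "10: 00 01 b8 ..." followed by an ASCII column we ignore.
_ROW = re.compile(r"^([0-9a-f]{2}):((?:\s+[0-9a-f]{2}){16})", re.IGNORECASE)

def parse_i2cdump(text):
    """Return {address: value} for every complete row found in the text."""
    data = {}
    for line in text.splitlines():
        m = _ROW.match(line.strip())
        if not m:
            continue  # prompt, header, or malformed row
        base = int(m.group(1), 16)
        for i, tok in enumerate(m.group(2).split()):
            data[base + i] = int(tok, 16)
    return data

def looks_blank(data):
    """An erased or unprogrammed EEPROM typically reads back as all 0xff."""
    return bool(data) and all(v == 0xFF for v in data.values())
```

Usage would be something like looks_blank(parse_i2cdump(output)), where output is the captured text of sudo i2cdump -y 0 0x56.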

How many devkit carrier boards and modules do you have? Do you always use the same devkit + module combination for the test?

You could try putting the problematic module on a different carrier board and see whether i2cdump gives a result or not.

If the EEPROM returns nothing, that may cause the id to be missing. Please confirm whether there is only one carrier board that gives an empty dump.

I have 2 kits on my desk at the moment.

I tried the i2cdump command on both of them. On the working kit I get the following.

root@xavier:/etc# sudo i2cdump -y 0 0x56
No size specified (using byte-data access)
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
00: 01 00 ff 00 06 0b 00 00 06 44 00 00 00 00 00 00 ?..??..?D…
10: 00 01 b8 20 36 39 39 2d 38 32 38 32 32 2d 30 30 .?? 699-82822-00
20: 30 30 2d 36 30 30 20 44 2e 30 00 00 00 00 00 00 00-600 D.0…
30: 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
40: ff ff ff ff ff ff ff ff ff ff 30 34 32 34 31 31 …042411
50: 38 30 33 38 33 37 36 00 00 00 00 00 00 00 00 00 8038376…
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
90: 00 00 00 00 00 00 46 46 46 46 ff ff 46 46 ff ff …FFFF…FF…
a0: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff …
b0: ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 78 …x
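
The readable part of this dump can be pulled out with a few lines of Python. The sketch below scans raw bytes for runs of printable ASCII; the sample bytes are rows 0x10 and 0x20 copied from the dump above, and the function name is hypothetical. Nothing here relies on knowing the actual EEPROM field layout.

```python
import re

def printable_runs(data: bytes, base=0, min_len=4):
    """Return (offset, text) for each run of printable ASCII of at least min_len bytes."""
    runs = []
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        text = m.group().decode("ascii")
        stripped = text.strip()
        if len(stripped) >= min_len:
            # Skip leading spaces so the offset points at the first visible character.
            lead = len(text) - len(text.lstrip())
            runs.append((base + m.start() + lead, stripped))
    return runs

# Rows 0x10 and 0x20 of the working carrier board's dump.
SAMPLE = bytes.fromhex(
    "00 01 b8 20 36 39 39 2d 38 32 38 32 32 2d 30 30"
    "30 30 2d 36 30 30 20 44 2e 30 00 00 00 00 00 00"
)

if __name__ == "__main__":
    for offset, text in printable_runs(SAMPLE, base=0x10):
        print(hex(offset), text)
```

For the sample this yields the string 699-82822-0000-600 D.0 at offset 0x14, which contains the 2822-0000-600 fields that the plugin-manager ids are expected to carry.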

I later moved the faulty carrier board to the working module, and I get an empty output from i2cdump.

Also, when putting the working carrier board on the faulty module, it starts to work. So I think I can say the fault follows the carrier board. Is it possible that the EEPROM was not written correctly?

Is there a way I can update the EEPROM and see if this fixes the carrier board that is not working?

You could just copy the values from the working carrier board to the broken one using the i2cset tool.
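
One way to do that copy without typing 256 commands by hand is to generate them. The Python sketch below only builds the i2cset command lines as text, one byte write per address, from a mapping of address to value; it does not touch the hardware. It assumes bus 0 and chip address 0x56 as in the dumps above, and that the EEPROM accepts plain byte writes and is not write-protected. The function name is made up.

```python
def i2cset_commands(data, bus=0, chip=0x56):
    """Build one 'i2cset ... b' (byte write) command per address, in address order."""
    return [
        f"i2cset -y {bus} 0x{chip:02x} 0x{addr:02x} 0x{val:02x} b"
        for addr, val in sorted(data.items())
    ]

if __name__ == "__main__":
    # First few bytes of the working board's dump, as an example.
    example = {0x00: 0x01, 0x01: 0x00, 0x02: 0xFF, 0x03: 0x00}
    for cmd in i2cset_commands(example):
        print(cmd)
```

Review the generated list before running it with sudo, and keep in mind that a byte-for-byte copy also duplicates any board-specific fields stored in the donor EEPROM.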

I have copied the data from the other carrier board, and now everything seems to work again.

tbuser@xavier:~$ cat /proc/device-tree/gpio@2200000/pcie-reg-enable/status
okay
tbuser@xavier:~$ ls /proc/device-tree/chosen/plugin-manager/ids
2822-0000-600 2888-0001-400 name
tbuser@xavier:~$ lspci
0001:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad2 (rev a1)
0001:01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9171 (rev 12)
0003:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1)
0003:01:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
0003:01:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Thanks.