Nvethernet PTP bug

Hmm, that’s weird. We’re using the ConnectTech Forge carrier. It specifically says that the two 10 GbE ports use AQR113 PHYs originating from MGBE0 and MGBE1 on the AGX Orin module.

I see the AGX Orin datasheet has been revised to talk about only one MGBE controller. But I physically see two 10 GbE ports on the carrier board and both work. And lspci doesn’t list any of these ports, so they can’t be PCI-connected. Also, they use the nvethernet driver. That’s why I conclude the MACs for these ports have to be on the module and not on the carrier board.

Nevertheless, issue 2 is a clear bug. You can see it just by examining the code. Try running sudo hwstamp_ctl -i eth0 -r 12 -t 1 on any nvethernet-handled interface and watch dmesg.

Regarding the other two bugs, it’s irrelevant whether there are one or two 10 GbE ports, so you can easily test them with the devkit. Just connect the 10 GbE port to another device and use eth0 instead of mgbe0 in the reproducers.

Issue 1 might actually be a problem of the AQR113 PHY used by ConnectTech. I couldn’t find explicit info about which type of PHY is on the devkit (the datasheet only speaks of “a MGBE PHY”), but I guess it could be the same AQR113 (based on the devkit’s device tree file, which uses the same nvidia,eqos-mdio PHY and contains the definition mgbe0_aqr113c_phy = "/ethernet@6810000/mdio/ethernet_phy@0").

Issue 3 also seems much more related to the nvethernet driver than to anything ConnectTech could do.

Hi,

The problem here is that we tried to reproduce your issue many times this weekend but could not reproduce it.

We actually don’t support this 2 MGBE design. ConnectTech did this by themselves.
Vendors add many kinds of customizations. What the SoC is capable of is not the same as the software scope we can support.

I know that their design, and why NV cannot support it, is not your responsibility.
But as I cannot reproduce this locally, please report this issue to ConnectTech. I will let them discuss it with us directly.

Or you can try to get one NV devkit and try to use that to reproduce this issue. I guess you probably don’t want to do that.

Some details about this issue:

Try running sudo hwstamp_ctl -i eth0 -r 12 -t 1 on any nvethernet-handled interface and watch dmesg.

Yes, we tried that this afternoon, running it about 30 times in a row. Our nvethernet driver directly reported an unsupported ioctl, but not the issue you hit.

sudo ptp4l -H -i eno1 -f /etc/linuxptp/automotive-master.cfg --step_threshold=0.001 -m -l6

As for this, we ran it over a weekend and the error you mentioned did not occur either.
We will adjust the timeout value and test more.

Okay, thanks a lot for trying!

We should have one devkit. I’m just not sure whether it’s currently in use. I’ll have a look and try to get it for testing.

Hmm, this is a very important thing given the AGX Orin modules were first advertised with the 4 MGBEs. I haven’t found any rationale for this change, just your announcements here on the forum. Did NVidia stop supporting the 4 MGBEs because of a known design problem, or is it just to limit the scope of support? Can we rely on the ConnectTech setup to work (if it’s done properly from their side)?

Do you use the master branch of linuxptp? I remember there was a problem I fixed there around a year ago regarding this.

Unfortunately, Jetson does not have 4 MGBEs (or any configuration with more than 1 MGBE) validated on any of the boards we have on our side. The first announcement last year was a mistake because it was taken directly from the Drive platform, not from Jetson. This was never validated on Jetson, and there is no QA to validate the functionality in every release. That is what “no support” means. A supported feature is validated in each release, and if a bug is found, we try to fix it.

That is why we removed it from the design guide document.
If you don’t consider my comment an official one, you can refer to the design guide document. It was removed from there a year ago.

We cannot debug an issue without having the carrier board here. At the very least, our partner ConnectTech should help us and file a bug ticket so that we can fix this part.
It is actually their duty to report bugs to us, rather than yours. ConnectTech has more information about their board than you do.

Ok, if the linuxptp version matters then I will install it again.

Please also clarify whether issue 2 can be reproduced on the devkit. Thanks.

I’ve got my hands on the devkit. Results are:

  • Issue 1: Happened with tx_timestamp_timeout set to 1 and connected via a 10 Gbps link. MTBF was around 60 s. However, this error rate only set in after some time running. At first, I thought the errors were not there, but after leaving it running for approx. 1 hour, I saw the log filling with errors at this rate.
    • Testing now with timeout set to 10, will report results later.
    • To clean up the log a bit so that you see the error, pass -l6 instead of -l7 to ptp4l on Orin. It will still show everything that’s needed.
    • Also, is the PTP sync actually running in your testbed? I.e., are you seeing lines starting with rms XXXX max YYYY logged every second, with XXXX being a relatively small number (< 10 000)?
    • You can also try changing delay_mechanism to E2E. But you need to do it on both the master and the slave computer. This is the only deviation from the default configs I use.
  • Issue 2: Replicated with sudo hwstamp_ctl -i eth0 -r 12 -t 1 and linuxptp from the master branch.
  • Issue 3: Doesn’t happen. Not sure why. I’ll try flashing another module with the CT carrier to figure out whether it’s just some glitch in the existing configuration.
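
To make the "is PTP sync actually running" check from the list above less of an eyeball exercise, here is a minimal Python sketch. It only assumes that the summary lines contain the text rms XXXX max YYYY as described above; the function name and the 10 000 default are mine, not anything from linuxptp.

```python
import re

# Matches the summary lines described above: "... rms <number> max <number> ...".
RMS_LINE = re.compile(r"\brms\s+(\d+)\s+max\s+(\d+)")


def sync_health(log_text, rms_limit=10_000):
    """Return (summary_lines_seen, summary_lines_with_rms_below_limit).

    rms_limit mirrors the "< 10 000" rule of thumb from the post; it is a
    judgment call, not a linuxptp constant.
    """
    seen = 0
    ok = 0
    for line in log_text.splitlines():
        match = RMS_LINE.search(line)
        if match is None:
            continue  # not a summary line
        seen += 1
        if int(match.group(1)) < rms_limit:
            ok += 1
    return seen, ok
```

Feed it the text captured from ptp4l running with -m. If the first number stays at 0, the slave never produced a summary line, so the sync is probably not running at all.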

I completely trust your comments to be valid ;) I just haven’t seen any reason given for the removal. Such an important change deserves one, I’d say. So the current status is that the ConnectTech Forge board works with both 10 GbE ports, but some future update might break it. Do I get that right?

I’m a bit confused because their DT config says:

ODMDATA="gbe-uphy-config-22,hsstp-lane-map-3,nvhs-uphy-config-2,hsio-uphy-config-0,gbe0-enable-10g,gbe1-enable-10g"

Which should select Config # 1 with only one mgbe (mode 22, not 25). But both mgbe0 and mgbe1 show up and work. Do you have an explanation for this?
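
For anyone who wants to check the same thing on their own board, this is a small Python sketch that splits the ODMDATA string above into numbered settings and plain flags. The splitting rule (an entry ending in a dash plus digits is a name/number pair) is only my reading of this one string, not a documented grammar, and parse_odmdata is a made-up helper name.

```python
import re

ODMDATA = ("gbe-uphy-config-22,hsstp-lane-map-3,nvhs-uphy-config-2,"
           "hsio-uphy-config-0,gbe0-enable-10g,gbe1-enable-10g")


def parse_odmdata(odmdata):
    """Split ODMDATA into ({setting: number}, [flags]).

    Assumption: an entry that ends in "-<digits>" is a name/number pair,
    everything else is a flag.
    """
    settings = {}
    flags = []
    for entry in odmdata.split(","):
        entry = entry.strip()
        match = re.fullmatch(r"(.+)-(\d+)", entry)
        if match:
            settings[match.group(1)] = int(match.group(2))
        elif entry:
            flags.append(entry)
    return settings, flags


settings, flags = parse_odmdata(ODMDATA)
print("gbe-uphy-config:", settings["gbe-uphy-config"])
print("flags:", flags)
```

With the string above this reports gbe-uphy-config as 22, together with both the gbe0-enable-10g and gbe1-enable-10g flags, which is exactly the combination I find confusing.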

Hi,

Sorry for one more question. What is the exact setup for reproducing issue 2?
You mentioned linuxptp from the master branch, but I thought all I need to run here is hwstamp_ctl, again and again. When should I run linuxptp in the issue 2 case?

I don’t want to explain this too much as others may think it is supported. What I can say is that ODMDATA is correct for their mgbe cases.

I have tested Issue 1 overnight on the devkit with the default (10 ms) tx_timestamp_timeout. The issue manifests on the devkit, too. These are the timestamps (in seconds since boot) at which the errors happened: 13992, 30352, 35835, 40432, 44913, 45446, 45990, 47165, 49724, 64712, 66243. The errors can easily be seen using sudo journalctl -b -p err -f.
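
To put a number on how often that is, here is a quick Python calculation over exactly the timestamps listed above (nothing else is assumed):

```python
from statistics import mean

# Error timestamps from the overnight devkit run, in seconds since boot.
ERROR_TIMES_S = [13992, 30352, 35835, 40432, 44913, 45446,
                 45990, 47165, 49724, 64712, 66243]


def intervals(times):
    """Gaps between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


gaps = intervals(ERROR_TIMES_S)
print(f"errors: {len(ERROR_TIMES_S)}")
print(f"shortest gap: {min(gaps)} s, longest gap: {max(gaps)} s")
print(f"mean gap: {mean(gaps):.1f} s")
```

The gaps range from 533 s to 16360 s with a mean of about 5225 s (roughly 87 minutes), so with the default timeout the error is far rarer than the ~60 s MTBF I saw with the timeout set to 1, and a test has to run for hours to catch it.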

Setup of the Orin for all issues is (this should work on a freshly flashed L4T):

# Do not install linuxptp/ptp4l from APT sources
sudo apt install git build-essential 
git clone https://github.com/richardcochran/linuxptp
cd linuxptp/
make -j8 && sudo make -j8 install
sudo rsync -r configs/ /etc/linuxptp
# Edit /etc/linuxptp/automotive-slave.cfg and change delay_mechanism to E2E
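
If you prefer not to edit the config by hand, the last step can be scripted. This is a minimal Python sketch that assumes the config has a line of the form delay_mechanism followed by whitespace and a value; set_delay_mechanism is a made-up helper name, not part of linuxptp.

```python
import re

# An uncommented "delay_mechanism <value>" line; spaces or tabs may separate the two.
DELAY_LINE = re.compile(r"^([ \t]*delay_mechanism[ \t]+)\S+", flags=re.MULTILINE)


def set_delay_mechanism(cfg_text, value="E2E"):
    """Return cfg_text with every uncommented delay_mechanism line set to value."""
    new_text, count = DELAY_LINE.subn(lambda m: m.group(1) + value, cfg_text)
    if count == 0:
        # Refuse to return an unchanged config silently.
        raise ValueError("no delay_mechanism line found")
    return new_text
```

Read /etc/linuxptp/automotive-slave.cfg, pass its content through the function, and write the result back as root. Do the same with automotive-master.cfg on the other computer so that both ends agree.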

Issue 2 doesn’t even require an Ethernet cable to be connected. Just run sudo hwstamp_ctl -i eth0 -r 12 -t 1 six times and then look for the “Maximum registrations reached” error in sudo dmesg | tail.
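
The same Issue 2 reproducer as a small Python sketch, in case you want to run it repeatedly. The command and the error text are the ones quoted above; the function names are made up, and the script assumes that sudo, hwstamp_ctl and dmesg are available on the Orin.

```python
import subprocess

ERROR_TEXT = "Maximum registrations reached"  # error message quoted above
HWSTAMP_CMD = ["sudo", "hwstamp_ctl", "-i", "eth0", "-r", "12", "-t", "1"]


def count_errors(dmesg_text):
    """Count kernel log lines that contain the error message."""
    return sum(ERROR_TEXT in line for line in dmesg_text.splitlines())


def reproduce(runs=6):
    """Run hwstamp_ctl `runs` times, then return the error count found in dmesg."""
    for _ in range(runs):
        # A failing hwstamp_ctl call is part of what we want to observe,
        # so do not abort on a non-zero exit status.
        subprocess.run(HWSTAMP_CMD, check=False)
    log = subprocess.run(["sudo", "dmesg"], capture_output=True,
                         text=True, check=True)
    return count_errors(log.stdout)
```

Calling reproduce() on the device should return a non-zero count if the bug is present. Note that the count covers the whole kernel log since boot, so compare it with the value from before the run if you repeat the test.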

Issue 1 (now confirmed to be happening on devkit SKU 0 (32 GB RAM), too) requires a functional PTP network, i.e. another computer connected to the eth0 port (via a 10 Gbps link) that is running ptp4l with the automotive-master.cfg config. If the computers are directly connected, you can keep delay_mechanism at P2P on both. If you run over a switch, it is better to change delay_mechanism to E2E on both. A functioning PTP master will not output anything on the console. A functioning PTP slave will output either rms XXXX max YYYY or master offset XXXX every second.

Regarding Issue 3, I’ve now freshly flashed another module with the Forge carrier and the issue is observable there, too. So far it seems to be a problem just with the Forge carrier that I can’t replicate on the devkit, so let’s not care about it until I find out more. I’ll try flashing older JetPack versions to find the exact version in which this started happening. I know that it did not happen with L4T 35.1.

I’ve now tested all issues on a 32 GB module (SKU p3701-0004) mounted on devkit baseboard running a freshly flashed L4T from SDK Manager.

Issue 1 can be replicated.

Issue 2 can be replicated.

Issue 3 can be replicated on L4T 35.4.1, but cannot be replicated on 35.3.1. I also tried testing Issue 3 on Jetson+Forge. The issue was present on 35.4.1, it wasn’t present on 35.2.1, and 35.3.1 gave me a kernel panic in the function ether_get_time as soon as I launched ptp4l.

I’m a bit confused why I don’t see Issue 3 on devkit with devkit module running 35.4.1. Fortunately, the system we need on the devkit runs from NVMe, so I’ll try a clean flash to eMMC and see if I can replicate the issue there.

EDIT: Uh-oh, I figured out that the unpacked rootfs I patched with the ConnectTech BSP is also reused by SDK Manager when flashing the devkit. So the devkit got flashed with the BSP rootfs. I have now reflashed a clean 35.4.1 L4T on the devkit and Issue 3 can’t be replicated. So it’s apparent that Issue 3 is a ConnectTech-related issue. Feel free to ignore it. (I did one more confirmation: pure L4T doesn’t have Issue 3 on the devkit; if I just swap in the kernel Image file from the BSP and nothing else, the issue appears; so it’s definitely the BSP kernel that causes it.)

To be absolutely sure, I retested Issues 1 and 2 on the correctly flashed devkit (devkit module + devkit board) and both issues can be replicated on the vanilla rootfs.


For completeness, here’s the kernel panic log from 35.3.1 with ConnectTech Forge BSP (but I don’t expect NVidia to do anything about that):

[   76.882845] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[   76.883157] Mem abort info:
[   76.883247]   ESR = 0x96000044
[   76.883343]   EC = 0x25: DABT (current EL), IL = 32 bits
[   76.883501]   SET = 0, FnV = 0
[   76.883598]   EA = 0, S1PTW = 0
[   76.883738] Data abort info:
[   76.883823]   ISV = 0, ISS = 0x00000044
[   76.883949]   CM = 0, WnR = 1
[   76.884053] user pgtable: 4k pages, 48-bit VAs, pgdp=000000016b41f000
[   76.884241] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[   76.884440] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[   76.884610] Modules linked in: nvidia_modeset(OE) fuse lzo_rle lzo_compress zram ramoops reed_solomon loop nvgpu snd_soc_tegra186_asrc aes_ce_blk snd_soc_tegra210_ope crypto_simd snd_soc_tegra186_dspk snd_soc_tegra210_iqc snd_soc_tegra186_arad snd_soc_tegra210_mvc cryptd aes_ce_cipher snd_soc_tegra210_afc snd_soc_tegra210_adsp ghash_ce snd_soc_tegra210_admaif snd_hda_codec_hdmi sha2_ce snd_soc_tegra210_dmic snd_soc_tegra_utils snd_soc_tegra210_adx snd_soc_tegra_pcm sha256_arm64 snd_soc_tegra210_amx snd_hda_tegra snd_soc_tegra210_mixer sha1_ce snd_soc_simple_card_utils snd_soc_tegra210_sfc snd_soc_tegra210_i2s snd_hda_codec nvadsp snd_soc_spdif_tx pwm_fan snd_soc_tegra210_ahub snd_hda_core userspace_alert tegra_bpmp_thermal nct1008 ina3221 tegra210_adma spi_tegra114 nvidia(OE) binfmt_misc nvmap ip_tables x_tables [last unloaded: mtd]
[   76.919736] CPU: 0 PID: 2086 Comm: ptp4l Tainted: G           OE     5.10.104-tegra #33
[   76.927866] Hardware name: Unknown Jetson AGX Orin/Jetson AGX Orin, BIOS 3.1-32827747 03/19/2023
[   76.936791] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[   76.942849] pc : ether_get_time+0x7c/0xf0
[   76.947026] lr : ether_get_time+0x74/0xf0
[   76.951226] sp : ffff800022b6bc20
[   76.954638] x29: ffff800022b6bc20 x28: ffff3340f0bc5700
[   76.960150] x27: 0000000000000000 x26: 0000000000000000
[   76.965663] x25: 0000000000000000 x24: 0000000000000000
[   76.971177] x23: ffff3340c97e0940 x22: 0000000000000000
[   76.976689] x21: 0000000000000000 x20: ffff3340f0bc5700
[   76.982202] x19: ffff3340c97e0e50 x18: 0000000000000000
[   76.987714] x17: 0000000000000000 x16: 0000000000000000
[   76.993226] x15: 0000ffffc8c18588 x14: 0000000000000000
[   76.998651] x13: 0000000000000000 x12: 0000000000000000
[   77.004075] x11: 0000000000000000 x10: 0000000000000000
[   77.009589] x9 : 0000000000000000 x8 : 0000000000000000
[   77.014925] x7 : 0000000000000001 x6 : ffff3340caa721c0
[   77.020263] x5 : ffff3340caa72128 x4 : 3b9ac9ffc46535ff
[   77.025688] x3 : 0044b82fa09b5a53 x2 : ffffc8c8f926e170
[   77.031026] x1 : 0000000000000038 x0 : 0000000026a5ea5c
[   77.036366] Call trace:
[   77.038815]  ether_get_time+0x7c/0xf0
[   77.042329]  ptp_clock_adjtime+0x11c/0x180
[   77.046525]  pc_clock_adjtime+0x70/0xc0
[   77.050278]  do_clock_adjtime+0x68/0xb0
[   77.054040]  __do_sys_clock_adjtime+0x44/0xa0
[   77.058329]  __arm64_sys_clock_adjtime+0x28/0x40
[   77.063061]  el0_svc_common.constprop.0+0x80/0x1d0
[   77.067864]  do_el0_svc+0x38/0xb0
[   77.071110]  el0_svc+0x1c/0x30
[   77.074253]  el0_sync_handler+0xa8/0xb0
[   77.078016]  el0_sync+0x16c/0x180
[   77.081256] Code: aa1303e0 941b0b50 52800018 294803e1 (a90002a1)
[   77.087396] ---[ end trace 22bfc89ef4cd4fc4 ]---
[   77.096627] Kernel panic - not syncing: Oops: Fatal exception
[   77.097532] SMP: stopping secondary CPUs
[   77.101537] Kernel Offset: 0x48c8e90b0000 from 0xffff800010000000
[   77.107414] PHYS_OFFSET: 0xffffccc040000000
[   77.111529] CPU features: 0x0040006,4a80aa38
[   77.115903] Memory Limit: none
[   77.123588] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
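
As a reading aid for the "Mem abort info" part of the log, here is a small Python sketch that splits the ESR value into the fields the kernel prints. The bit positions are the ones I know from the Arm architecture documentation for a data abort; decode_data_abort_esr is a made-up helper, and only a subset of the fields is decoded.

```python
def decode_data_abort_esr(esr):
    """Split an AArch64 ESR value for a data abort into a few of its fields."""
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "IL": (esr >> 25) & 0x1,       # 1 = 32-bit instruction
        "ISS": iss,
        "ISV": (iss >> 24) & 0x1,      # 1 = syndrome details are valid
        "CM": (iss >> 8) & 0x1,        # 1 = cache maintenance operation
        "WnR": (iss >> 6) & 0x1,       # 1 = the faulting access was a write
        "DFSC": iss & 0x3F,            # data fault status code
    }


fields = decode_data_abort_esr(0x96000044)  # ESR value from the log above
for name, value in fields.items():
    print(f"{name} = {value:#x}")
```

For 0x96000044 this gives EC = 0x25 and WnR = 1, matching the log. Together with the reported virtual address 0000000000000000, my reading is that ether_get_time tried to write through a NULL pointer, but I haven’t checked that against the driver source.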