Nvethernet PTP bug

I found two issues running linuxptp on a Jetson AGX Orin 32 GB + Forge carrier board, using the 10 GbE port. If I read the docs correctly, the carrier board should not influence this, as it only adds a PHY, while timestamping happens in the MAC, which is integrated into the Tegra SoC.

I run the default L4T from JetPack 5.1.1 with the carrier board BSP.

1) Polling for the TX timestamp sometimes takes too long

The default setting for tx_timestamp_timeout in linuxptp is 10 milliseconds. With other NICs this is never a problem, but with the 10 GbE port it irregularly happens that this timeout is too short and timestamp polling fails (and I transmit almost nothing over the port). I’m now trying 20 ms and haven’t seen the problem yet. Can you confirm that such a long delay is expected?

When this happens, dmesg is silent and linuxptp debug log says:

ptp4l[14427.236]: rms   29 max   49 freq +16416 +/-  40 delay  8765 +/-   2
ptp4l[14428.235]: rms   36 max   65 freq +16438 +/-  45
ptp4l[14428.764]: port 1 (mgbe0): delay timeout
ptp4l[14428.774]: timed out while polling for tx timestamp
ptp4l[14428.774]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[14428.774]: port 1 (mgbe0): send delay request failed
ptp4l[14428.774]: port 1 (mgbe0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[14428.824]: waiting 2^{4} seconds to clear fault on port 1 (mgbe0)
ptp4l[14444.824]: clearing fault on port 1 (mgbe0)
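For context on what this timeout covers: linuxptp retrieves hardware TX timestamps from the socket error queue, and tx_timestamp_timeout bounds how long it polls for one. Below is a simplified sketch of that mechanism; the function name and error handling are mine for illustration, not the actual linuxptp code:

```c
#include <poll.h>

/* Simplified sketch of how linuxptp waits for a hardware TX timestamp.
 * The timestamp is delivered on the socket's error queue, and
 * tx_timestamp_timeout bounds how long poll() waits for it. */
int wait_tx_timestamp(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };

    /* A pending error-queue message is reported via POLLERR even
     * though no other events are requested. */
    int res = poll(&pfd, 1, timeout_ms);
    if (res < 1 || !(pfd.revents & POLLERR))
        return -1;  /* "timed out while polling for tx timestamp" */

    /* On success, linuxptp then calls recvmsg() with MSG_ERRQUEUE and
     * extracts the SCM_TIMESTAMPING control message. */
    return 0;
}
```

So the driver has the whole timeout window to publish the timestamp on the error queue; if it takes longer, ptp4l declares the fault seen in the log above.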

2) tegra_register_hwtime_source adds duplicate entries

I found this bug because of the previous one. When linuxptp failed to poll a timestamp, it reset the synchronization. This reset calls (among others) the SIOCSHWTSTAMP ioctl, handled by ether_handle_hwtstamp_ioctl in nvethernet/ptp.c. Whenever this ioctl handler is called, it calls tegra_register_hwtime_source from platform/tegra/ptp-notifier.c.

This function has a 5-item buffer for callbacks and net devices. However, when it is called multiple times for the same device, it doesn’t detect that it already has a callback for this device and adds a new one. After 5 calls the callback buffer is full, and this starts appearing in the log:

[14551.813229] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[14551.813259] Maximum registrations reached

I think the function should check if it already has a callback for the given netdev, and if it does, it should replace it.
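To illustrate the suggested fix, here is a minimal userspace sketch; the structure and function names are illustrative, not the actual kernel code:

```c
#include <stddef.h>

#define MAX_HWTIME_SOURCES 5

/* Hypothetical model of the ptp-notifier registration table. */
struct hwtime_source {
    void *ndev;                   /* net device the callback belongs to */
    int (*get_time)(void *ndev);  /* timestamp callback */
};

static struct hwtime_source sources[MAX_HWTIME_SOURCES];
static int num_sources;

/* Suggested behavior: replace an existing entry for the same net
 * device instead of appending a duplicate. */
int register_hwtime_source(void *ndev, int (*cb)(void *))
{
    int i;

    /* If this device is already registered, just update its callback. */
    for (i = 0; i < num_sources; i++) {
        if (sources[i].ndev == ndev) {
            sources[i].get_time = cb;
            return 0;
        }
    }

    if (num_sources >= MAX_HWTIME_SOURCES)
        return -1;  /* "Maximum registrations reached" */

    sources[num_sources].ndev = ndev;
    sources[num_sources].get_time = cb;
    num_sources++;
    return 0;
}
```

With this change, repeated SIOCSHWTSTAMP ioctls on the same interface would never exhaust the buffer.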

To replicate issue 2:

sudo hwstamp_ctl -i mgbe1 -r 12 -t 1  # Run this command several times

hwstamp_ctl is part of linuxptp (GitHub: richardcochran/linuxptp, the user-space PTP stack for GNU/Linux).

Update to issue #1: even with a 20 ms timeout, I got one timed-out poll after approx. 1 hour.

Could you share the commands to reproduce issue 1 too?

To replicate issue 1, download the master branch of linuxptp on two computers, build, and install it. Copy the configs folder to /etc/linuxptp. Connect the two computers to the same wired network.

Run this command on the other (PTP-capable) computer:

sudo ptp4l -H -i eno1 -f /etc/linuxptp/automotive-master.cfg --step_threshold=0.001 -m -l6

And run this on the Orin:

sudo ptp4l -H -i mgbe0 -f /etc/linuxptp/automotive-slave.cfg --step_threshold=0.001 -m -l7

You can try setting different values of tx_timestamp_timeout in automotive-slave.cfg; the default is 10. With 10 or 5 ms, the timeout occurs much more often.
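For reference, this is the fragment I change in automotive-slave.cfg; tx_timestamp_timeout is a [global] option with the value in milliseconds:

```
[global]
tx_timestamp_timeout	20
```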


I let linuxptp run over the weekend with a 20 ms timeout, and here’s the log of events from when timestamp polling timed out. Looking at the timestamps at the beginning of the lines, you can tell how often polling fails - approx. once every 5000 seconds. linuxptp is configured by default to send one delay request per second, and delay request packets are exactly the ones that need to poll the TX timestamp. So, generalizing a bit, it seems that with a 20 ms timeout there is a ~1/5000 chance that a TX timestamp poll will fail.

[27886.568202] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[27886.568261] Maximum registrations reached
[37519.798199] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[37519.798258] Maximum registrations reached
[43891.268918] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[43891.268978] Maximum registrations reached
[48917.816830] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[48917.816897] Maximum registrations reached
[53400.743378] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[53400.743443] Maximum registrations reached
[58950.664277] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[58950.664339] Maximum registrations reached
[64958.753548] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[64958.753607] Maximum registrations reached
[67624.451120] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[67624.451184] Maximum registrations reached
[80001.141025] Maximum registrations reached
[86908.296045] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[86908.296105] Maximum registrations reached
[88827.543319] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[88827.543385] Maximum registrations reached
[91939.849184] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[91939.849240] Maximum registrations reached
[105350.831237] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[105350.831270] Maximum registrations reached
[105472.919689] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[105472.919747] Maximum registrations reached
[116061.185679] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[116061.185735] Maximum registrations reached
[118566.006938] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[118566.007006] Maximum registrations reached
[120003.362164] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[120003.362229] Maximum registrations reached
[121265.013041] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[121265.013103] Maximum registrations reached
[140214.452390] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[140214.452449] Maximum registrations reached
[146416.118631] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[146416.118693] Maximum registrations reached
[148250.666738] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[148250.666795] Maximum registrations reached
[150054.758534] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[150054.758600] Maximum registrations reached
[153808.534033] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[153808.534066] Maximum registrations reached
[173535.119174] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[173535.119237] Maximum registrations reached
[181168.894124] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[181168.894183] Maximum registrations reached
[184647.896640] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[184647.896703] Maximum registrations reached
[192097.708232] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[192097.708296] Maximum registrations reached
[196066.864019] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[196066.864078] Maximum registrations reached
[198730.586151] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[198730.586215] Maximum registrations reached
[206010.531109] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[206010.531174] Maximum registrations reached
[220313.686634] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[220313.686691] Maximum registrations reached
[220341.278806] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[220341.278872] Maximum registrations reached
[226589.942661] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[226589.942721] Maximum registrations reached
[229599.062060] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[229599.062117] Maximum registrations reached
[229921.762971] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[229921.763032] Maximum registrations reached
[234684.570617] nvethernet 6810000.ethernet: config.flags = 0x0, tx_type = 0x1,rx_filter = 0xc
[234684.570670] Maximum registrations reached

I’m seeing the same issue, but not even 20ms works that well. Was there ever a resolution here?

@WayneWWW I’m sad this bug report hasn’t been added to the L4T release notes - neither under known issues nor under fixed issues.

Issue 2 has a complete analysis; now it only awaits about 10 minutes of work from your kernel developers.

Issue 1 needs more investigation on your side. It is definitely very unusual that TX packet timestamps cannot be read back in < 20 ms.

Now I’ve updated to L4T 35.4.1 and started seeing a new issue in linuxptp (new Issue 3):

clockcheck: clock frequency changed unexpectedly!

This happens when the freq value written by a clock_adjtime() call doesn’t match what is read back later.

I’ve added a few debug prints to clockcheck.c (linuxptp, richardcochran/linuxptp on GitHub) to print the last value set and the value read back. Here’s the result:

clockcheck: clock frequency changed unexpectedly! 520300 1036783
clockcheck: clock frequency changed unexpectedly! 371436 1036783
clockcheck: clock frequency changed unexpectedly! 247989 1036784
clockcheck: clock frequency changed unexpectedly! 142558 1036784
clockcheck: clock frequency changed unexpectedly! -25609 1036784
clockcheck: clock frequency changed unexpectedly! -90039 1036784

The values read back increase by approximately 1 each second. Given that this is a frequency offset, that is super weird!

The code that handles the calls to clock_adjtime() is here:

This was definitely not happening on 35.2.1. What I can google about this error suggests that something else is controlling the same clock. However, I’m pretty sure that isn’t the case here. The ptp4l daemon runs on mgbe0, which is tied to clock /dev/ptp2. I don’t run anything else using this clock, and sudo lsof /dev/ptp2 confirms that.

On the other hand, linuxptp manages to sync the clock pretty well (~80 ns), so it seems that only the reported freq value is wrong, while the clock actually reacts to the given adjustment commands.
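For clarity, this is roughly what the clockcheck logic in linuxptp does; the snippet below is a simplified userspace model with illustrative names, not the actual clockcheck.c code. It remembers the frequency last written via clock_adjtime() and warns when the next read-back differs:

```c
/* Simplified model of the check behind the
 * "clock frequency changed unexpectedly!" warning. Values are in ppb. */
struct clockcheck {
    int freq_known;       /* have we set a frequency yet? */
    long last_set_freq;   /* last value written with clock_adjtime() */
};

/* Record the frequency the servo just wrote. */
void clockcheck_set_freq(struct clockcheck *cc, long freq)
{
    cc->last_set_freq = freq;
    cc->freq_known = 1;
}

/* Returns 1 when the frequency read back from the kernel does not
 * match what was last set - the condition producing the warning. */
int clockcheck_freq_changed(const struct clockcheck *cc, long readback)
{
    return cc->freq_known && readback != cc->last_set_freq;
}
```

In my logs above, the left number is what was last set and the right one is the read-back, so this check fires on every sample even though the clock behaves as if the adjustments were applied.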

Hi @peckama2

Sorry, that was my mistake. This was not recorded in my list, so I forgot to track it.

I will use rel-35.4.1 to check this issue.

BTW, what is the method to reproduce issue 3?

Thanks, the repro for Issue 3 is the same as for Issue 2.


But if the way to reproduce it is just running this command several times, do you mean you sometimes hit issue 3 and sometimes issue 2?

Ah, I’m sorry, I meant the reproducer for Issue 1: running a PTP slave on the Jetson and a PTP master on some other computer connected to the mgbe0 port.

Could you help clarify how long it takes to reproduce these issues (both issue 1 and issue 2)?

We tried to reproduce them, but so far we haven’t seen either.

Thank you for trying.

Issue 2 should appear after 5 or 6 runs of the command. You don’t even need a cable connected to the port.

Issue 1 is trickier. After the update to 35.4.1, I haven’t seen it happen (yet) even with the default 10 ms timeout. However, if you first just want to verify that you can get the same error, decrease tx_timestamp_timeout to 1 ms and you should get the error within a minute. It also seems to be tied to the link speed: on a 10 Gbps link, even with a 1 ms timeout it takes about a minute to get the error, while on a 1 Gbps link I got the error almost instantly (~5 seconds). Unfortunately, I don’t remember how often I got the error with the 10 ms timeout when I reported the issue; it definitely wasn’t too often. You can see in Nvethernet PTP bug - #6 by peckama2 that with a 20 ms timeout, the probability of an error was ~1 in 5000 seconds.

Issue 3 appears instantly.

OK, maybe I will start with issue 2 first, as you already provided the code analysis (many thanks for that).

Regarding issue 1, I’ve now started it with the 10 ms (default) timeout on a 1 Gbps link, and turned off the errors caused by Issue 3 to get a clearer view of the log. I can see that the error is still there, approx. every 500 seconds.

On a 10 Gbps link I’ve registered two errors so far: one after 4000 seconds, the other after another 2000 seconds.

Since the default interface name is eth0, shall I run your command as

sudo hwstamp_ctl -i eth0 -r 12 -t 1

And with a port-to-port connection of two Ethernet ports?

Hmm, I don’t know what you mean. These bug reports are about the 10 GbE ports. I don’t know how they’re named on devkits, but I see the names mgbe0 and mgbe1 (along with eth0 and eth1 representing the 1 GbE ports).

Hi,

We actually don’t have 2 MGBE ports on the boards, only one. I mean there is actually no support for multiple MGBE ports on Orin AGX; this is mentioned in the design guide document.

If you want us to fix this issue, please reproduce it on an NV devkit and tell us the method to reproduce.