Hello,
We have verified that the tx_timestamp_timeout issue still occurs with the timeout set to NVIDIA's recommended value, albeit infrequently. Does NVIDIA have any plans for a fix or mitigation?
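For reference, the relevant setting is the tx_timestamp_timeout entry in the [global] section of the ptp4l config; the rest of the file is omitted here for brevity:

[global]
# time in milliseconds ptp4l will wait for the driver to return a TX timestamp;
# 1000 ms is the value recommended to us
tx_timestamp_timeout    1000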
Hello,
This has moved from "a mild annoyance" to "an active blocking issue" - please respond: what is NVIDIA's plan to fix this?
We are currently checking with our team on this and will keep you informed with any updates. Your patience is appreciated. Thank you.
Hello,
It's been a week since the last update - has the team made any progress?
I tried the following command using the gPTP_slave.cfg configuration file and did not observe the issue you mentioned in MGBE PTP Timestamp Timeout - #12 by dbennington1:
sudo ptp4l -f ./gPTP_slave.cfg -p /dev/ptp4 -i mgbe1_0 -m -l7
Could you please provide more details on how you are running ptp4l with the gPTP.cfg and gPTP_slave.cfg files? This additional information will help me better understand the issue and assist you further.
Hello,
The commands are exactly as you have them. As mentioned, it is rare, it needs to be run for multiple hours for it to occur, and appears to happen more readily when the interface is under significant load - I would recommend running iperf while running ptp4l.
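For example, something along these lines alongside the ptp4l command above (the peer address and bandwidth target are placeholders - any traffic generator that sustains a comparable load on the PTP interface works just as well):

# on the link partner
iperf3 -s
# on the device running ptp4l, pushing sustained UDP traffic across the PTP interface
iperf3 -c <peer-ip> -u -b 1G -t 36000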
Does the issue resolve itself after some time? It's worth noting that under heavy network loads, issues like this can occur on other platforms as well, correct?
Hello,
It depends on what you mean by "the issue" - you can increase the frequency of its failure by simply removing the tx_timestamp_timeout line from the config. By default, it kills PTP timesync for 16 seconds, resets the interface, causes time to jump, and then re-initializes and continues on. Losing timesync in this way is not appropriate in a realtime system, even if it comes back. It can and often does happen repeatedly at random intervals - it is not a single error.
"Under heavy network loads" refers to a load on the order of 1 Gbps on a 10 Gbps interface. Driver failures in the PTP timesync component are not common on other platforms when the link is ~10% utilized, and on any platform this sort of failure would be considered a serious fault of the PTP system.
I'm not quite sure what argument you are trying to make here - that PTP is expected to fail?
I can probably shed a bit more light on the exact failure mode here. Very specifically, this error occurs when ptp4l queries the interface for the timestamp associated with a packet. The driver appears to store timestamps in a linked list, and it iterates through this list until it finds the timestamp that matches the packet in question, discarding stale entries as it goes (presumably because entries for packets whose timestamps were never queried are no longer needed). ptp4l waits tx_timestamp_timeout milliseconds for the driver to return a timestamp, at which point it faults and resets the interface. Since NVIDIA sets this timeout to 1000 ms (1 second), the implication is that the driver spends more than 1 second traversing the timestamp linked list in search of the matching entry. Given that 1 second is an eternity for a linked-list traversal, it is probable that the list corrupts itself once it grows long. Assuming the packets are received (which is definitely the case), the driver should always be able to provide a timestamp, regardless of load - the interface should start dropping packets before timestamp fetching starts failing.
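To make the failure mode concrete, here is a rough sketch of the lookup pattern I am describing. This is only my mental model of the behavior, not the actual MGBE/nvethernet driver source - every name and type below is invented for illustration:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of the per-packet TX timestamp list the driver appears
 * to keep. Field and function names are made up, not taken from the driver. */
struct tx_ts_entry {
    uint64_t pkt_id;          /* identifies the transmitted packet */
    uint64_t timestamp_ns;    /* hardware TX timestamp for that packet */
    struct tx_ts_entry *next;
};

/* Walk the list looking for pkt_id, discarding stale entries on the way.
 * ptp4l waits at most tx_timestamp_timeout ms for this lookup to yield a
 * result; if it does not, the port faults, the interface resets, and
 * timesync is lost until re-initialization completes. */
static uint64_t lookup_tx_timestamp(struct tx_ts_entry **head, uint64_t pkt_id)
{
    struct tx_ts_entry *e = *head;
    while (e) {
        if (e->pkt_id == pkt_id)
            return e->timestamp_ns;
        /* stale entry: its timestamp was never fetched, so drop it */
        *head = e->next;
        free(e);
        e = *head;
    }
    return 0; /* not found - ptp4l eventually hits tx_timestamp_timeout */
}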
Are you saying that your environment is subject to a 1 Gbps network load on the 10 Gbps interface, and you've noticed the issue occurring under these conditions? Could you provide more details about how frequently this issue arises, typically after how many hours of operation? Additionally, have you noticed whether the issue still occurs when the network load is not present?
Please perform a test by directly connecting two devkits and running the same commands I used with iperf to generate a 1 Gbps network load. This will help us determine if the issue is reproducible under these controlled conditions.
Hello,
That was indeed what I was saying: with average network loads of approximately 1 Gbps, this issue would happen, and with startling frequency - approximately once every 15 minutes.
I have performed the iperf test and was unable to replicate the issue - I am currently attempting to recreate the conditions under which this fault occurred using standard test software. The original environment carried many, many multicast packets, so it is possible the trigger lies there. I maintain that the driver should never fail to deliver a timestamp while still being responsive to user input, but it is clear that the conditions under which this occurred are highly specific - please stand by while I attempt to establish a minimal set of conditions under which this occurs.