Nvethernet 6800000.ethernet: [xpcs_lane_bring_up][470][type:0x4][loga-0x0] Failed to get PCS block lock

Hi, nvidia experts,

Our platform: AGX Orin 32G with Jetson Linux 36.4, customer carrier board.
Also we connect the Orin to a 10G Marvell 6113 switch chip. The hardware topology is:
Orin MAC <------------> 6113 switch chip’s MAC

Test steps: we placed 15 of our boards in the environment below:
Ambient temperature: 75 °C; SOM internal temperatures were approximately:
|CPU Temp (°C)|91.951|
|GPU Temp (°C)|87.823|
|SOC Temp (°C)|91.955|

Issue: 3 of the 15 boards report “nvethernet 6800000.ethernet: [xpcs_lane_bring_up][470][type:0x4][loga-0x0] Failed to get PCS block lock” at boot, and the link between the Orin and the 6113 switch stays down, so the network on these 3 boards is unavailable.
The issue cannot be recovered by rebooting; it only clears after the 3 boards are moved back to a normal-temperature environment.

According to the error logs, the code is here:

So, can we optimize xpcs_check_pcs_lock_status() to work around this issue?

thanks

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, here is an answer to the question.

The issue described is related to the NVIDIA Jetson AGX Orin platform, specifically the 10G Ethernet connection to a Marvell 6113 switch chip. The problem occurs when the system is exposed to a 75 °C ambient, causing 3 out of 15 boards to fail to establish a link between the Orin MAC and the 6113 switch chip’s MAC. The error message “nvethernet 6800000.ethernet: [xpcs_lane_bring_up][470][type:0x4][loga-0x0] Failed to get PCS block lock” is reported, and the network becomes unavailable on these boards.

To address this issue, the question suggests optimizing the xpcs_check_pcs_lock_status() function. However, before diving into potential optimizations, let’s analyze the situation:

  1. Temperature: The high temperature (75°C) is likely a contributing factor to the issue. It’s essential to ensure that the system is operating within its specified temperature range.
  2. Hardware topology: The connection between the Orin MAC and the 6113 switch chip’s MAC is a critical aspect of the system. Verify that the hardware setup is correct and that there are no issues with the physical connection.
  3. Software: The error message suggests a problem with the xpcs_check_pcs_lock_status() function, which is part of the NVIDIA Ethernet driver. It’s possible that the driver is not properly handling the temperature-related stress.

Comparing this issue to other context topics:

  • DGX-043 DGX H100/H200/B200 After upgrading to BMC firmware 24.09.17 Fans will run at a higher speed: This topic discusses a similar temperature-related issue on a different NVIDIA platform (DGX). Although the specific problem is different, it highlights the importance of temperature management in NVIDIA systems.
  • MLNX_EN (Software): This topic discusses the NVIDIA Ethernet driver, which is relevant to the issue at hand. Understanding the driver’s behavior and potential limitations in high-temperature environments might be helpful in resolving the problem.

To optimize the xpcs_check_pcs_lock_status() function, you may want to consider the following:

  1. Temperature monitoring: Implement temperature monitoring to detect when the system is operating outside its specified temperature range. This can help identify potential issues before they occur.
  2. Error handling: Improve error handling within the xpcs_check_pcs_lock_status() function to better manage temperature-related stress. This might involve implementing retries, timeouts, or other mechanisms to recover from errors.
  3. Driver updates: Verify that the NVIDIA Ethernet driver is up-to-date, as newer versions may include improvements for high-temperature operation.
  4. Hardware validation: Validate the hardware setup to ensure that the connection between the Orin MAC and the 6113 switch chip’s MAC is correct and functioning properly.

In conclusion, while optimizing the xpcs_check_pcs_lock_status() function might be a potential solution, it’s essential to consider the broader system context, including temperature management, hardware setup, and software driver behavior. A comprehensive approach will help ensure that the issue is properly addressed and that the system operates reliably in various environmental conditions.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.