Our platform: AGX Orin 32G with Jetson Linux 36.4 on a custom carrier board.
The Orin is connected to a 10G Marvell 6113 switch chip. The hardware topology is:
Orin MAC <------------> 6113 switch chip’s MAC
Test procedure: We placed 15 of our boards in the following environment:
Ambient temperature: 75 °C. The SOM internal temperatures are approximately:
|CPU Temp (°C)|91.951|
|GPU Temp (°C)|87.823|
|SOC Temp (°C)|91.955|
Issue: 3 of the 15 boards report “nvethernet 6800000.ethernet: [xpcs_lane_bring_up][470][type:0x4][loga-0x0] Failed to get PCS block lock” errors at boot. The link between the Orin and the 6113 switch stays down, so the network on these 3 boards is unavailable.
Rebooting does not recover these boards; the issue clears only after we move them to a normal-temperature environment.
*** Please note that this reply is generated by LLM automatically ***
Based on the provided context, I’ll provide a comprehensive answer to the question.
The issue described is related to the NVIDIA Jetson AGX Orin platform, specifically to the 10G Ethernet connection to a Marvell 6113 switch chip. The problem occurs at a high ambient temperature (75°C), where 3 out of 15 boards fail to establish a link between the Orin MAC and the 6113 switch chip’s MAC. The error message “nvethernet 6800000.ethernet: [xpcs_lane_bring_up][470][type:0x4][loga-0x0] Failed to get PCS block lock” is reported, and the network becomes unavailable on these boards.
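When checking many boards, it can help to detect this error automatically in the kernel log. The following is a minimal sketch, assuming the log line always has the shape quoted above. The function name `find_pcs_lock_failures` and the field names are illustrative, and reading the second bracketed field (470) as a source line number is an assumption.

```python
import re

# Pattern for the nvethernet log line quoted in the report:
#   nvethernet <device>: [<function>][<number>][type:<hex>][loga-<hex>] <message>
# Treating <number> as a source line number is an assumption.
PCS_LOCK_RE = re.compile(
    r"nvethernet (?P<device>\S+): "
    r"\[(?P<function>\w+)\]\[(?P<line>\d+)\]"
    r"\[type:(?P<type>0x[0-9a-fA-F]+)\]\[loga-(?P<loga>0x[0-9a-fA-F]+)\] "
    r"(?P<message>Failed to get PCS block lock)"
)


def find_pcs_lock_failures(log_text):
    """Return one dict per PCS block lock failure found in kernel log text."""
    return [match.groupdict() for match in PCS_LOCK_RE.finditer(log_text)]


# The message from the report, used here as sample input.
sample = (
    "nvethernet 6800000.ethernet: [xpcs_lane_bring_up][470]"
    "[type:0x4][loga-0x0] Failed to get PCS block lock"
)
for hit in find_pcs_lock_failures(sample):
    print(hit["device"], hit["function"], hit["message"])
```

In practice the input would be the saved output of `dmesg` from each board. The pattern is searched anywhere in a line, so a leading timestamp does not prevent a match.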
To address this issue, the question suggests optimizing the xpcs_check_pcs_lock_status() function. However, before diving into potential optimizations, let’s analyze the situation:
Temperature: The high temperature (75°C) is likely a contributing factor to the issue. It’s essential to ensure that the system is operating within its specified temperature range.
Hardware topology: The connection between the Orin MAC and the 6113 switch chip’s MAC is a critical aspect of the system. Verify that the hardware setup is correct and that there are no issues with the physical connection.
Software: The error is logged from xpcs_lane_bring_up (per the log prefix) in the NVIDIA Ethernet driver (nvethernet), and it points at the PCS block lock check, xpcs_check_pcs_lock_status(). It’s possible that the driver does not handle the temperature-related stress properly.
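To check the temperature point above on a running board, the standard Linux thermal sysfs interface can be read directly. This is a minimal sketch, assuming the thermal zones are exposed under `/sys/class/thermal` with `type` and `temp` files (`temp` in millidegrees Celsius). The function names are illustrative, zone names vary by platform, and any limit passed in is a placeholder to be replaced with the value from the module's datasheet.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temperature_in_celsius} from Linux thermal sysfs."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone unreadable or reporting no value; skip it
        readings[name] = millideg / 1000.0
    return readings


def zones_over(readings, limit_c):
    """Return the zones whose temperature exceeds limit_c (a caller-chosen limit)."""
    return {name: temp for name, temp in readings.items() if temp > limit_c}
```

Calling `zones_over(read_thermal_zones(), limit_c)` at boot, before the Ethernet link comes up, would show whether the failing boards run hotter than the passing ones.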
Comparing this issue to other context topics:
DGX-043 DGX H100/H200/B200 After upgrading to BMC firmware 24.09.17 Fans will run at a higher speed: This topic discusses a similar temperature-related issue on a different NVIDIA platform (DGX). Although the specific problem is different, it highlights the importance of temperature management in NVIDIA systems.
MLNX_EN (Software): This topic covers MLNX_EN, the Ethernet driver for NVIDIA (Mellanox) ConnectX adapters. It is a different driver from the nvethernet driver that reports the error here, so it is relevant only as general background on Ethernet driver behavior in high-temperature environments.
To optimize the xpcs_check_pcs_lock_status() function, you may want to consider the following:
Temperature monitoring: Implement temperature monitoring to detect when the system is operating outside its specified temperature range. This can help identify potential issues before they occur.
Error handling: Improve error handling within the xpcs_check_pcs_lock_status() function to better manage temperature-related stress. This might involve implementing retries, timeouts, or other mechanisms to recover from errors.
Driver updates: Verify that the NVIDIA Ethernet driver is up-to-date, as newer versions may include improvements for high-temperature operation.
Hardware validation: Validate the hardware setup to ensure that the connection between the Orin MAC and the 6113 switch chip’s MAC is correct and functioning properly.
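The retry and timeout idea from the error-handling point can be sketched as follows. This is only an illustration of the control flow in Python, not the driver's code (the actual driver is kernel C). The callbacks `reset_lane` and `check_lock` are hypothetical stand-ins for whatever resets the lane and reads the lock status, and the attempt count and timeouts are placeholder values.

```python
import time


def poll_until(check, timeout_s, interval_s):
    """Call check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def bring_up_with_retries(reset_lane, check_lock, attempts=3,
                          timeout_s=0.5, interval_s=0.01):
    """Reset the lane and poll for lock, up to `attempts` times.

    Returns the attempt number that succeeded, or None if all attempts failed.
    reset_lane and check_lock are hypothetical callbacks supplied by the caller.
    """
    for attempt in range(1, attempts + 1):
        reset_lane()
        if poll_until(check_lock, timeout_s, interval_s):
            return attempt
    return None
```

Since the report states that rebooting does not recover the link while the boards are hot, retries alone may not be sufficient; a longer timeout or repeated attempts would mainly show whether the lock is slow to arrive or never arrives at that temperature.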
In conclusion, while optimizing the xpcs_check_pcs_lock_status() function might be a potential solution, it’s essential to consider the broader system context, including temperature management, hardware setup, and software driver behavior. A comprehensive approach will help ensure that the issue is properly addressed and that the system operates reliably in various environmental conditions.