ConnectX-6 MCX653105A-HDAT firmware internal error

Is there any way to resolve this?

Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: poll_health:1087:(pid 1786): device's health compromised - reached miss count
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:513:(pid 1786): Health issue observed, firmware internal error, severity(3) ERROR:
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:517:(pid 1786): assert_var[0] 0x00000000
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:517:(pid 1786): assert_var[1] 0x00000000
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:517:(pid 1786): assert_var[2] 0x00000000
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:517:(pid 1786): assert_var[3] 0x00000000
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:517:(pid 1786): assert_var[4] 0x00000000
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:517:(pid 1786): assert_var[5] 0x00000000
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:519:(pid 1786): assert_exit_ptr 0x209f1660
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:520:(pid 1786): assert_callra 0x209f8520
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:522:(pid 1786): fw_ver 20.39.3560
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:523:(pid 1786): time 0
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:524:(pid 1786): hw_id 0x0000020f
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:525:(pid 1786): rfr 0
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:526:(pid 1786): severity 3 (ERROR)
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:527:(pid 1786): irisc_index 5
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:529:(pid 1786): synd 0x1: firmware internal error
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:530:(pid 1786): ext_synd 0x8a02
Jul 05 15:07:34 centosc9 kernel: mlx5_core 0000:01:00.0: print_health_info:531:(pid 1786): raw fw_ver 0x14270de8
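
When comparing reports like this one, it can help to confirm that the raw fw_ver word on the last line encodes the same version as the fw_ver 20.39.3560 line. The sketch below assumes, based only on this log and not on driver documentation, that the top byte is the major version, the next byte the minor version, and the low 16 bits the sub-minor version. decode_fw_ver is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: split a raw 32-bit fw_ver word into major.minor.subminor.
# Assumed layout (inferred from the log above, not from driver documentation):
#   bits 31-24 = major, bits 23-16 = minor, bits 15-0 = sub-minor.
decode_fw_ver() {
    raw=$(( $1 ))
    printf '%d.%d.%d\n' \
        $(( (raw >> 24) & 0xff )) \
        $(( (raw >> 16) & 0xff )) \
        $(( raw & 0xffff ))
}

# Raw value taken from the "raw fw_ver" line of the log.
decode_fw_ver 0x14270de8
```

For the value in this log the function prints 20.39.3560, which agrees with the fw_ver line reported by the driver.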

Dear Customer,

Thank you for reaching out to the NVIDIA Community.

To address your inquiry, please follow the steps below to update your device firmware:

Using flint to Update Mellanox Firmware

The flint utility is a command-line tool provided by Mellanox (NVIDIA) for burning (updating) firmware on ConnectX and other Mellanox network adapters. Below are the essential steps and usage examples for updating firmware with flint.

1. Preparation

  • Install Mellanox Firmware Tools (MFT):
    • Download and install the MFT package from the official NVIDIA Networking support site.
  • Download Firmware:
    • Obtain the correct firmware image (usually a .bin file) for your specific adapter model and PSID from the NVIDIA support site.
  • Identify the Device:
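    • After starting the MST service with mst start, use mst status to list Mellanox devices and get the device name (e.g., /dev/mst/mt4119_pci_cr0). The name on your system depends on the adapter model, so use the one that mst status reports.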

2. Basic flint Commands

Query Current Firmware Version

flint -d <device_name> q
  • Example:
    flint -d /dev/mst/mt4119_pci_cr0 q

Burn (Update) Firmware

flint -d <device_name> -i <firmware_file>.bin burn
  • Example:
    flint -d /dev/mst/mt4119_pci_cr0 -i fw-4119-rel-28_37_1014.bin burn
  • The burn command writes the new firmware to the device.

Verify Firmware Version After Update

flint -d <device_name> q
  • Confirm the firmware version matches the new image.
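
The verification step can be scripted so that a mismatch is hard to overlook. This is a minimal sketch that assumes the query output contains a line beginning with "FW Version:" followed by the version number; check that against the output of your MFT release. fw_version_from_query and check_fw_version are hypothetical helper names, and the sample text at the end is illustrative input, not captured flint output.

```shell
#!/bin/sh
# Hypothetical helper: read `flint ... q` output on stdin and print the value
# of the first line whose first two fields are "FW" and "Version:".
fw_version_from_query() {
    awk '$1 == "FW" && $2 == "Version:" { print $3; exit }'
}

# Hypothetical helper: compare the version found on stdin with the expected one.
check_fw_version() {
    expected=$1
    actual=$(fw_version_from_query)
    if [ "$actual" = "$expected" ]; then
        echo "OK: firmware is $actual"
    else
        echo "MISMATCH: expected $expected, found ${actual:-nothing}"
        return 1
    fi
}

# On a real system (device name is a placeholder):
#   sudo flint -d <device_name> q | check_fw_version <expected_version>

# Self-contained demo using sample text in the assumed format:
printf 'Image type:            FS4\nFW Version:            20.39.3560\n' |
    check_fw_version 20.39.3560
```

The function returns a non-zero status on a mismatch, so it can gate later steps in a larger script.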

3. Step-by-Step Firmware Update Procedure

| Step | Command/Action | Notes |
|---|---|---|
| Start MST service | mst start | Initializes Mellanox device support |
| List devices | mst status | Find the correct device name |
| Unzip firmware | unzip <firmware_file>.zip | If firmware is zipped |
| Burn firmware | flint -d <device_name> -i <firmware_file>.bin burn | Main update step |
| Reboot | reboot | Required for update to take effect |
| Verify | flint -d <device_name> q | Check new firmware version |

4. Common Options and Flags

  • -d <device_name>: Specifies the Mellanox device.
  • -i <image_file>: Specifies the firmware binary image.
  • burn: Command to write the firmware.
  • q: Query device for current firmware and attributes.
  • -y: Non-interactive mode (assume “yes” to prompts).

5. Example Full Workflow

Start MST service

sudo mst start

List Mellanox devices

sudo mst status

Burn firmware (replace with your device and firmware file)

sudo flint -d /dev/mst/mt4119_pci_cr0 -i fw-4119-rel-28_37_1014.bin burn

Reboot system

sudo reboot

Verify firmware version

sudo flint -d /dev/mst/mt4119_pci_cr0 q
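
The burn step of the workflow above can be wrapped in a small guard that checks the image file exists and, by default, only prints the command it would run. This is a sketch under stated assumptions: burn_firmware and the DRY_RUN variable are hypothetical names introduced here, and the device and image arguments are placeholders you supply. It uses the -y flag listed under Common Options so the real run does not stop at the confirmation prompt.

```shell
#!/bin/sh
# Hypothetical wrapper around the burn step.
#   $1 = MST device name as reported by `mst status`
#   $2 = path to the firmware .bin image
# DRY_RUN defaults to 1, in which case the flint command is printed, not run.
burn_firmware() {
    device=$1
    image=$2
    if [ ! -f "$image" ]; then
        echo "error: firmware image not found: $image" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: flint -d $device -i $image -y burn"
    else
        flint -d "$device" -i "$image" -y burn
    fi
}

# Example (placeholders, prints the command only):
#   burn_firmware <device_name> <firmware_file>.bin
# To actually burn, run as root with DRY_RUN=0.
```

Defaulting to a dry run is a deliberate choice here: burning the wrong image or targeting the wrong device is hard to undo, so the command is shown for review first.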

Note: Please download the latest firmware from the link below:
https://network.nvidia.com/support/firmware/firmware-downloads/

If you have any questions or encounter any issues during the update process, please do not hesitate to contact us.

Best regards,
NVIDIA Support

That did not solve the problem.

Is there any other solution?

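Dear Customer,

Please follow the instructions above to upgrade the firmware. If you run into any problems during the upgrade, feel free to contact us at any time through the following channels:

Thank you for your support of NVIDIA!
NVIDIA Technical Support Team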