BF3 Interface Issue

Hello community!

I have an issue with the BF3 card on my GH200 server.

Even though I didn’t change any configs on my server, the BF3 interfaces aerial00 and aerial01 disappeared after I rebooted the server, as shown below.
(It seems the two BF3 interfaces have been wiped out.)

What should I reconfigure or check to solve this issue?

Thanks

Hi @tojsm ,

Could you give me:

  1. The console output of the following commands.
$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
  2. The output file of the following command, which should already be installed if you installed the DOCA host-repo as documented.
$ sysinfo-snapshot.py

Thank you.

Dear @nhashimoto, thanks for your support.

Please see below.

**Command #1**

root@SKT-6GTB-ARS:~# sudo mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
[warn] mst_pciconf is already loaded, skipping
Create devices
Unloading MST PCI module (unused) - Success


root@SKT-6GTB-ARS:~# sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q | grep "CQE_COMPRESSION\|PROG_PARSE_GRAPH\|ACCURATE_TX_SCHEDULER\|FLEX_PARSER_PROFILE_ENABLE\|REAL_TIME_CLOCK_ENABLE\|INTERNAL_CPU_MODEL\|LINK_TYPE_P1\|LINK_TYPE_P2\|INTERNAL_CPU_PAGE_SUPPLIER\|INTERNAL_CPU_ESWITCH_MANAGER\|INTERNAL_CPU_IB_VPORT0\|INTERNAL_CPU_OFFLOAD_ENGINE"
        INTERNAL_CPU_MODEL                          EMBEDDED_CPU(1)
        INTERNAL_CPU_PAGE_SUPPLIER                  EXT_HOST_PF(1)
        INTERNAL_CPU_ESWITCH_MANAGER                EXT_HOST_PF(1)
        INTERNAL_CPU_IB_VPORT0                      EXT_HOST_PF(1)
        INTERNAL_CPU_OFFLOAD_ENGINE                 DISABLED(1)
        FLEX_PARSER_PROFILE_ENABLE                  4
        PROG_PARSE_GRAPH                            True(1)
        ACCURATE_TX_SCHEDULER                       True(1)
        CQE_COMPRESSION                             AGGRESSIVE(1)
        REAL_TIME_CLOCK_ENABLE                      True(1)
        LINK_TYPE_P1                                ETH(2)
        LINK_TYPE_P2                                ETH(2)
root@SKT-6GTB-ARS:~#

**Command #2: output file**
sysinfo-snapshot-v3.7.7-SKT-6GTB-ARS-20241119-214015.tgz.zip (12.1 MB)

**Command #2: terminal log**
root@SKT-6GTB-ARS:~# sysinfo-snapshot.py
/usr/sbin/sysinfo-snapshot.py:21: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  from distutils.version import LooseVersion
Sysinfo-snapshot is still in process...please wait till completed successfully
Gathering the information may take a while, especially in large networks
Your patience is appreciated

+------------------------------------------------------------

Running sysinfo-snapshot has ended successfully!
Temporary destination directory is /tmp/
Out file name is /tmp/sysinfo-snapshot-v3.7.7-SKT-6GTB-ARS-20241119-214015.tgz

/tmp/sysinfo-snapshot-v3.7.7-SKT-6GTB-ARS-20241119-214015.tgz:
SKT-6GTB-ARS-20241119-214015.html
amber_info
cables
commands_txt_output
devlink
dmesg
dmidecode
ecn
err_messages:
        dummy_functions         - contains all not found commands
        dummy_paths             - contains all not existing internal files (/paths)
        dummy_external_paths    - contains all not existing external files (/paths)
etc_udev_rulesd                 - contains all files under /etc/udev/rules.d
ethtool_S                       - contains all files which are generated from invoking ethtool -S <interface>
firmware                        - contains all firmware files (mst dump files and commands outputs)
journal
lib_udev_rulesd                 - contains all files under /lib/udev/rules.d
lshw
pcie_files
performance_tuning_analyze.html
pkglist
show_irq_affinity_all
sr_iov.html
trace
var_log_dmesg
var_log_syslog

Hi @tojsm ,

According to the dmesg log, the mlx5_core driver failed to load for both PCI functions, 0000:01:00.0 and 0000:01:00.1:

[Tue Nov 19 16:02:14 2024] mlx5_core: probe of 0000:01:00.0 failed with error -16
[Tue Nov 19 16:04:14 2024] mlx5_core: probe of 0000:01:00.1 failed with error -16
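
To look for this kind of failure in your own kernel log, here is a minimal sketch. It assumes the messages use the same format as the two lines above; the helper name `mlx5_probe_failures` is made up for illustration. On Linux, errno 16 is EBUSY ("Device or resource busy"), so error -16 means the probe was rejected because the device was reported busy.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<pci-address> <error-code>" for each mlx5_core probe failure.
mlx5_probe_failures() {
    sed -nE 's/.*mlx5_core: probe of ([0-9a-fA-F:.]+) failed with error (-?[0-9]+).*/\1 \2/p'
}

# Example on a live system (reading the kernel log may require root):
#   sudo dmesg | mlx5_probe_failures
```

If this prints nothing, the log contains no probe failures in that format, and the missing interfaces probably have a different cause.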

I’m still checking the provided logs, but in the meantime could you please try a cold reboot (i.e., a full AC cycle) or the IPMI chassis power cycle command (you need to install ipmitool)? I’ve seen this kind of driver-loading error in the past, and performing a cold reboot resolved it.
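
As a rough sketch of the IPMI option, the wrapper below only prints the command unless you explicitly confirm, because a chassis power cycle cuts power without shutting the OS down cleanly. The function name `power_cycle_host` and the `CONFIRM` variable are illustrative, not part of ipmitool. It assumes ipmitool is already installed (on Debian/Ubuntu systems typically via `sudo apt install ipmitool`) and that the host can reach its BMC in-band.

```shell
# Hypothetical wrapper around the real "ipmitool chassis power cycle" command.
# By default it only prints what it would do; set CONFIRM=yes to actually run it.
power_cycle_host() {
    if [ "${CONFIRM:-no}" = "yes" ]; then
        sync                                  # flush pending writes first
        sudo ipmitool chassis power cycle     # hard power off, then on
    else
        echo "dry run: sudo ipmitool chassis power cycle"
    fi
}

power_cycle_host
```

Note that a chassis power cycle typically leaves standby power applied, so it is not fully equivalent to removing AC power from the server.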

Thank you.

Dear @nhashimoto,

I appreciate your comment above.

The issue was resolved by a cold reboot, specifically a full AC cycle.
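
For anyone checking the same thing after a power cycle, here is a minimal sketch that reports whether the two interfaces are back. It assumes the names aerial00 and aerial01 from this thread and that network interfaces appear under /sys/class/net, as they do on Linux; `report_ifaces` and `NET_DIR` are illustrative names.

```shell
# Directory where the kernel exposes network interfaces (override for testing).
NET_DIR="${NET_DIR:-/sys/class/net}"

# Hypothetical helper: print whether each named interface currently exists.
report_ifaces() {
    for ifname in "$@"; do
        if [ -e "$NET_DIR/$ifname" ]; then
            echo "$ifname: present"
        else
            echo "$ifname: missing"
        fi
    done
}

report_ifaces aerial00 aerial01
```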

Thanks!

Glad to hear that!
