I am using two H100 servers (8x H100, RoCE). When training an LLM or running NCCL tests, I see the error messages below in the dmesg log. NCCL bandwidth drops when the errors occur, and training tends to fail with NCCL timeout errors. Any suggestions?
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: event severity: corrected
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: Error 0, type: corrected
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: section_type: PCIe error
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: port_type: 0, PCIe end point
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: version: 3.0
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: command: 0x0506, status: 0x0010
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: device_id: 0001:1a:00.0
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: slot: 0
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: secondary_bus: 0x00
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: vendor_id: 0x10de, device_id: 0x2330
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: class_code: 030200
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: It has been corrected by h/w and requires no further action
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: event severity: corrected
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: Error 0, type: corrected
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: section_type: PCIe error
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: port_type: 6, downstream switch port
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: version: 3.0
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: command: 0x0107, status: 0x0010
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: device_id: 0000:18:04.0
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: slot: 5
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: secondary_bus: 0x22
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: vendor_id: 0x1000, device_id: 0xc030
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: class_code: 060400
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0002
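
To see which device reports the errors most often, the entries above could be tallied per PCI address. This is only a minimal sketch that assumes the dmesg format shown in this post; the function name `count_corrected_pcie_errors` and the `SAMPLE` text are hypothetical and not part of any existing tool. It also assumes the `{N}` sequence number identifies one event within the text being parsed.

```python
import re
from collections import Counter

# Matches the "{N}[Hardware Error]: ..." lines in the format shown above.
EVENT_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)")
# PCI address in domain:bus:device.function form, e.g. 0001:1a:00.0.
# The "vendor_id: ..., device_id: 0x2330" line does not match this pattern.
BDF_RE = re.compile(
    r"device_id:\s*([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def count_corrected_pcie_errors(dmesg_text):
    """Return a Counter mapping PCI address to corrected PCIe error events."""
    events = {}  # sequence number -> fields collected for that event
    for line in dmesg_text.splitlines():
        m = EVENT_RE.search(line)
        if not m:
            continue
        event = events.setdefault(m.group(1), {})
        body = m.group(2)
        if body.startswith("event severity:"):
            event["severity"] = body.split(":", 1)[1].strip()
        elif body.startswith("section_type:"):
            event["section"] = body.split(":", 1)[1].strip()
        else:
            bdf = BDF_RE.match(body)
            if bdf:
                event["bdf"] = bdf.group(1)

    counts = Counter()
    for event in events.values():
        if (
            event.get("severity") == "corrected"
            and event.get("section") == "PCIe error"
            and "bdf" in event
        ):
            counts[event["bdf"]] += 1
    return counts


# Hypothetical sample: a shortened copy of the two events in this post.
SAMPLE = """\
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: event severity: corrected
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: section_type: PCIe error
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: device_id: 0001:1a:00.0
[Tue Oct 29 07:35:10 2024] {312}[Hardware Error]: vendor_id: 0x10de, device_id: 0x2330
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: event severity: corrected
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: section_type: PCIe error
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: device_id: 0000:18:04.0
[Tue Oct 29 07:35:30 2024] {313}[Hardware Error]: vendor_id: 0x1000, device_id: 0xc030
"""

for address, count in count_corrected_pcie_errors(SAMPLE).items():
    print(address, count)
```

Running this over a full `dmesg` capture from each server would show whether the corrected errors cluster on one GPU or one switch port, which may help narrow down a slot, riser, or cable to check.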