Hi all,
About 6 months ago I built a dual 4090 server running Ubuntu 20.04 for some DL-related research.
Recently, I noticed that my syslog is filled with “Hardware error from APEI Generic Hardware Error Source: 514”, which seems related to the GPUs. From what I have observed, the error shows up roughly every 15 minutes.
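To check the 15-minute pattern more carefully, something like the following could pull the gaps between reports out of syslog. This is only a rough sketch: the helper name `apei_gaps` and the regex are my own, and it assumes every report starts with a header line carrying the kernel uptime stamp in brackets, as in the excerpt further down.

```python
import re

# Matches the first line of each APEI report and captures the kernel
# uptime stamp in seconds, e.g. "[27032.059840]".
APEI_HEADER = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+\{\d+\}\[Hardware Error\]: "
    r"Hardware error from APEI Generic Hardware Error Source"
)


def apei_gaps(lines):
    """Return the gaps in seconds between consecutive APEI reports."""
    stamps = []
    for line in lines:
        match = APEI_HEADER.search(line)
        if match:
            stamps.append(float(match.group(1)))
    # Pair each stamp with the next one and take the difference.
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


# Usage (the path is the Ubuntu default and is an assumption here):
# with open("/var/log/syslog", errors="replace") as f:
#     print(apei_gaps(f))
```

The uptime stamp restarts from zero after a reboot, so a negative gap just means the log spans a reboot.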
A few days ago, right after I started a DDP training job, my remote SSH session suddenly dropped and I couldn’t reconnect (connection reset by peer), so I had to reset the server through the BMC. (I mostly connect to the server via SSH, but I also keep a NUC right beside it that runs 24x7, so I can remote into the NUC and then reach the BMC through a browser.)
I also noticed that the GPU-Util reading from nvidia-smi sometimes behaves a bit strangely: GPU0 shows 100% while GPU1 shows 0%, and a few seconds later GPU0 shows 0% while GPU1 shows 100%. Then it returns to normal. My current training job is about sparse learning and its IO depends on the file size, so I’m not sure whether this behavior is caused by reading some extremely big files.
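To log the utilization flip instead of eyeballing it, a small poller along these lines might help. It is a sketch under assumptions: `parse_util` and `sample_util` are names I made up, and it expects `nvidia-smi` to print one `index, utilization` pair per line for the query shown, with plain integer values.

```python
import subprocess

# Ask nvidia-smi for one "index, utilization" line per GPU, without
# the CSV header or the "%" unit.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_util(text):
    """Turn lines like '0, 100' into {gpu_index: utilization_percent}."""
    util = {}
    for line in text.strip().splitlines():
        index, pct = (field.strip() for field in line.split(","))
        util[int(index)] = int(pct)
    return util


def sample_util():
    """Run nvidia-smi once and return the parsed utilization per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_util(result.stdout)
```

Calling `sample_util()` in a loop with a timestamp would show whether the 100%/0% swap lines up with the hardware error reports in syslog.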
I found that other people on the forums have posted similar issues, and some suggested upgrading the GPU driver. However, I have to stay on the current driver (525.147.05), as it fixed the NCCL P2P bug for the 4090. Does anyone know whether this bug is also fixed in the newest version of the NVIDIA driver?
Hardware details:
- CPU: AMD EPYC 7542
- Motherboard: Tyan S8030
- GPU: Gigabyte 4090 gaming
Detailed error message in syslog:
Feb 11 18:38:18 Amadeus kernel: [27032.059840] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
Feb 11 18:38:18 Amadeus kernel: [27032.059847] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
Feb 11 18:38:18 Amadeus kernel: [27032.059849] {7}[Hardware Error]: event severity: corrected
Feb 11 18:38:18 Amadeus kernel: [27032.059852] {7}[Hardware Error]: Error 0, type: corrected
Feb 11 18:38:18 Amadeus kernel: [27032.059855] {7}[Hardware Error]: section_type: PCIe error
Feb 11 18:38:18 Amadeus kernel: [27032.059856] {7}[Hardware Error]: port_type: 1, legacy PCI end point
Feb 11 18:38:18 Amadeus kernel: [27032.059858] {7}[Hardware Error]: version: 0.2
Feb 11 18:38:18 Amadeus kernel: [27032.059860] {7}[Hardware Error]: command: 0x0407, status: 0x0010
Feb 11 18:38:18 Amadeus kernel: [27032.059863] {7}[Hardware Error]: device_id: 0000:41:00.0
Feb 11 18:38:18 Amadeus kernel: [27032.059866] {7}[Hardware Error]: slot: 0
Feb 11 18:38:18 Amadeus kernel: [27032.059867] {7}[Hardware Error]: secondary_bus: 0x00
Feb 11 18:38:18 Amadeus kernel: [27032.059868] {7}[Hardware Error]: vendor_id: 0x10de, device_id: 0x2684
Feb 11 18:38:18 Amadeus kernel: [27032.059871] {7}[Hardware Error]: class_code: 030000
Feb 11 18:38:18 Amadeus kernel: [27032.059872] {7}[Hardware Error]: bridge: secondary_status: 0x7000, control: 0x0000
Feb 11 18:38:18 Amadeus kernel: [27032.059875] {7}[Hardware Error]: Error 1, type: corrected
Feb 11 18:38:18 Amadeus kernel: [27032.059877] {7}[Hardware Error]: section_type: PCIe error
Feb 11 18:38:18 Amadeus kernel: [27032.059878] {7}[Hardware Error]: port_type: 0, PCIe end point
Feb 11 18:38:18 Amadeus kernel: [27032.059880] {7}[Hardware Error]: version: 0.2
Feb 11 18:38:18 Amadeus kernel: [27032.059881] {7}[Hardware Error]: command: 0x0006, status: 0x0010
Feb 11 18:38:18 Amadeus kernel: [27032.059883] {7}[Hardware Error]: device_id: 0000:41:00.1
Feb 11 18:38:18 Amadeus kernel: [27032.059885] {7}[Hardware Error]: slot: 0
Feb 11 18:38:18 Amadeus kernel: [27032.059887] {7}[Hardware Error]: secondary_bus: 0x00
Feb 11 18:38:18 Amadeus kernel: [27032.059888] {7}[Hardware Error]: vendor_id: 0x10de, device_id: 0x22ba
Feb 11 18:38:18 Amadeus kernel: [27032.059890] {7}[Hardware Error]: class_code: 040300
Feb 11 18:38:18 Amadeus kernel: [27032.059892] {7}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Feb 11 18:38:18 Amadeus kernel: [27032.060235] nvidia 0000:41:00.0: AER: aer_status: 0x00000041, aer_mask: 0x00000000
Feb 11 18:38:18 Amadeus kernel: [27032.060243] nvidia 0000:41:00.0: [ 0] RxErr (First)
Feb 11 18:38:18 Amadeus kernel: [27032.060247] nvidia 0000:41:00.0: [ 6] BadTLP
Feb 11 18:38:18 Amadeus kernel: [27032.060250] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
Feb 11 18:38:18 Amadeus kernel: [27032.060263] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000041, aer_mask: 0x00000000
Feb 11 18:38:18 Amadeus kernel: [27032.060267] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
Feb 11 18:38:18 Amadeus kernel: [27032.060269] snd_hda_intel 0000:41:00.1: [ 6] BadTLP
Feb 11 18:38:18 Amadeus kernel: [27032.060272] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
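For reference, the `aer_status: 0x00000041` value in the log above can be decoded by hand as a bitmask. Below is a minimal sketch, assuming the standard PCIe AER correctable error status bit layout and the short names the Linux kernel prints; the function name `decode_aer_status` is mine.

```python
# Correctable error status bits from the PCIe AER capability, keyed by
# bit position, with the short names the Linux kernel prints.
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_aer_status(status):
    """List the names of all bits set in a correctable aer_status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            # Bits without a known name are reported by position.
            names.append(AER_CORRECTABLE_BITS.get(bit, f"bit{bit}"))
    return names


print(decode_aer_status(0x00000041))  # ['RxErr', 'BadTLP']
```

This matches the two lines the kernel prints for each function: bit 0 (RxErr) and bit 6 (BadTLP), both of which are correctable link-level errors.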
Cheers,
Hilbert