Hi all,
I’ve got a 2x H100 system in an ASUS ESC4000A-E12. One of the GPUs falls off the bus under load – it never happens when the system is idle.
Key facts:
- The system has 2x 2200W redundant power supplies, the power cables have been checked and the GPUs have been reseated, so I assume it’s not a power issue.
- Thermal – temperature according to nvidia-smi doesn’t exceed 70°C
- Firmware (1.1.47) and BIOS (0801) are the most recent updates
- CUDA drivers 535.104.12, Linux kernel version 6.2.16-060216-generic on a freshly installed and otherwise empty Ubuntu Server 22.04
- no fancy grub options
- key BIOS settings are set in accordance with the NVIDIA recommendations (https://docs.nvidia.com/certification-programs/pdf/nvidia-certified-configuration-guide.pdf), i.e.:
o ECC Memory: ENABLED
o ACS: DISABLED
o IOMMU: DISABLED
o CPU Virtualization: DISABLED
o PCIe ACS Enable: ENABLED
o PCIe TenBit Tag Support: ENABLED
o PCIe Relaxed Ordering: ENABLED
Additional info: the kernel log is constantly flooded with corrected PCIe errors like this one:
[ 1351.683330] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 1351.683335] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 1351.683337] {5}[Hardware Error]: event severity: corrected
[ 1351.683338] {5}[Hardware Error]: Error 0, type: corrected
[ 1351.683339] {5}[Hardware Error]: section_type: PCIe error
[ 1351.683340] {5}[Hardware Error]: port_type: 0, PCIe end point
[ 1351.683341] {5}[Hardware Error]: version: 0.2
[ 1351.683342] {5}[Hardware Error]: command: 0x0406, status: 0x0010
[ 1351.683343] {5}[Hardware Error]: device_id: 0000:c2:00.0
[ 1351.683344] {5}[Hardware Error]: slot: 0
[ 1351.683345] {5}[Hardware Error]: secondary_bus: 0x00
[ 1351.683345] {5}[Hardware Error]: vendor_id: 0x10de, device_id: 0x2331
[ 1351.683346] {5}[Hardware Error]: class_code: 030200
[ 1351.683347] {5}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
[ 1351.683390] nvidia 0000:c2:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 1351.683414] nvidia 0000:c2:00.0: [ 0] RxErr (First)
[ 1351.683416] nvidia 0000:c2:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
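To put a number on how often this happens, here is a minimal Python sketch that tallies the `aer_status` lines per device and decodes the set bits. The helper names (`decode_status`, `tally_aer`) are my own and not part of any NVIDIA or kernel tool. It assumes the log format shown above and that every matched line is a corrected error, so the bit names are taken from the PCIe AER Correctable Error Status register as the Linux kernel prints them.

```python
import re
from collections import Counter

# Bits of the PCIe AER Correctable Error Status register, with the
# short names the Linux kernel prints (e.g. "RxErr" in the log above).
AER_CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error (physical layer)
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

# Matches e.g. "nvidia 0000:c2:00.0: AER: aer_status: 0x00000001, ..."
# Assumption: only corrected-error reports are fed in.
AER_LINE = re.compile(
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"aer_status: 0x(?P<status>[0-9a-fA-F]+)"
)


def decode_status(status):
    """Return the names of all correctable-error bits set in status."""
    return [name for bit, name in AER_CORRECTABLE_BITS.items()
            if status & (1 << bit)]


def tally_aer(log_text):
    """Count (device, error name) pairs over all aer_status lines."""
    counts = Counter()
    for match in AER_LINE.finditer(log_text):
        status = int(match.group("status"), 16)
        for name in decode_status(status):
            counts[(match.group("bdf"), name)] += 1
    return counts
```

Feeding it the saved output of `dmesg` (read into a string) gives one counter entry per device and error type. For the excerpt above the only entry would be for `0000:c2:00.0` with `RxErr`, which matches the "Physical Layer / Receiver ID" line.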
Any idea what I could do to rule out a software issue? I’m close to declaring the H100 card the culprit, but would like to get some input from this forum first.
Best regards, Martin