H100 GPU has fallen off the bus -- every day

Hi all,

I’ve got a 2x H100 system in an ASUS ESC4000A-E12. One of the GPUs falls off the bus under load; it does not happen without load.

Key facts:

  • The system has 2x 2200W redundant power supplies. The power cables have been checked and the GPUs have been reseated, so I assume it’s not a power issue.
  • Thermal – temperature according to nvidia-smi doesn’t exceed 70°C
  • Firmware (1.1.47) and BIOS (0801) are the most recent versions
  • CUDA drivers 535.104.12, Linux kernel version 6.2.16-060216-generic on a freshly installed and otherwise empty Ubuntu Server 22.04
  • no fancy grub options
  • key BIOS settings follow the NVIDIA recommendations (https://docs.nvidia.com/certification-programs/pdf/nvidia-certified-configuration-guide.pdf), that is:
    o ECC Memory: ENABLED
    o CPU Virtualization: DISABLED
    o PCIe ACS Enable: ENABLED
    o PCIe TenBit Tag Support: ENABLED
    o PCIe Relaxed Ordering: ENABLED

Additional info: the kernel log constantly fills up with a LOT of error messages like these:

[ 1351.683330] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 1351.683335] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 1351.683337] {5}[Hardware Error]: event severity: corrected
[ 1351.683338] {5}[Hardware Error]: Error 0, type: corrected
[ 1351.683339] {5}[Hardware Error]: section_type: PCIe error
[ 1351.683340] {5}[Hardware Error]: port_type: 0, PCIe end point
[ 1351.683341] {5}[Hardware Error]: version: 0.2
[ 1351.683342] {5}[Hardware Error]: command: 0x0406, status: 0x0010
[ 1351.683343] {5}[Hardware Error]: device_id: 0000:c2:00.0
[ 1351.683344] {5}[Hardware Error]: slot: 0
[ 1351.683345] {5}[Hardware Error]: secondary_bus: 0x00
[ 1351.683345] {5}[Hardware Error]: vendor_id: 0x10de, device_id: 0x2331
[ 1351.683346] {5}[Hardware Error]: class_code: 030200
[ 1351.683347] {5}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
[ 1351.683390] nvidia 0000:c2:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 1351.683414] nvidia 0000:c2:00.0:    [ 0] RxErr (First)
[ 1351.683416] nvidia 0000:c2:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
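
To put a number on "a LOT", the log can be tallied per PCIe device. The following is only a rough sketch, assuming the kernel log keeps the format shown above. The function name `count_pcie_errors` is made up for illustration. It counts the APEI "device_id: <bus address>" lines, which appear once per reported event in the excerpt above.

```python
import re
from collections import Counter

# Matches the APEI line that names the reporting device, e.g.
#   "[Hardware Error]: device_id: 0000:c2:00.0"
# The "vendor_id: 0x10de, device_id: 0x2331" line is not matched,
# because there "device_id" does not directly follow the tag and the
# value is not a PCI bus address.
_DEVICE_LINE = re.compile(
    r"\[Hardware Error\]:\s+device_id:\s+"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def count_pcie_errors(log_text: str) -> Counter:
    """Count reported hardware error events per PCI bus address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _DEVICE_LINE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts
```

Feeding it the saved output of `dmesg` (for example `count_pcie_errors(open("dmesg.txt").read())`, where `dmesg.txt` is a hypothetical file name) shows whether the errors all come from one address such as 0000:c2:00.0 or are spread across both GPUs.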

Any idea what I could do to rule out a software issue? I’m close to declaring the H100 card the culprit, but would also like to get some help from this forum.

Best regards, Martin

All H100 PCIE GPUs come with a no-additional-cost license for NVIDIA AI Enterprise.

Another option is to contact the system vendor. I assume you purchased those H100 GPUs as part of the system purchase from ASUS.

To help show that it is not a software issue, try running an NVIDIA sample code such as nbody repeatedly. If the problem reproduces that way, it is unlikely to be related to the software application you are running.
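
One possible way to automate the repeated run is sketched below. This is only an illustration: the helper name `run_repeatedly` and the timeout value are made up here. It runs a command up to a given number of times and reports the first iteration that fails or hangs.

```python
import subprocess
from typing import Optional, Sequence


def run_repeatedly(
    cmd: Sequence[str], iterations: int, timeout_s: float = 600.0
) -> Optional[int]:
    """Run cmd up to `iterations` times.

    Returns the zero-based index of the first run that exits with a
    non-zero status or exceeds the timeout, or None if every run passes.
    """
    for i in range(iterations):
        try:
            result = subprocess.run(list(cmd), timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # A hung sample is treated as a failure as well.
            return i
        if result.returncode != 0:
            return i
    return None
```

For example, `run_repeatedly(["./nbody", "-benchmark"], 100)`, assuming the sample has been built and sits in the current directory (the path and the flag are assumptions and may differ between CUDA samples versions). If it returns an iteration number, check `nvidia-smi` and the kernel log right after the failure to see whether the GPU has dropped off the bus again.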