The server has 4 RTX A6000 cards. After a month and a half of operation, one card disappeared.
We rebooted, and only 2 cards were left. We rebooted again, and 1 card was left. After a full power cycle of the server, 0 cards were detected.
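To pin down at which boot a card drops out, the visible GPU count can be logged after every restart. Below is a minimal sketch, assuming the NVIDIA driver and `nvidia-smi` are installed on the server. The function names (`parse_gpu_list`, `missing_gpus`, `query_gpus`, `report`) and the expected count of 4 are illustrative only and not part of any existing tool.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse the CSV output of
    `nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader`
    into a list of dicts, one per visible GPU."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = [field.strip() for field in line.split(",")]
        # First field is the index, last is the PCI bus ID,
        # anything in between is the product name.
        gpus.append({
            "index": int(parts[0]),
            "name": ",".join(parts[1:-1]),
            "bus_id": parts[-1],
        })
    return gpus


def missing_gpus(gpus, expected=4):
    """Return how many of the expected cards are not visible (assumed: 4)."""
    return max(expected - len(gpus), 0)


def query_gpus():
    """Ask nvidia-smi for the GPUs the driver can currently see."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,pci.bus_id",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_list(result.stdout)


def report(expected=4):
    """Print the visible cards and how many are missing."""
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # nvidia-smi is absent or exited with an error.
        print(f"nvidia-smi failed: {exc}")
        return
    for gpu in gpus:
        print(f"GPU {gpu['index']}: {gpu['name']} at {gpu['bus_id']}")
    print(f"visible: {len(gpus)}, missing: {missing_gpus(gpus, expected)}")
```

Calling `report()` from a script that runs at boot and appending the output to a log file would show which PCI bus IDs vanish first, which may help tell a failing slot or riser from a failing card.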
The supplier took the server, checked it and said:
We received the server without GPUs and the BMC/IPMI was not reset.
I have installed 4 GPUs and they are all working properly. Swapped around slots, rebooted several times and ran cburn test, no issues.
I will clear the BMC settings and try to update the BIOS as recommended by me.
When this is done you can pick up the server again as the problem does not occur.
GIGABYTE 2U G242-Z11 quad GPU Server
1 x AMD EPYC Naples 7551
1 x GIGABYTE X550-T2 dual 10GbE RJ45