I would like to know why my GPUs die.
During deep learning training, the GPUs die.
When the problem occurs, I have to reboot the server to recover.
The server specifications are as follows.
CPU : Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Motherboard : X11DPX-T
GPU : 8 x NVIDIA RTX 3090 24GB, water cooled
A GPU falling off the bus is often caused by a BIOS, power supply, or thermal issue. Please update the BIOS, and verify that the system has an adequate power supply and no thermal issues.
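One way to check the power and thermal side is to log what each GPU reports while training runs. The following is a minimal sketch, not an official NVIDIA tool: it assumes `nvidia-smi` is on the PATH, and the helper names (`parse_gpu_sample`, `read_gpu_samples`, `summarize`) and the 80 C threshold are illustrative choices, not values from NVIDIA documentation.

```python
import subprocess

# Standard nvidia-smi query properties: GPU index, core temperature (C), power draw (W).
QUERY_FIELDS = "index,temperature.gpu,power.draw"


def _to_float(text):
    """Return text as a float, or None if nvidia-smi reported a non-numeric value."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_sample(line):
    """Parse one CSV line such as '0, 65, 310.25' into a dict (hypothetical helper)."""
    index, temp_c, power_w = [field.strip() for field in line.split(",")]
    return {
        "index": int(index),
        "temp_c": _to_float(temp_c),
        "power_w": _to_float(power_w),
    }


def read_gpu_samples():
    """Run nvidia-smi once and return one dict per visible GPU.

    subprocess.check_output raises CalledProcessError if nvidia-smi exits
    with a non-zero status, which may itself be a useful signal.
    """
    output = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [parse_gpu_sample(line) for line in output.splitlines() if line.strip()]


def summarize(samples, temp_limit_c=80.0):
    """Return (total reported power in W, indices of GPUs hotter than temp_limit_c).

    temp_limit_c is an illustrative threshold, not an NVIDIA specification.
    Samples with a missing reading are skipped for that field.
    """
    total_power_w = sum(s["power_w"] for s in samples if s["power_w"] is not None)
    hot = [
        s["index"]
        for s in samples
        if s["temp_c"] is not None and s["temp_c"] > temp_limit_c
    ]
    return total_power_w, hot
```

Calling `summarize(read_gpu_samples())` every few seconds during training and writing the result to a file gives a record of the last readings before a failure. The total covers only what the GPUs report, so CPU, pump, and fan load still has to be added when comparing against the power supply rating.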