One of the GPUs on my server (4x Gigabyte 3090 Turbo) crashes intermittently, and I have to reboot the server every time it happens. I’ve been logging the power draw, temperature, etc. (see attached image). Nothing seems out of the ordinary up to the point where the GPU crashes. However, the log shows “[Unknown Error]” for fan speed at the moment of the crash - not sure if it means anything.
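In case it helps anyone looking at a similar log, here is a minimal Python sketch for pulling out the rows that contain “[Unknown Error]”. It assumes the log is CSV text of the kind produced by something like `nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu,fan.speed --format=csv -l 5`; the field list, the interval, and the function name `find_error_rows` are assumptions for illustration, not necessarily how my log was recorded.

```python
import csv
import io

# String that shows up in the log when a value could not be read
ERROR_MARKER = "[Unknown Error]"


def find_error_rows(csv_text):
    """Return (line_number, row) pairs for rows where any field contains ERROR_MARKER.

    csv_text is the full log as one string; line numbers start at 1
    and count the header line if there is one.
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    hits = []
    for line_number, row in enumerate(reader, start=1):
        if any(ERROR_MARKER in field for field in row):
            hits.append((line_number, row))
    return hits
```

Reading the log file into a string and passing it to `find_error_rows` should give the timestamp and GPU index of each failed reading, which could then be matched against the time of the crash.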
I’ve also attached the bug report, but I don’t know how to read it. Would anyone be able to help me figure out what caused the crash, please? It’s been really annoying.
Thanks a lot, I’d much appreciate your help!
nvidia-bug-report.20230516.log (2.8 MB)