We run inference almost every hour on our server, which has 3 Tesla V100 GPU blades. Often, 1 of the 3 GPU blades appears to go into hibernation, leaving only 2 GPU blades active: nvidia-smi shows only two blades, while lspci shows that all 3 GPUs are still attached. However, when we restart the server, all 3 GPU blades are up and running again. Is there any reason why this happens?
Is there a log file I can look at to understand the issue better?
Is there any solution other than a restart to bring the 3rd GPU back up?
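
In case it helps with reproducing the symptom, here is a minimal sketch of the comparison described above: count the GPUs the driver reports through nvidia-smi and the NVIDIA devices lspci still lists, then flag a mismatch. It assumes a Linux host with `nvidia-smi` and `lspci` on the PATH. The function names are hypothetical, and the filter on the "3D controller" / "VGA compatible controller" class strings is an assumption about how the V100s show up in lspci output on this machine.

```python
import subprocess


def count_driver_gpus(nvidia_smi_list_output: str) -> int:
    """Count GPUs in `nvidia-smi -L` output (one 'GPU <n>: ...' line per device)."""
    return sum(
        1 for line in nvidia_smi_list_output.splitlines()
        if line.startswith("GPU ")
    )


def count_pci_gpus(lspci_output: str) -> int:
    """Count NVIDIA display-class devices in `lspci` output.

    Assumption: the GPUs are listed with class '3D controller' or
    'VGA compatible controller' and the vendor string contains 'NVIDIA'.
    """
    gpu_classes = ("3D controller", "VGA compatible controller")
    return sum(
        1 for line in lspci_output.splitlines()
        if "NVIDIA" in line and any(cls in line for cls in gpu_classes)
    )


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, or '' if it cannot be started."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    except OSError:
        return ""


if __name__ == "__main__":
    driver_count = count_driver_gpus(run(["nvidia-smi", "-L"]))
    pci_count = count_pci_gpus(run(["lspci"]))
    print(f"GPUs seen by driver: {driver_count}, GPUs on PCI bus: {pci_count}")
    if driver_count < pci_count:
        print("Mismatch: at least one GPU is attached but not visible to the driver.")
```

Run from cron shortly before each inference job, something like this would at least record when the third GPU disappears, which could then be lined up against the system logs.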