A100 on a CentOS 7 server gets removed after a couple of minutes

OK, got it. It is related to overheating. I was checking the temperature with nvidia-smi: it rises to 95C, and then the GPU gets disconnected.

Just in case somebody reads this entry, please check the temperature.

while true; do sleep 1; nvidia-smi >> output.txt; done

Check output.txt once your GPU disappears.
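
If the full nvidia-smi dump is too noisy, here is a minimal sketch that logs only a timestamp and the temperature, and then pulls the peak value out of the log. The function names log_gpu_temp and max_gpu_temp and the file name gpu_temp.csv are my own placeholders, not part of nvidia-smi. It also assumes the CSV rows look like "timestamp, temperature", which is what I would expect from these query fields.

```shell
#!/bin/sh
# Hypothetical helper: append one "timestamp, temperature" row per second.
# Uses nvidia-smi's query mode; -l 1 repeats the query every second.
log_gpu_temp() {
    nvidia-smi --query-gpu=timestamp,temperature.gpu \
               --format=csv,noheader,nounits -l 1 >> "${1:-gpu_temp.csv}"
}

# Hypothetical helper: print the highest temperature found in the log.
# Assumes each row is "<timestamp>, <temperature>".
max_gpu_temp() {
    awk -F', ' '($2 + 0) > max { max = $2 + 0 } END { print max + 0 }' "$1"
}

# Usage (run the logger in the background, inspect the log after the GPU drops):
#   log_gpu_temp gpu_temp.csv &
#   max_gpu_temp gpu_temp.csv
```

If the peak is close to the 95C I saw before the card disappeared, cooling is the first thing to look at.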

Have a look at this thread: A100 crashes within 10 minutes due to over-heating on Ubuntu 18.04 (without any workload) - #6 by generix