This might be more of a Linux question than an Nvidia question, but this particular issue pertains to Nvidia GPUs. I work in a lab that uses about 20 servers, each with 8 Nvidia GPUs in it. From time to time one or two GPUs will go bad. I know for a fact it’s bad: I can reset the GPU and it goes bad again. I have replacements.
That is not my issue. The issue is locating the bad GPU among the good ones quickly. When I run nvidia-smi I can identify the bad GPU and its UUID. But the order of the list in nvidia-smi does not match the physical order of the slots on the motherboard. For instance, GPU0 is not in PCI slot 1 on the motherboard. So I have to turn off and unplug the server, take out one GPU, power the server back on, and see whether that was the one I wanted. If it isn’t, I have to go back, put back the GPU I pulled out, and repeat with the next GPU, and so on until I come to the one I want. I’ve gone through all the GPUs before, with the last one I pulled being the one I was looking for. It’s a time-consuming process.
I would like a way either to flag the GPU, maybe by temporarily increasing its fan speed so I can physically locate it, or to find its slot on the motherboard from the Bus ID, which I can find.
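To make the Bus ID idea concrete, here is a rough Python sketch of the kind of lookup I have in mind. It assumes that `nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader` reports each GPU's PCI address and that `dmidecode -t slot` (run as root) lists a "Designation" and a "Bus Address" for each motherboard slot. The function names are just illustrative. It also assumes the firmware fills in the slot table accurately and that each GPU sits directly in a listed slot; if the GPUs are behind PCIe switches or risers, the slot's bus address may belong to an upstream bridge and there will be no direct match.

```python
import re
import subprocess


def normalize_bus_id(bus_id):
    """Reduce a PCI address to lowercase 'dddd:bb:dd.f' form.

    nvidia-smi prints an 8-digit domain (00000000:3B:00.0) while
    dmidecode prints a 4-digit one (0000:3b:00.0), so both are
    normalized before comparing.
    """
    m = re.fullmatch(
        r"([0-9a-fA-F]+):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])",
        bus_id.strip(),
    )
    if not m:
        raise ValueError(f"not a PCI address: {bus_id!r}")
    domain, bus, dev, fn = m.groups()
    return f"{int(domain, 16):04x}:{bus.lower()}:{dev.lower()}.{fn}"


def parse_gpu_list(csv_text):
    """Parse 'index, uuid, pci.bus_id' lines from nvidia-smi CSV output."""
    gpus = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        index, uuid, bus_id = parts
        gpus.append(
            {"index": index, "uuid": uuid, "bus_id": normalize_bus_id(bus_id)}
        )
    return gpus


def parse_dmidecode_slots(text):
    """Map normalized bus address -> slot designation from 'dmidecode -t slot'."""
    slots = {}
    designation = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Designation:"):
            designation = line.split(":", 1)[1].strip()
        elif line.startswith("Bus Address:") and designation is not None:
            try:
                addr = normalize_bus_id(line.split(":", 1)[1])
            except ValueError:
                continue  # firmware reported something unparseable
            slots[addr] = designation
    return slots


def main():
    gpu_csv = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid,pci.bus_id",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    # dmidecode needs root to read the SMBIOS tables.
    slot_text = subprocess.run(
        ["dmidecode", "-t", "slot"],
        capture_output=True, text=True, check=True,
    ).stdout

    slots = parse_dmidecode_slots(slot_text)
    for gpu in parse_gpu_list(gpu_csv):
        slot = slots.get(gpu["bus_id"], "no direct match (bridge or switch?)")
        print(f"GPU{gpu['index']}  {gpu['uuid']}  {gpu['bus_id']}  ->  {slot}")


if __name__ == "__main__":
    main()
```

Run as root on one of the servers, this would print one line per GPU with its UUID, Bus ID, and whatever slot designation the firmware reports, which I could then compare against the silkscreen labels on the motherboard. Whether the designations actually match the board labels is something I would still have to check on one known GPU first.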