Finding the physical slot location of an Nvidia GPU in 8-GPU systems

Hello all
This might be more of a Linux question than an Nvidia question, but this particular issue pertains to Nvidia GPUs. I work in a lab that uses about 20 servers, each with 8 Nvidia GPUs. From time to time one or two GPUs will go bad. I know for a fact a GPU is bad when I can reset it and it goes bad again. I have replacements.

That is not my issue; the issue is locating the bad GPU among the good ones quickly. When I run nvidia-smi I can identify the bad GPU and its UUID. But the order of the list in nvidia-smi does not match the physical order of the slots on the motherboard. For instance, GPU0 is not in PCI slot 1 on the motherboard. So I have to turn off and unplug the server, take out one GPU, plug the server in again, and see if that is the one I wanted. If it isn’t, I have to go back, put back the GPU I pulled out, and try the next GPU, and so on until I come to the one I want. I’ve gone through all the GPUs before, with the last one I pulled being the one I was looking for. It’s a time-consuming process.

I would like a way either to flag the GPU, maybe by increasing its fan speed temporarily so I can physically locate it, or to find the slot on the motherboard from the bus ID, which I can find.
Any suggestions?

The slot number to PCIe bus number mapping should be in the server docs. If not, you’ll have to document it yourself on first install, which pays off if you have a lot of the same model.
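
Once you have that mapping written down, looking up a bad GPU can be scripted. Here is a rough sketch in Python, not a tested tool: the slot labels and bus IDs in SLOT_BY_BUS_ID are made-up placeholders you would replace with the values for your own server model, and the function names are only for illustration. It relies on nvidia-smi's index, uuid and pci.bus_id query fields.

```python
import subprocess

# Hypothetical, hand-maintained table for one server model:
# PCI bus ID (as printed by nvidia-smi) -> physical slot label.
# These values are placeholders, not real data.
SLOT_BY_BUS_ID = {
    "00000000:1A:00.0": "slot 1",
    "00000000:3B:00.0": "slot 2",
}


def parse_gpu_list(csv_text):
    """Parse 'index, uuid, bus_id' lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, uuid, bus_id = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "uuid": uuid, "bus_id": bus_id.upper()})
    return gpus


def query_gpus():
    """Ask nvidia-smi for index, UUID and PCI bus ID of every GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid,pci.bus_id",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_list(result.stdout)


def locate(uuid, gpus, slot_by_bus_id):
    """Return the slot label for a GPU UUID, or None if the UUID is not listed."""
    for gpu in gpus:
        if gpu["uuid"] == uuid:
            # Fall back to the raw bus ID when the table has no entry for it.
            return slot_by_bus_id.get(gpu["bus_id"],
                                      "unknown slot (bus %s)" % gpu["bus_id"])
    return None
```

On a server you would call something like locate(bad_uuid, query_gpus(), SLOT_BY_BUS_ID) and go straight to the slot it names.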
If no docs are available, you could use the fan speed method: run an X server on each GPU with the appropriate Coolbits option set in xorg.conf, then use nvidia-settings to set the fan speed to 100%. This does not work on every GPU type.
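
As a rough sketch of the fan speed method, the Python helpers below only build the nvidia-settings command lines; they do not run anything. They assume an X server is already running on the GPU with Coolbits enabling fan control, and that you know which fan index belongs to which GPU index (that pairing is an assumption you have to check on your own hardware). The helper names are made up for illustration; GPUFanControlState and GPUTargetFanSpeed are the nvidia-settings attributes used.

```python
def fan_full_speed_command(gpu_index, fan_index, speed_percent=100):
    """Build an nvidia-settings call that takes manual fan control
    on one GPU and sets the target speed of one of its fans."""
    if not 0 <= speed_percent <= 100:
        raise ValueError("speed_percent must be between 0 and 100")
    return [
        "nvidia-settings",
        "-a", "[gpu:%d]/GPUFanControlState=1" % gpu_index,
        "-a", "[fan:%d]/GPUTargetFanSpeed=%d" % (fan_index, speed_percent),
    ]


def fan_restore_command(gpu_index):
    """Build an nvidia-settings call that hands fan control back to the driver."""
    return ["nvidia-settings", "-a", "[gpu:%d]/GPUFanControlState=0" % gpu_index]
```

You could pass either list to subprocess.run with DISPLAY pointing at the X server for that GPU, listen for the loud card, and then run the restore command.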