Finding the physical slot location of an Nvidia GPU in an 8-GPU system

Hello all
This might be more of a Linux question than an Nvidia question, but this particular issue pertains to the Nvidia GPUs. I work in a lab that uses about 20 servers, each with 8 Nvidia GPUs. From time to time one or two GPUs will go bad. I know for a fact a GPU is bad: I can reset it and it goes bad again. I have replacements.

That is not my issue; the issue is quickly locating the bad GPU among the good ones. When I run nvidia-smi I can identify the bad GPU and its UUID, but the nvidia-smi list does not correspond to the physical order of the slots on the motherboard. For instance, GPU 0 is not in PCI slot 1. So I have to power off and unplug the server, pull one GPU, boot again, and check whether that was the one I wanted. If it wasn't, I put it back and try the next GPU, and so on until I find the right one. I've gone through every GPU before, with the last one I pulled being the one I was looking for. It's a time-consuming process.

I would like either a way to mark the GPU, maybe by temporarily increasing its fan speed so I can physically locate it, or a way to find the motherboard slot from the bus ID, which I can look up.
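For example, I can already list the UUID and bus ID of every GPU like this (field names as listed by nvidia-smi --help-query-gpu):

```
nvidia-smi --query-gpu=index,name,uuid,pci.bus_id --format=csv
```

What I'm missing is the step from that bus ID to a physical slot on the board.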
Any suggestions?

The slot# to PCIe bus# mapping should be in the server docs. If not, you'll have to document it yourself on first install, which pays off if you have a lot of servers of the same model.
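If the firmware exposes physical slot information to the kernel, a ready-made mapping may also be sitting under /sys/bus/pci/slots; it is not populated on every platform, but where it is, each slot directory holds the bus address of whatever sits in that slot:

```
grep -H . /sys/bus/pci/slots/*/address
```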
If no docs are available, you could use the fan speed method: run an X server on each GPU with the appropriate Coolbits option set in xorg.conf, then use nvidia-settings to set the fan speed to 100%. Does not work on every GPU type.
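A rough sketch of that approach; the nvidia-xconfig options are documented in the driver README, but fan indices and attribute names can differ between GPU models and driver versions:

```
# write an xorg.conf covering all GPUs with fan control enabled (Coolbits bit 2)
sudo nvidia-xconfig --enable-all-gpus --cool-bits=4 --allow-empty-initial-configuration

# after restarting X, crank the fan on the suspect GPU, e.g. GPU 2
# (the fan index does not always match the GPU index)
DISPLAY=:0 nvidia-settings -a '[gpu:2]/GPUFanControlState=1' \
                           -a '[fan:2]/GPUTargetFanSpeed=100'
```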

Is this still as good as it gets, with no physical slot query possible through all these new NVIDIA APIs?
Thanks,
Ric

How should the NVIDIA driver know anything about how the mainboard was built? You can check dmidecode; maybe the board manufacturer was nice enough to fill it in.
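If it is filled in, the type 9 (System Slot) records show which slot designation maps to which bus address; whether the Bus Address field is present at all depends entirely on the BIOS:

```
sudo dmidecode -t slot | grep -E 'Designation|Bus Address'
```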

Hey @generix,

If you don’t know what slot a GPU is in on the mainboard (because the NVIDIA driver apparently doesn’t know), then what happens in a data centre when a GPU dies?

How do they know which GPU is dead in a data centre machine?

There has to be some way of reporting or knowing this other than the UUID or manually changing the fan speed.

A good datacentre engineer should map that out in advance if the mainboard manufacturer didn’t enter it in the DMI tables. Also, please keep in mind that the PCI bus ID to slot assignment is not necessarily fixed; it might change depending on BIOS/UEFI firmware settings, e.g. by disabling/enabling certain on-board devices like Ethernet ports.
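As a sketch of documenting it in advance, assuming the BIOS does fill in the slot records: something like the following regenerates the GPU-to-slot table and can be rerun after any firmware change. Treat it as a starting point; on boards that put a PCIe switch between the CPU and the GPUs, the slot record may point at the switch rather than the GPU itself, so check lspci -t if nothing matches.

```
#!/usr/bin/env bash
# Sketch: print GPU index, UUID, bus ID and (if the BIOS provides it) the DMI slot designation.

# Build a "bus address -> slot designation" table from the DMI type 9 records.
sudo dmidecode -t slot | awk -F': ' '
    /Designation/ { slot = $2 }
    /Bus Address/ { print tolower($2), slot }
' > /tmp/slotmap

# Walk the NVIDIA GPUs and look up each bus ID in that table.
nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader | \
while IFS=',' read -r idx uuid busid; do
    # nvidia-smi prints e.g. "00000000:3B:00.0", dmidecode uses "0000:3b:00.0"
    addr=$(echo "$busid" | tr -d ' ' | tr '[:upper:]' '[:lower:]' | sed 's/^00000000:/0000:/')
    slot=$(grep "^$addr " /tmp/slotmap | cut -d' ' -f2-)
    echo "GPU${idx// /}: uuid=${uuid// /} bus=$addr slot=${slot:-unknown}"
done
```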