Finding the physical slot location of an Nvidia GPU in an 8-GPU system

donaldx.macek · July 12, 2019, 10:58pm

Hello all,
This might be more of a Linux question than an Nvidia question, but this particular issue pertains to Nvidia GPUs. I work in a lab that uses about 20 servers, each with 8 Nvidia GPUs. From time to time one or two GPUs will go bad. I know for a fact that a GPU is bad because I can reset it and it goes bad again. I have replacements.

That is not my issue. The issue is locating the bad GPU among the good ones quickly. When I run nvidia-smi I can find the bad GPU and its UUID, but the order of the nvidia-smi list does not match the physical order of the slots on the motherboard. For instance, GPU0 is not in PCI slot 1 on the motherboard. So I have to turn off and unplug the server, take out one GPU, plug the server in again and see whether that was the one I wanted. If it isn’t, I have to put back the GPU I pulled and move on to the next one, and so on until I reach the one I want. I have gone through all the GPUs before, with the last one I pulled being the one I was looking for. It’s a time-consuming process.

I would like a way either to flag the GPU, maybe by temporarily increasing its fan speed so I can physically spot it, or to find the slot on the motherboard from the bus ID, which I can find.
Any suggestions?

generix · July 12, 2019, 11:39pm

The slot# to PCIe bus# mapping should be in the server docs. If not, you’ll have to document it yourself on first install, which pays off if you have a lot of the same model.
If no docs are available, you could use the fan speed method: run an X server on each GPU with the appropriate Coolbits option set in xorg.conf, then use nvidia-settings to set the fan speed to 100%. This does not work on every GPU type.
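
As a rough illustration of the fan speed method, here is a minimal Python sketch. It assumes that `nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv,noheader` is available for listing the GPUs, and that the driver exposes the `GPUFanControlState` and `GPUTargetFanSpeed` attributes through nvidia-settings (which needs a running X server with Coolbits enabled). The function names are made up for this example, and the GPU and fan indices used by nvidia-settings are not guaranteed to match the nvidia-smi index, so treat them as something to verify on your own hardware.

```python
import csv
import io

# Command whose output parse_gpu_list() expects (one "index, uuid, bus_id" row per GPU).
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,uuid,pci.bus_id",
    "--format=csv,noheader",
]


def parse_gpu_list(csv_text):
    """Parse the CSV output of NVIDIA_SMI_QUERY into a list of dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, uuid, bus_id = (field.strip() for field in row)
        gpus.append({"index": int(index), "uuid": uuid, "bus_id": bus_id})
    return gpus


def bus_id_for_uuid(gpus, uuid):
    """Return the PCI bus ID of the GPU with the given UUID, or None."""
    for gpu in gpus:
        if gpu["uuid"] == uuid:
            return gpu["bus_id"]
    return None


def fan_command(gpu_index, fan_index, percent=100):
    """Build an nvidia-settings command that forces one fan to a fixed speed.

    Assumption: gpu_index and fan_index are the indices nvidia-settings uses,
    which may differ from the nvidia-smi index.
    """
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu_index}]/GPUFanControlState=1",
        "-a", f"[fan:{fan_index}]/GPUTargetFanSpeed={percent}",
    ]
```

To use it, the output of `subprocess.run(NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True).stdout` could be passed to `parse_gpu_list`, and the list returned by `fan_command` could be run the same way once the right indices are known.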

ricpruss · October 6, 2022, 9:02am

Is this still as good as it gets, with no physical slot query possible in all these new Nvidia APIs?
Thanks,
Ric

generix · October 6, 2022, 9:04am

How should the Nvidia driver know anything about how the mainboard was built? You can check dmidecode; maybe the board manufacturer was nice enough to fill it in.
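
A minimal Python sketch of the dmidecode check might look like this. It assumes the output of `dmidecode -t slot` contains `Designation:` and `Bus Address:` lines for each slot, which only holds if the board manufacturer filled them in. The function names are hypothetical. Note also that on boards with PCIe switches the reported bus address may belong to an upstream bridge rather than to the GPU itself, in which case the lookup below returns nothing.

```python
def normalize_bus_id(bus_id):
    """Normalize a PCI address to lowercase 'dddd:bb:dd.f'.

    nvidia-smi prints an 8-digit domain (00000000:3B:00.0) while
    dmidecode prints 4 digits (0000:3b:00.0), so compare normalized forms.
    """
    domain, bus, devfn = bus_id.strip().lower().split(":")
    return f"{int(domain, 16):04x}:{bus}:{devfn}"


def parse_dmidecode_slots(text):
    """Map normalized bus address -> slot designation from `dmidecode -t slot` text."""
    slots = {}
    designation = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Designation:"):
            designation = line.split(":", 1)[1].strip()
        elif line.startswith("Bus Address:") and designation is not None:
            address = line.split(":", 1)[1].strip()
            try:
                slots[normalize_bus_id(address)] = designation
            except ValueError:
                pass  # address not in domain:bus:device.function form
            designation = None
    return slots


def slot_for_gpu(gpu_bus_id, slots):
    """Return the slot designation for a GPU bus ID, or None if DMI has no match."""
    return slots.get(normalize_bus_id(gpu_bus_id))
```

The text passed to `parse_dmidecode_slots` would come from running `dmidecode -t slot` as root, and the GPU bus ID from nvidia-smi.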

cian_roboticsmasters · October 12, 2022, 11:17am

Hey @generix ,

If you don’t know what slot a GPU is in on the mainboard (because the NVIDIA driver apparently doesn’t know), then what happens in a data centre when a GPU dies?

How do they know which GPU is dead in a data centre machine?

There has to be some way of reporting or knowing this other than the UUID or manually turning the fans off.

generix · October 12, 2022, 1:43pm

A good datacentre engineer should manage that in advance if the mainboard manufacturer didn’t enter this in DMI. Also, please keep in mind that the PCI bus ID to slot assignment is not necessarily fixed and might change depending on BIOS/UEFI firmware settings, e.g. by enabling or disabling certain on-board devices like Ethernet ports.
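
One way to manage this in advance, sketched in Python under the assumption that you record a slot label to bus ID table by hand at install time (the slot labels and function names here are made up for the example), is to store the table as JSON and compare it with the current state after any firmware change:

```python
import json


def dump_slot_map(slot_map):
    """Serialize a {slot label: PCI bus ID} table for storage alongside the server docs."""
    return json.dumps(slot_map, indent=2, sort_keys=True)


def changed_slots(recorded, current):
    """Return {slot: (recorded bus ID, current bus ID)} for every slot that differs.

    A slot missing from one side shows up with None on that side.
    """
    changes = {}
    for slot in sorted(set(recorded) | set(current)):
        old, new = recorded.get(slot), current.get(slot)
        if old != new:
            changes[slot] = (old, new)
    return changes
```

An empty result from `changed_slots` suggests the recorded table can still be trusted; anything else means the mapping should be re-checked before pulling a card.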
