GPU Identification Inconsistent Between Applications

Just reporting what I think is a bug (unless I missed someone else reporting this in my searches): NVIDIA-provided applications don’t all assign the same indices to the same GPUs. This is particularly apparent between nvidia-smi and nvidia-settings. For example, on my system:

Fri Dec 28 02:36:59 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:02:00.0 Off |                  N/A |
| 27%   27C    P8     6W / 151W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:04:00.0 Off |                  N/A |
| 27%   28C    P8     5W / 151W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1070    Off  | 00000000:05:00.0  On |                  N/A |
| 23%   35C    P8     9W / 151W |    345MiB /  8097MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Let’s use nvidia-smi to check each GPU’s UUID individually. You can query them all at once (one-shot form shown below) and they do list in 0, 1, 2… order, but for sanity:

$ nvidia-smi -i 0 -a | grep "GPU UUID"
GPU UUID: GPU-310441ba-7850-35be-2b20-8de8f14ce17b

$ nvidia-smi -i 1 -a | grep "GPU UUID"
GPU UUID: GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3

$ nvidia-smi -i 2 -a | grep "GPU UUID"
GPU UUID: GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99
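
For reference, the same check in one shot through nvidia-smi’s CSV query interface; given the results above, this should print:

$ nvidia-smi --query-gpu=index,uuid --format=csv,noheader
0, GPU-310441ba-7850-35be-2b20-8de8f14ce17b
1, GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3
2, GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99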

Now, if we use the --query gpus option of nvidia-settings, the following is what I get:

3 GPUs on linstat:0

    [0] hostname:0[gpu:0] (GeForce GTX 1070)

      Has the following names:
        GPU-0
        GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99

    [1] hostname:0[gpu:1] (GeForce GTX 1070)

      Has the following names:
        GPU-1
        GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3

    [2] hostname:0[gpu:2] (GeForce GTX 1070)

      Has the following names:
        GPU-2
        GPU-310441ba-7850-35be-2b20-8de8f14ce17b

Is there a reason for this reversal of IDs? Checking the PCI buses through nvidia-smi:

$ nvidia-smi -a | grep -e "Bus Id" -e "UUID"
    GPU UUID                        : GPU-310441ba-7850-35be-2b20-8de8f14ce17b
        Bus Id                      : 00000000:02:00.0
    GPU UUID                        : GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3
        Bus Id                      : 00000000:04:00.0
    GPU UUID                        : GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99
        Bus Id                      : 00000000:05:00.0

So it appears nvidia-smi assigns indices based on (or at least, in line with) PCI bus order, while nvidia-settings goes off of something else; I’m not sure what yet. Personally, I would also like to know why my buses keep ending up in this order, since my GPUs are physically plugged in in the order nvidia-settings is detecting.
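
I still need to check what the kernel itself reports; lspci can list the NVIDIA devices without any of the NVIDIA tools involved (10de is NVIDIA’s PCI vendor ID), which should show whether the 02/04/05 ordering comes from Linux’s PCI enumeration rather than from the driver:

$ lspci -d 10de: | grep -i vga

That should print one “VGA compatible controller” line per card, in the order the kernel enumerated them (the grep just filters out the cards’ HDMI audio functions).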

Mobo: ASRock X99 WS
GPUs:

  • Display: PCIe 1
  • Headless: PCIe 2
  • Headless: PCIe 4

Mapping those physical PCIe slot numbers to the PCI bus numbers, the order comes out exactly reversed:
slot 1 → bus 05
slot 2 → bus 04
slot 4 → bus 02
In other words, nvidia-settings indexes the cards in physical slot order, while nvidia-smi indexes them in ascending PCI bus order, which here is the reverse of the slot order. Whether this is a driver issue or a Linux issue, I’m not sure. Any ideas or suggestions would be appreciated. I’m working on an automated fan control system (rough sketch below), and this is just something I caught while doing preliminary research.
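
For the fan controller, my plan to sidestep the index mismatch is to key everything off the UUID: nvidia-smi’s -i flag accepts a UUID directly, and the nvidia-settings index can be recovered by parsing the --query gpus output above. A rough sketch, assuming Coolbits fan control is enabled (Option "Coolbits" "4" in xorg.conf), that fan:N pairs with gpu:N, and that the grep offsets match the output format shown above:

#!/bin/sh
# Sketch: drive one fan by GPU UUID instead of a tool-specific index.
# Needs a running X server (DISPLAY set) for nvidia-settings.
UUID="GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99"   # the display GPU from above

# nvidia-smi takes the UUID directly via -i:
TEMP=$(nvidia-smi -i "$UUID" --query-gpu=temperature.gpu --format=csv,noheader)

# nvidia-settings only takes [gpu:N]; recover N from the --query gpus listing
# (the UUID sits four lines below its "[gpu:N]" header in the output above):
IDX=$(nvidia-settings --query gpus | grep -B 4 "$UUID" \
      | grep -o '\[gpu:[0-9]*\]' | tr -dc '0-9')

# Crude placeholder policy: 60% fan above 60 C, otherwise 40%.
[ "$TEMP" -gt 60 ] && SPEED=60 || SPEED=40
nvidia-settings -a "[gpu:$IDX]/GPUFanControlState=1" \
                -a "[fan:$IDX]/GPUTargetFanSpeed=$SPEED"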

Cheers,
Mike
