Just reporting what I think is a bug (unless I missed someone else reporting this in my searches). NVIDIA-provided applications do not agree with each other on the identities of the GPUs; this is particularly apparent between nvidia-smi and nvidia-settings. For example, on my system:
Fri Dec 28 02:36:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:02:00.0 Off | N/A |
| 27% 27C P8 6W / 151W | 2MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1070 Off | 00000000:04:00.0 Off | N/A |
| 27% 28C P8 5W / 151W | 2MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1070 Off | 00000000:05:00.0 On | N/A |
| 23% 35C P8 9W / 151W | 345MiB / 8097MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Let’s use nvidia-smi to individually check the GPUs’ UUIDs. You can do them all at once and they do list in 0,1,2… order, but for sanity:
$ nvidia-smi -i 0 -a | grep "GPU UUID"
GPU UUID: GPU-310441ba-7850-35be-2b20-8de8f14ce17b
$ nvidia-smi -i 1 -a | grep "GPU UUID"
GPU UUID: GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3
$ nvidia-smi -i 2 -a | grep "GPU UUID"
GPU UUID: GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99
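(Side note: the per-GPU greps can be collapsed into one call with nvidia-smi's CSV query interface; the index and uuid field names are listed by nvidia-smi --help-query-gpu. Since that needs live hardware, the sketch below just notes the invocation and applies the same field extraction to one sample line from above.)

```shell
# One call instead of three greps (needs a live driver to actually run):
#   nvidia-smi --query-gpu=index,uuid --format=csv,noheader
# Offline, the same extraction applied to a sample line from above:
line='GPU UUID: GPU-310441ba-7850-35be-2b20-8de8f14ce17b'
uuid=$(printf '%s\n' "$line" | awk -F': ' '{print $2}')
printf '%s\n' "$uuid"
```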
Now, if we use the --query gpus option of nvidia-settings, here is what I get:
3 GPUs on linstat:0
[0] hostname:0[gpu:0] (GeForce GTX 1070)
Has the following names:
GPU-0
GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99
[1] hostname:0[gpu:1] (GeForce GTX 1070)
Has the following names:
GPU-1
GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3
[2] hostname:0[gpu:2] (GeForce GTX 1070)
Has the following names:
GPU-2
GPU-310441ba-7850-35be-2b20-8de8f14ce17b
Is there a reason for this reversal of IDs? If I check the PCI buses through nvidia-smi, here is what I get:
$ nvidia-smi -a | grep -e "Bus Id" -e "UUID"
GPU UUID : GPU-310441ba-7850-35be-2b20-8de8f14ce17b
Bus Id : 00000000:02:00.0
GPU UUID : GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3
Bus Id : 00000000:04:00.0
GPU UUID : GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99
Bus Id : 00000000:05:00.0
So it appears nvidia-smi assigns its indices based on (or at least in line with) ascending PCI bus addresses, while nvidia-settings goes off something else; I'm not quite sure what yet. Personally, I'd also like to know why my buses keep ending up in this order, even though my GPUs are physically plugged in in the order nvidia-settings detects them.
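To make that correlation explicit, the UUID/Bus pairs can be joined and sorted by bus address. A small sketch, run here against the sample output pasted above (on a live system you could pipe nvidia-smi -a | grep -e "Bus Id" -e "UUID" through the same awk):

```shell
# Join each "GPU UUID" line with the "Bus Id" line that follows it,
# then sort by bus address. nvidia-smi's 0,1,2 order matches this sort.
pairs=$(printf '%s\n' \
  'GPU UUID : GPU-310441ba-7850-35be-2b20-8de8f14ce17b' \
  'Bus Id : 00000000:02:00.0' \
  'GPU UUID : GPU-27d29cdc-c85d-a69f-a3fa-ca3dd74cded3' \
  'Bus Id : 00000000:04:00.0' \
  'GPU UUID : GPU-bfcc67ed-7d6d-7202-179a-e3da0799eb99' \
  'Bus Id : 00000000:05:00.0' |
  awk -F' : ' '/GPU UUID/ {u=$2} /Bus Id/ {print $2, u}' | sort)
printf '%s\n' "$pairs"
```

The first line of the sorted output is bus 02:00.0 paired with the UUID nvidia-smi calls GPU 0, and the last is bus 05:00.0 paired with the UUID it calls GPU 2.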
Mobo: ASRock X99 WS
GPUs:
- Display: PCIe 1
- Headless: PCIe 2
- Headless: PCIe 4
Matching the physical PCIe slot numbers to the PCI bus addresses, the order comes out reversed:
slot 1 → bus 05:00.0
slot 2 → bus 04:00.0
slot 4 → bus 02:00.0
which is what nvidia-smi reports. Whether this is a driver issue or a Linux issue, I'm not sure. Any ideas or suggestions would be appreciated. I'm working on an automated fan control system, and this is just something I caught while doing preliminary research.
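For what it's worth, given that the two tools disagree on indices, a fan-control script seems safer keyed on UUID rather than index. A minimal sketch of the parsing, assuming nvidia-smi's CSV query interface (uuid and fan.speed are among the fields listed by nvidia-smi --help-query-gpu; the sample row below is made up for the offline demo, since fan.speed needs live hardware):

```shell
# Live command, one CSV row per GPU regardless of index order:
#   nvidia-smi --query-gpu=uuid,fan.speed --format=csv,noheader
# Parse one such row into a "UUID -> speed" pair; sample row is hypothetical.
row='GPU-310441ba-7850-35be-2b20-8de8f14ce17b, 27 %'
uuid=${row%%,*}                                      # text before the first comma
speed=$(printf '%s\n' "$row" | awk -F', ' '{print $2}')  # text after ", "
printf '%s -> %s\n' "$uuid" "$speed"
```

Keying on the UUID means the script keeps working even if the index order changes between reboots or between tools.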
Cheers,
Mike