nvmlDeviceGetMigDeviceHandleByIndex returns wrong MIG devices when some MIG devices are deleted

How to reproduce:

1. Create 3 MIG devices.
2. Call nvmlInit().
3. Delete 1 of the 3 MIG devices.
4. Call nvmlDeviceGetMigDeviceHandleByIndex() three times (i = 0, 1, 2). It returns a MIG device with NVML_SUCCESS for all 3 indices, not just 2.
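
The steps above could be turned into a small standalone reproducer along the following lines. This is only a sketch, not the utility used in this post: count_mig_devices, mig_lookup_fn, fake_lookup, struct fake_mig and the USE_NVML switch are names made up for illustration, and GPU index 0 is assumed to be the MIG-enabled device. The NVML calls sit behind USE_NVML so the probing loop can also be compiled and exercised on a machine without a GPU.

```c
#include <stdio.h>

/* Hypothetical callback: returns 0 and fills gi/ci if a MIG device exists
 * at `index`, nonzero otherwise. */
typedef int (*mig_lookup_fn)(void *ctx, unsigned int index,
                             unsigned int *gi, unsigned int *ci);

/* Probe indices 0..max_count-1, print every hit, return the number of hits. */
unsigned int count_mig_devices(mig_lookup_fn lookup, void *ctx,
                               unsigned int max_count)
{
    unsigned int found = 0;
    for (unsigned int i = 0; i < max_count; ++i) {
        unsigned int gi = 0, ci = 0;
        if (lookup(ctx, i, &gi, &ci) != 0)
            continue; /* nothing reported at this index */
        printf("index %u: GI %u, CI %u\n", i, gi, ci);
        ++found;
    }
    return found;
}

/* Table-backed lookup for dry runs without a GPU (illustrative only).
 * The caller must not probe past the end of the table. */
struct fake_mig { int present; unsigned int gi, ci; };

int fake_lookup(void *ctx, unsigned int index,
                unsigned int *gi, unsigned int *ci)
{
    const struct fake_mig *table = ctx;
    if (!table[index].present)
        return 1;
    *gi = table[index].gi;
    *ci = table[index].ci;
    return 0;
}

#ifdef USE_NVML
#include <nvml.h>

/* Adapter from the callback type to the real NVML calls. */
static int nvml_lookup(void *ctx, unsigned int index,
                       unsigned int *gi, unsigned int *ci)
{
    nvmlDevice_t parent = *(nvmlDevice_t *)ctx;
    nvmlDevice_t mig;
    if (nvmlDeviceGetMigDeviceHandleByIndex(parent, index, &mig) != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetGpuInstanceId(mig, gi) != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetComputeInstanceId(mig, ci) != NVML_SUCCESS)
        return 1;
    return 0;
}

int main(void)
{
    nvmlDevice_t gpu;
    unsigned int max_count = 0;

    if (nvmlInit_v2() != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetHandleByIndex_v2(0, &gpu) != NVML_SUCCESS ||
        nvmlDeviceGetMaxMigDeviceCount(gpu, &max_count) != NVML_SUCCESS) {
        nvmlShutdown();
        return 1;
    }

    printf("before: %u MIG devices\n",
           count_mig_devices(nvml_lookup, &gpu, max_count));

    /* Step 3: delete one MIG device from another shell while NVML stays
     * initialized in this process. */
    printf("Delete one MIG device now, then press Enter.\n");
    getchar();

    printf("after: %u MIG devices\n",
           count_mig_devices(nvml_lookup, &gpu, max_count));

    nvmlShutdown();
    return 0;
}
#endif
```

With the driver and NVML header installed, something like `gcc -DUSE_NVML repro.c -lnvidia-ml` should build it; an include path for nvml.h may also be needed, depending on the installation. If the "after" count still matches the "before" count once a MIG device has been deleted, that matches the behavior described in step 4.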

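Below is the reproduced problem. lsload is my binary, which uses the NVML API. nvidia-smi lists 2 MIG devices (GI 8 and GI 10), while the NVML-based output still lists 3, including GI 9, which no longer appears in the nvidia-smi table.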

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    8   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@ma1gpu12 conf]# lshosts -gpu -mig
HOST_NAME   gpu_id       gpu_model   gpu_driver   gpu_factor      numa_id       vendor        devid          gid          cid       inst_name
ma1gpu12         0 NVIDIAA100_PCIE    470.57.02          8.0            0       Nvidia            0            8            0          1g.5gb
                 0 NVIDIAA100_PCIE    470.57.02          8.0            0       Nvidia            1            9            0          1g.5gb
                 0 NVIDIAA100_PCIE    470.57.02          8.0            0       Nvidia            2           10            0          1g.5gb

My suggestion would be to file a bug. Be advised that if you do so, you will probably be asked for the complete code for your utility, as well as the full set/list of shell commands and nvidia-smi commands that you followed.

@Robert_Crovella

Hi Robert,

Thanks for your suggestion. Could you please tell me where I can file a bug with NVIDIA?

Regards,
James

@Robert_Crovella Sorry, I didn’t realize “file a bug” is a link :) Opening a bug now. Thanks!