Steps to reproduce:
1. Create 3 MIG devices.
2. Call nvmlInit().
3. Delete 1 of the 3 MIG devices.
4. Call nvmlDeviceGetMigDeviceHandleByIndex() three times (i = 0, 1, 2). All three calls return NVML_SUCCESS with a MIG handle, although only 2 MIG devices remain.
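The enumeration in step 4 can be sketched roughly as follows. This is only an illustration, not the code of my utility: it assumes the Python NVML bindings (nvidia-ml-py, imported as pynvml) instead of the C API, and list_mig_devices is a hypothetical helper name. The NVML module is passed in as a parameter so the loop can also be exercised without a GPU.

```python
def list_mig_devices(nvml, gpu_index=0):
    """Return (mig_index, gpu_instance_id, compute_instance_id) for every
    MIG device that NVML reports on the given GPU.

    `nvml` is expected to be an initialized NVML binding module such as
    pynvml (an assumption of this sketch); nvmlInit() must already have
    been called by the caller.
    """
    gpu = nvml.nvmlDeviceGetHandleByIndex(gpu_index)
    max_count = nvml.nvmlDeviceGetMaxMigDeviceCount(gpu)

    found = []
    for i in range(max_count):
        try:
            mig = nvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except nvml.NVMLError:
            # The bindings raise an error for a slot with no MIG device
            # (NVML_ERROR_NOT_FOUND in the C API). This sketch skips any
            # NVML error here, which is broader than strictly necessary.
            continue
        gi_id = nvml.nvmlDeviceGetGpuInstanceId(mig)
        ci_id = nvml.nvmlDeviceGetComputeInstanceId(mig)
        found.append((i, gi_id, ci_id))
    return found
```

To follow the steps above, one would import pynvml, call pynvml.nvmlInit(), delete one MIG device from another shell, and then print list_mig_devices(pynvml). If the behavior I describe applies, the list would still contain an entry for the deleted device.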
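The output below shows the problem; lsload is my binary that uses the NVML API. nvidia-smi lists only two MIG devices (GI 8 and 10), while my binary still reports three (gid 8, 9, and 10).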
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    8   0   0  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@ma1gpu12 conf]# lshosts -gpu -mig
HOST_NAME  gpu_id  gpu_model        gpu_driver  gpu_factor  numa_id  vendor  devid  gid  cid  inst_name
ma1gpu12   0       NVIDIAA100_PCIE  470.57.02   8.0         0        Nvidia  0      8    0    1g.5gb
           0       NVIDIAA100_PCIE  470.57.02   8.0         0        Nvidia  1      9    0    1g.5gb
           0       NVIDIAA100_PCIE  470.57.02   8.0         0        Nvidia  2      10   0    1g.5gb
My suggestion would be to file a bug. Be advised that if you do, you will probably be asked for the complete code for your utility, as well as the full list of shell and nvidia-smi commands that you ran.
Hi Robert,
Thanks for your suggestion. Could you please tell me where I can file a bug with NVIDIA?
Regards,
James