nvidia-smi does not show one of 4 GPUs, but lspci shows all of them

Until two days ago, I was able to run my code on all 4 GPUs. I didn't change any system settings, and nobody else has access to the machine. Now I can see all 4 GPUs in the lspci output, but only 3 of them in nvidia-smi. The following is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 45%   76C    P2    69W / 250W |    767MiB / 11264MiB |     21%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 37%   69C    P2    66W / 250W |    767MiB / 11264MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:83:00.0 Off |                  N/A |
| 35%   67C    P2    66W / 250W |    701MiB / 11264MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     31415      C   python                            765MiB |
|    1   N/A  N/A     31829      C   python                            765MiB |
|    2   N/A  N/A     31901      C   python                            699MiB |
+-----------------------------------------------------------------------------+

The following is the output of lspci:

02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
02:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
82:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
83:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
83:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
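
For anyone who wants to find the missing card without comparing the two listings by eye, here is a minimal Python sketch that diffs the PCI addresses. The function names (lspci_gpu_addresses, smi_gpu_addresses, missing_gpus) are made up for this post, and the parsing assumes the output formats shown above: lspci lines starting with bus:device.function, and nvidia-smi bus IDs with an 8-digit domain prefix.

```python
import re

# Optional PCI domain (e.g. "0000:" from lspci -D), then bus:device.function.
_BDF = re.compile(r"(?:[0-9A-Fa-f]{4,8}:)?([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7])")

# nvidia-smi prints Bus-Id with an 8-digit domain, e.g. 00000000:02:00.0.
_SMI_BUS_ID = re.compile(r"[0-9A-Fa-f]{8}:([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7])")


def lspci_gpu_addresses(lspci_text):
    """Return bus:device.function for every NVIDIA VGA/3D controller in lspci output."""
    found = set()
    for line in lspci_text.splitlines():
        is_gpu = "VGA compatible controller" in line or "3D controller" in line
        if "NVIDIA" in line and is_gpu:
            match = _BDF.match(line.strip())
            if match:
                found.add(match.group(1).lower())
    return found


def smi_gpu_addresses(smi_text):
    """Return bus:device.function for every bus ID that appears in nvidia-smi output."""
    return {m.group(1).lower() for m in _SMI_BUS_ID.finditer(smi_text)}


def missing_gpus(lspci_text, smi_text):
    """GPUs that the PCI bus reports but the driver does not list."""
    return sorted(lspci_gpu_addresses(lspci_text) - smi_gpu_addresses(smi_text))
```

Feeding it the two outputs from this post should single out 82:00.0, the one address that lspci lists and nvidia-smi does not. The nvidia-smi text can be the full table or the shorter output of nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader.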

You should check dmesg for errors.

I get the following output:

[14123.750341] nvidia: loading out-of-tree module taints kernel.
[14123.750351] nvidia: module license 'NVIDIA' taints kernel.
[14123.750351] Disabling lock debugging due to kernel taint
[14123.768201] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[14123.778660] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[14123.780863] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[14123.896852] nvidia 0000:03:00.0: enabling device (0100 -> 0103)
[14123.896906] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[14124.012768] nvidia 0000:82:00.0: enabling device (0100 -> 0103)
[14124.012916] nvidia 0000:82:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[14124.128863] nvidia 0000:83:00.0: enabling device (0100 -> 0103)
[14124.128912] nvidia 0000:83:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[14124.244753] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
[14124.257189] nvidia-uvm: Loaded the UVM driver, major device number 511.
[14124.258924] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 515.65.01 Wed Jul 20 13:43:59 UTC 2022
[14124.260086] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[14124.260087] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 0
[14124.260161] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[14124.260162] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[14124.260242] [drm] [nvidia-drm] [GPU ID 0x00008200] Loading driver
[14124.260243] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:82:00.0 on minor 2
[14124.260316] [drm] [nvidia-drm] [GPU ID 0x00008300] Loading driver
[14124.260317] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:83:00.0 on minor 3
[14124.264480] [drm] [nvidia-drm] [GPU ID 0x00008300] Unloading driver
[14124.264649] [drm] [nvidia-drm] [GPU ID 0x00008200] Unloading driver
[14124.264804] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[14124.264969] [drm] [nvidia-drm] [GPU ID 0x00000200] Unloading driver
[14124.293600] nvidia-modeset: Unloading
[14124.323235] nvidia-uvm: Unloaded the UVM driver.
[14124.351166] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[14147.704002] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[14147.706303] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[14147.822307] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[14147.938191] nvidia 0000:82:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[14148.054202] nvidia 0000:83:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[14148.170025] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Jul 20 14:00:58 UTC 2022
[14148.174900] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 515.65.01 Wed Jul 20 13:43:59 UTC 2022
[14148.176223] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
[14148.176225] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:00.0 on minor 0
[14148.176321] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[14148.176323] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[14148.176406] [drm] [nvidia-drm] [GPU ID 0x00008200] Loading driver
[14148.176407] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:82:00.0 on minor 2
[14148.176489] [drm] [nvidia-drm] [GPU ID 0x00008300] Loading driver
[14148.176490] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:83:00.0 on minor 3
[14239.677508] pmd_set_huge: Cannot satisfy [mem 0xd0000000-0xd0200000] with a huge-page mapping due to MTRR override.
[21381.861983] nvidia-uvm: Loaded the UVM driver, major device number 235.
[289287.689646] perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[338053.814027] perf: interrupt took too long (3143 > 3137), lowering kernel.perf_event_max_sample_rate to 63500
[339381.569222] perf: interrupt took too long (3938 > 3928), lowering kernel.perf_event_max_sample_rate to 50750
[351659.120270] perf: interrupt took too long (4924 > 4922), lowering kernel.perf_event_max_sample_rate to 40500
[464335.167597] perf: interrupt took too long (9710 > 6155), lowering kernel.perf_event_max_sample_rate to 20500
[2601349.687259] NVRM: GPU at PCI:0000:82:00: GPU-7cd10988-e9e6-7e1c-b496-bd8642739e13
[2601349.687262] NVRM: Xid (PCI:0000:82:00): 62, pid='', name=, 0a97(2a70) 05029059 ffffffb5
[2601442.339785] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x25:0xffff:1428)
[2601442.339832] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[2601442.599818] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x23:0xffff:1382)
[2601442.599853] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[2601444.417360] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x23:0xffff:1382)
[2601444.417398] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[2601444.441429] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x23:0xffff:1382)
[2601444.441462] NVRM: GPU 0000:82:00.0: rm_init_adapter failed, device minor number 2
[2601450.883701] NVRM: GPU 0000:82:00.0: RmInitAdapter failed! (0x23:0xffff:1382)

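The missing GPU (0000:82:00.0) is inaccessible: dmesg shows an Xid 62 on it, followed by repeated RmInitAdapter failures, so the driver can no longer initialize that card. Please reboot. Furthermore, you need to set the nvidia-persistenced daemon to start on boot.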

How do I set up nvidia-persistenced? I tried both nvidia-persistenced and /usr/bin/nvidia-persistenced --verbose, and both fail with the same error: "nvidia-persistenced failed to initialize. Check syslog for more details." When I try sudo systemctl enable nvidia-persistenced, I get: "Failed to enable unit: Unit file nvidia-persistenced.service does not exist."