2nd GPU not showing in nvidia-smi in Ubuntu 22.04

I’ve been having issues with my machine not detecting my second GPU (both are RTX 3090s). This is not a new machine; the issue first appeared a few weeks ago, and I resolved it then by rolling back to an older kernel (version unknown). After a recent update I lost that kernel, and now I’m stuck with the issue again.

Here’s what I’ve tried so far:

  • Swapping GPUs in their PCI slots to rule out a hardware issue
  • Updating to the latest motherboard BIOS
  • Fresh 22.04 install for each driver install below
  • Every NVIDIA CUDA install (>= 11.7) from the NVIDIA downloads page (deb local, deb network and run file)
  • Every Ubuntu nvidia-driver* as far back as I can go to maintain a minimum CUDA version of 11.7
  • Rolling back to an arbitrarily old kernel version (5.15) using mainline
  • Rolling forward to kernel 6.4
  • Booting with an HDMI monitor attached to GPU 2

*Note that all older Ubuntu nvidia-driver-5XX packages are transitional packages to either 525 or 535 (see apt search nvidia-driver). The last driver with which both GPUs worked was 515.

The single GPU that is listed (also my display GPU) does run CUDA workloads, but it seems to make my system unstable and laggy for a few minutes after a PyTorch job starts.

❯ uname -r
5.19.0-46-generic
❯ lspci | grep VGA
09:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
43:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

❯ nvidia-smi
Sat Jul  1 12:11:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:43:00.0  On |                  N/A |
|  0%   41C    P8    24W / 350W |    562MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1879      G   /usr/lib/xorg/Xorg                140MiB |
|    0   N/A  N/A      2338    C+G   ...ome-remote-desktop-daemon      258MiB |
|    0   N/A  N/A      2375      G   /usr/bin/gnome-shell               87MiB |
|    0   N/A  N/A      3338      G   ...566776601308618822,262144       73MiB |
+-----------------------------------------------------------------------------+
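The mismatch between the two listings above can also be checked mechanically. The following is a minimal sketch, not something from the original post: it compares the NVIDIA controllers reported by lspci with the bus IDs reported by nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader, and returns the addresses the driver did not pick up. The function names (normalize, lspci_nvidia_gpus, missing_gpus) are hypothetical, and matching on the "VGA compatible controller" and "3D controller" class strings is an assumption about how the cards are listed.

```python
import re

# Captures bus:device.function from either "43:00.0" (lspci style)
# or "00000000:43:00.0" (nvidia-smi style, with a PCI domain prefix).
PCI_ADDR = re.compile(
    r"(?:[0-9a-fA-F]{4,8}:)?([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def normalize(addr):
    """Return the lowercase bus:device.function part of a PCI address, or None."""
    match = PCI_ADDR.search(addr)
    return match.group(1).lower() if match else None


def lspci_nvidia_gpus(lspci_text):
    """Collect PCI addresses of NVIDIA display controllers from lspci output."""
    found = set()
    for line in lspci_text.splitlines():
        is_gpu_class = "VGA compatible controller" in line or "3D controller" in line
        if "NVIDIA" in line and is_gpu_class:
            addr = normalize(line.split()[0])
            if addr:
                found.add(addr)
    return found


def missing_gpus(lspci_text, smi_bus_ids_text):
    """Return addresses that lspci lists but nvidia-smi does not, sorted."""
    seen_by_driver = set()
    for line in smi_bus_ids_text.splitlines():
        addr = normalize(line)
        if addr:
            seen_by_driver.add(addr)
    return sorted(lspci_nvidia_gpus(lspci_text) - seen_by_driver)
```

Fed the lspci output above and a single bus ID of 00000000:43:00.0 from nvidia-smi, missing_gpus reports 09:00.0 as the card the driver is not exposing. That address is the one to look for in dmesg.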

dmesg
Link to GitHub Gist
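Since the full dmesg log is long, it may help to narrow it down to the lines that concern the driver or the missing card. This is a rough sketch under assumptions: the function name driver_lines is hypothetical, the filter relies on NVIDIA kernel module messages containing "NVRM" or "nvidia", and the PCI address passed in (for example 09:00.0) is whichever one lspci lists but nvidia-smi does not.

```python
def driver_lines(dmesg_text, bus_addr):
    """Return dmesg lines mentioning the NVIDIA kernel module or the given PCI address."""
    needle = bus_addr.lower()
    hits = []
    for line in dmesg_text.splitlines():
        lowered = line.lower()
        # Keep driver messages and anything logged against the PCI device itself.
        if "nvrm" in lowered or "nvidia" in lowered or needle in lowered:
            hits.append(line)
    return hits
```

Running this on the saved dmesg text from a boot where the second GPU is missing, and again from a boot where it appears, gives two short lists that are easier to compare than the full logs.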

The weird thing is that very occasionally, after a fresh CUDA install (not isolated to a single driver version) and a restart, the 2nd GPU does show up in nvidia-smi. But after the next reboot it disappears again. Uninstalling and reinstalling CUDA can reproduce this, but it appears to be random (and it is not something I want to do every time I reboot).

Any ideas how I can get my machine working properly again?
nvidia-bug-report.log.gz (437.4 KB)
dmesg.txt (121.3 KB)

I’ve found these two similar threads:

Hello anjum.sayed48
I have the same issue. Did you manage to fix it?

Hi there - yes it was fixed after a routine Ubuntu update. Unfortunately I don’t know what was causing it or what fixed it :(

If it helps, this is the current kernel I’m running: 6.2.0-36-generic

I’m having a similar problem. I have two 4060 Tis installed, but Ubuntu 22 only shows one.
nvidia-settings only shows GPU0.
Kernel: 6.5.0-35
CUDA toolkit: 12.5
I’m a veteran Unix developer, but this one has me stumped. Any more help would be appreciated. Thanks

lspci shows this:

1a:00.0 VGA compatible controller: NVIDIA Corporation Device 2805 (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation Device 2805 (rev a1)

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:68:00.0  On |                  N/A |
|  0%   37C    P8              5W /  165W |     317MiB /  16380MiB |     25%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1368      G   /usr/lib/xorg/Xorg                            298MiB |
|    0   N/A  N/A      1735      G   xfwm4                                           3MiB |
+-----------------------------------------------------------------------------------------+