NVIDIA GPUs on Ubuntu 22.04 LTS: one GPU keeps disappearing after installing the NVIDIA driver

I have a brand-new Ubuntu desktop with two GPUs: an A30 and a GeForce RTX 4080.
The NVIDIA SDK was pre-installed, but I could not find an NVIDIA driver.
So I installed the recommended driver, version 535.161.08, and restarted the system.
After rebooting, the main console does not show the GUI (reboot.jpg), but I can log in from a terminal. I tried updating /etc/default/grub with the following:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash loglevel=3 ibt=off"
I also tried running startx from the terminal, but it gave an error (startx.jpg).

On the terminal, nvidia-smi works fine right after rebooting (see screenshot), but after about 30 minutes it gives the following error:

$ nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

$ nvidia-smi -L
Unable to determine the device handle for gpu 0000:01:00.0: Unknown Error
GPU 1: NVIDIA GeForce RTX 4080 SUPER (UUID: GPU-def59770-4aed-bccf-e063-2c72ab0ac873)

I would like to know how I can:

  1. get the GUI back on the main console
  2. get both GPUs working

Your suggestions would be greatly appreciated.

Archana
I am including screenshots of the ubuntu-drivers devices command output for more information.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thanks for prompting me. Here is the bug report

Archana

Can’t download, it wants a user and password.

Sorry, I have shared it now.

The A30 is shutting down, I suspect due to overheating, since you are running it on a desktop mainboard. The A30 needs external fans to cool it; you can’t just put it in a desktop.

Thanks, Generix.

I would like to clarify that by ‘desktop’ I meant Ubuntu desktop OS.
My computer is a workstation PC (not a desktop) and has an integrated water cooler.
So I am not sure if overheating is the cause. Is there a way I can check that in the bug report or otherwise?

Archana

In the bug report, the GPU is already off, so I can’t see its temperature. To check, please shut down the computer and leave it off for half an hour to let it cool down. Immediately after turning it on and logging in, run
nvidia-smi -q -d TEMPERATURE -l 1
in a terminal; it will display the GPU temperature in a loop.
Other possible reasons for the GPU shutting down would be a lack of power, though I doubt that since the A30 only draws 165 W, or a defective GPU.
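
If a compact, timestamped log is easier to share than the full -q output, something like the following might help. It is only a sketch: the function names log_gpu_temps and flag_hot are made up for illustration, and the limit passed to flag_hot is whatever value you choose, not an NVIDIA specification. The query fields (timestamp, index, name, temperature.gpu) are standard nvidia-smi --query-gpu properties.

```shell
#!/usr/bin/env bash
# Sketch: log GPU temperatures as CSV and flag readings at or above a limit.
# Function names are hypothetical; nothing here is an official NVIDIA script.

# Print one CSV row per GPU per second: timestamp, index, name, temperature (C).
# Requires the NVIDIA driver to be loaded; stop with Ctrl-C.
log_gpu_temps() {
    nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu \
               --format=csv,noheader,nounits -l 1
}

# Read rows in the format above on stdin and print only those whose
# temperature (4th field) is at or above the limit given as $1.
# The limit is a user-chosen value, not taken from any NVIDIA document.
flag_hot() {
    awk -F', ' -v limit="$1" '$4 + 0 >= limit + 0 { print }'
}
```

For example, log_gpu_temps | tee gpu_temp.csv | flag_hot 85 would keep the full log in gpu_temp.csv while printing only the rows at or above 85 C (85 is just an example value).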

Thanks Generix for your help!

I shut down the system overnight. Then I restarted it and prepared bug report 1.
I also monitored the GPU temperature using nvidia-smi -q in a loop, as you suggested. The log runs until the A30 shuts down. Indeed, the A30 temperature shoots up beyond the allowed maximum, and then the card shuts down. I am sharing the temperature log that I created. I also prepared another bug report 2 at that point.
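
A saved log of that kind can be summarized with a few lines of awk. This is a rough sketch that assumes the log contains lines labelled "GPU Current Temp", as printed by nvidia-smi -q -d TEMPERATURE on recent drivers; the label may differ on other driver versions, and peak_temp is a hypothetical helper name. If the log covers both GPUs, the result is the peak across all of them.

```shell
#!/usr/bin/env bash
# Sketch: report the highest "GPU Current Temp" value found in a saved
# nvidia-smi -q -d TEMPERATURE log. The field label is an assumption
# about the log format; adjust the pattern if your driver prints another.
peak_temp() {
    awk -F: '/GPU Current Temp/ {
        t = $2 + 0                 # " 91 C" converts to the number 91
        if (t > max) max = t
    } END { print max + 0 }' "$@"  # reads the given file(s), or stdin if none
}
```

Calling peak_temp on the saved log file prints a single number, the hottest reading in degrees C, which can then be compared with the shutdown and slowdown temperatures listed in the same log.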

Is there a possibility that the fan is shutting off? The fan speed for the A30 is shown as N/A (see smi_q3.log). What does that mean?

I read that the sensors command will show both fan speeds and GPU temperatures.

Archana

That’s what I was trying to tell you: it doesn’t have a fan of its own; it needs an external fan blowing through it. It’s built for special GPU servers that provide the airflow, not desktops. Check eBay for a hose to mount a fan on it, or a water-cooling mod if one exists.
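
To see at a glance which cards report no fan, a query like the one below could be used. It is a sketch only: query_fans and list_fanless are made-up names, and it assumes that nvidia-smi prints a non-numeric placeholder such as [N/A] in the fan.speed column when the board has no fan of its own to report.

```shell
#!/usr/bin/env bash
# Sketch: list GPUs for which nvidia-smi reports no fan speed.
# Function names are hypothetical helpers, not part of any NVIDIA tool.

# Print one CSV row per GPU: index, name, fan speed (percent, no unit).
query_fans() {
    nvidia-smi --query-gpu=index,name,fan.speed --format=csv,noheader,nounits
}

# Read rows in the format above on stdin and print "index: name" for every
# GPU whose fan field does not start with a digit (for example "[N/A]").
list_fanless() {
    awk -F', ' '$3 !~ /^[0-9]/ { print $1 ": " $2 }'
}
```

Running query_fans | list_fanless on a machine like this one would be expected to list the A30 and not the GeForce card, as long as the A30 is still visible to the driver.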

Thank you for helping me figure it out. I

Archana