I have a brand-new Ubuntu desktop with two GPUs: an A30 and a GeForce RTX 4080.
The NVIDIA SDK was pre-installed, but I could not find the NVIDIA driver.
So I installed the recommended driver, version 535.161.08, and restarted the system.
After rebooting, I cannot see the GUI on the main console (reboot.jpg), but I can log in from a terminal. I tried updating /etc/default/grub with the following:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash loglevel=3 ibt=off"
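On Ubuntu, an edit to /etc/default/grub only takes effect after the GRUB configuration is regenerated and the machine is rebooted. Here is a minimal sketch for confirming that the running kernel actually received the new parameter. The helper name has_kernel_param is hypothetical, not part of any tool, and the optional second argument exists only so that the check can be pointed at a file other than /proc/cmdline.

```shell
#!/usr/bin/env bash
# Typical sequence after editing /etc/default/grub (shown as comments only):
#   sudo update-grub     # regenerate /boot/grub/grub.cfg
#   sudo reboot
#   has_kernel_param ibt=off && echo "ibt=off is active"

# has_kernel_param PARAM [CMDLINE_FILE]
# Hypothetical helper: succeeds if PARAM appears as a whole word in the
# kernel command line (default: /proc/cmdline of the running kernel).
has_kernel_param() {
    local param="$1" file="${2:-/proc/cmdline}" word
    local -a words
    # Split the single command-line record into whitespace-separated words.
    read -r -a words < "$file" || true
    for word in "${words[@]}"; do
        [[ "$word" == "$param" ]] && return 0
    done
    return 1
}
```

If the check fails after a reboot, the usual suspects are straight versus curly quotes in the GRUB file or a skipped update-grub step.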
I also tried running startx from the terminal, but it gave an error (startx.jpg).
In the terminal, the nvidia-smi command works fine right after rebooting (see screenshot), but after about 30 minutes it gives the following error:
$nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
$nvidia-smi -L
Unable to determine the device handle for gpu 0000:01:00.0: Unknown Error
GPU 1: NVIDIA GeForce RTX 4080 SUPER (UUID: GPU-def59770-4aed-bccf-e063-2c72ab0ac873)
The A30 is shutting down, I suspect due to overheating, since you are running it on a desktop mainboard. The A30 needs external fans to cool it; you can’t just put it in a desktop.
I would like to clarify that by ‘desktop’ I meant the Ubuntu Desktop OS.
My computer is a workstation PC (not a desktop) and has an integrated water cooler.
So I am not sure if overheating is the cause. Is there a way I can check that in the bug report or otherwise?
In the bug report, the GPU is already off, so I can’t see its temperature. To check, please shut down the computer and leave it off for half an hour to let it cool down. Immediately after turning it on and logging in, run
nvidia-smi -q -d TEMPERATURE -l 1
in a terminal; it will display the GPU temperature in a loop.
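As a more compact alternative to the full -q output, the same check can be sketched with nvidia-smi's CSV query mode, which prints one line per GPU per second and is easier to log and scan. The helper name flag_hot and the default threshold of 85 are assumptions made for illustration, not NVIDIA limits; the actual slowdown and shutdown temperatures for a given card are listed in the nvidia-smi -q -d TEMPERATURE output.

```shell
#!/usr/bin/env bash
# Intended use (shown as a comment only; needs a working driver):
#   nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu \
#              --format=csv,noheader,nounits -l 1 | flag_hot 85 | tee gpu_temp.log

# flag_hot [LIMIT]
# Hypothetical helper: copies stdin to stdout and appends a marker to every
# CSV line whose last field (temperature.gpu in the query above) is a number
# at or above LIMIT. Non-numeric lines, such as error messages, pass through.
flag_hot() {
    local limit="${1:-85}" line temp
    while IFS= read -r line; do
        temp="${line##*, }"    # text after the last ", " separator
        if [[ "$temp" =~ ^[0-9]+$ ]] && (( 10#$temp >= limit )); then
            printf '%s  <-- HOT\n' "$line"
        else
            printf '%s\n' "$line"
        fi
    done
}
```

Because the loop writes each line as soon as it is read, the log on disk stays current right up to the moment the card drops out.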
Other possible reasons for the GPU shutting down would be a lack of power, which I doubt since the A30 only draws 165 W, or a defective GPU.
I shut down the system overnight, then restarted it and prepared bug report 1.
I also monitored the GPU temperature using nvidia-smi -q in a loop, as you suggested. The log spans the period until the A30 shuts down. Indeed, the A30’s temperature shoots up beyond the allowed maximum, and then the card shuts down. I am sharing the temperature log that I created. I also prepared a second bug report, bug report 2, at this time.
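A log like the one described above can be summarized by pulling out the highest temperature it contains. The sketch below assumes the log holds lines of the form "GPU Current Temp : NN C", as printed by nvidia-smi -q; the helper name peak_temp is hypothetical. With two cards in the machine, readings from both end up in the same log, so it may help to record only one card with the -i option (for example -i 0, assuming the A30 is GPU 0 as the nvidia-smi -L output suggests).

```shell
#!/usr/bin/env bash
# Intended use (shown as a comment only):
#   peak_temp < smi_q3.log

# peak_temp
# Hypothetical helper: reads nvidia-smi -q style text on stdin and prints the
# highest "GPU Current Temp" value found. Returns non-zero if none is found.
peak_temp() {
    local line temp max=""
    local re='GPU Current Temp[[:space:]]*:[[:space:]]*([0-9]+) C'
    while IFS= read -r line; do
        if [[ "$line" =~ $re ]]; then
            temp="${BASH_REMATCH[1]}"
            if [[ -z "$max" ]] || (( 10#$temp > 10#$max )); then
                max="$temp"
            fi
        fi
    done
    if [[ -z "$max" ]]; then
        return 1
    fi
    printf '%s\n' "$max"
}
```

Comparing the printed peak against the slowdown and shutdown temperatures in the same log shows how close the card got to its limits before it dropped out.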
Is there a possibility that the fan may be shutting off? I see the fan speed reported as N/A for the A30 (see smi_q3.log on Google Drive). What does that mean?
I read that the sensors command will show both fan speeds and GPU temperatures.
That’s what I was trying to tell you: it doesn’t have a fan of its own, so it needs an external fan blowing through it. It’s built for special GPU servers that provide the airflow, not desktops. Check eBay for a hose to mount a fan on it, or for a water-cooling mod if one exists.