I have Ubuntu 18.04 installed in a dual-boot setup, with an NVIDIA GeForce GTX 1650 graphics card.
I installed driver 440.33.01, CUDA 10.2 and cuDNN 7.6.5, as I am running a YOLO network using ROS Melodic with the package darknet_ros (this ROS package only works with CUDA 10.2).
Before installing the NVIDIA drivers, Ubuntu used my computer's integrated Intel graphics by default and everything was fine. Now that the NVIDIA drivers are installed, I am not able to watch videos or run the YOLO network because my computer freezes. When I reboot and try nvidia-smi, the NVIDIA drivers are not recognized until I reboot once more. When I run the YOLO network and watch nvidia-smi, I see GPU-Util reach 99% and Temp reach 90C… Does anybody have any idea what is happening? Why is my NVIDIA card getting saturated and overheated by tasks such as playing a video? Maybe I don’t have the proper drivers installed?
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
Hello, thanks for the answer. Attached the log file.
nvidia-bug-report.log.gz (362.6 KB)
I can’t really see any issue, apart from one occasion where the system seems to have forgotten about the nvidia driver for unknown reasons.
Overall, your system is greatly outdated. Please fully update Ubuntu, add the graphics PPA, and then use Software & Updates to install the latest nvidia driver.
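For reference, the update and PPA part of this advice could also be done from a terminal. This is only a sketch: the `run` helper and the `DRY_RUN` variable are made up here so the commands can be previewed without changing anything, and `ppa:graphics-drivers/ppa` is an assumption about which graphics PPA is meant.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Fully update Ubuntu, add the PPA (assumed: graphics-drivers), refresh the
# package lists, then list the driver packages Ubuntu recommends for the GPU.
update_and_add_graphics_ppa() {
    run sudo apt update &&
    run sudo apt full-upgrade &&
    run sudo add-apt-repository ppa:graphics-drivers/ppa &&
    run sudo apt update &&
    run ubuntu-drivers devices
}

# Preview only:  DRY_RUN=1 update_and_add_graphics_ppa
```

The driver itself can then still be picked in Software & Updates, as suggested.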
Furthermore, please set kernel parameter
Hello, thanks for your answer. I set the kernel parameter and the problem continued, so I decided to upgrade Ubuntu. I am now running Ubuntu 20.04 and the drivers were upgraded to:
NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0
and my computer still freezes. Attached you can find the new log file.
nvidia-bug-report.log.gz (388.6 KB)
I still can’t see anything in the logs that would point to a freeze; the only thing that stands out is a gpu temperature of 59°C while not doing much. Please monitor the temperature and check your heatsink for dust.
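One way to monitor the temperature is to let nvidia-smi print a sample every second while the workload runs. A minimal sketch follows; the function names and the 85 C warning threshold are illustrative assumptions, not NVIDIA limits.

```shell
# Illustrative warning threshold in degrees Celsius (an assumption, not a spec).
WARN_TEMP_C=85

# Reads CSV rows "temperature, utilization, sm clock" on stdin and labels each sample.
flag_hot_samples() {
    while IFS=', ' read -r temp util clock; do
        if [ "$temp" -ge "$WARN_TEMP_C" ] 2>/dev/null; then
            echo "HOT  temp=${temp}C util=${util}% sm_clock=${clock}MHz"
        else
            echo "ok   temp=${temp}C util=${util}% sm_clock=${clock}MHz"
        fi
    done
}

# Samples the GPU once per second until interrupted with Ctrl+C.
watch_gpu() {
    nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,clocks.sm \
               --format=csv,noheader,nounits -l 1 | flag_hot_samples
}

# Usage: source this file, start the workload, then run:  watch_gpu
```

Redirecting the output to a file (for example `watch_gpu > gpu.log`) keeps the last samples before a freeze, provided they reach the disk in time.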
The last report log was obtained when everything was working fine. Attached you can find a report obtained while running the YOLO network and seconds before a complete freeze. I see 3 errors inside the log file: “Kernel configuration is invalid”, “You appear to be running an X server” and “Installation failed”.
Are these relevant??
nvidia-bug-report.log.gz (172 KB)
I also deleted a dkms folder from the previously installed driver. Maybe the problem has to do with broken files from the previous driver?
The upload went wrong, the archive is empty. Please upload again.
Hello, after many tests, I ended up doing a clean install of Ubuntu 20.04 on my system, installing Nvidia driver 515, CUDA 11.7, and cuDNN 8.5.0. I had to make these changes to the YOLO files in order for it to compile: Object Detection on a Webcam with Yolo - #7 by AastaLLL
After all of these changes my system works fine: no more freezes when watching videos, and no random heating or sudden fan activations. But when running the YOLO network my system still crashes after about 58 sec (an improvement, because it used to crash at 20 sec with the previous drivers). The screen freezes and, again, the only way to restart the computer is a hard reboot with the power button, which I just did. Attached you can find the bug report I ran right after rebooting the computer.
nvidia-bug-report.log.gz (329.5 KB)
You’re getting an Xid 79: the gpu shuts down/gets lost under load. On a notebook, this is a very bad sign and points to the gpu beginning to break.
Before the reinstall, you were using “performance mode”, i.e. the nvidia gpu was always used, so everything crashed the gpu. Now you’re in on-demand mode, so the intel igpu is used by default and the nvidia gpu is only used when explicitly specified or when using cuda.
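To look for this kind of event yourself, the kernel log can be searched for Xid messages. A rough sketch, assuming the driver logs lines of the form `NVRM: Xid (PCI:...): <number>, ...`; the function name `xid_codes` is made up for this example. NVIDIA's Xid documentation lists 79 as "GPU has fallen off the bus".

```shell
# Print the distinct Xid numbers found in kernel log text on stdin.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' | sort -nu
}

# Kernel log of the previous boot (the one that froze); needs persistent journald storage:
#   journalctl -k -b -1 | xid_codes
# Kernel log of the current boot:
#   sudo dmesg | xid_codes
```

After a hard freeze the last kernel messages may never reach the disk, so an empty result does not prove that no Xid occurred.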
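The active mode can be checked from a terminal. A small sketch, assuming Ubuntu's `prime-select` tool (from the nvidia-prime package) is installed and reports `nvidia`, `on-demand` or `intel`; the function name `describe_prime_mode` is made up for this example.

```shell
# Map the output of `prime-select query` to the modes described above.
describe_prime_mode() {
    case "$1" in
        nvidia)    echo "performance mode: the nvidia gpu is always used" ;;
        on-demand) echo "on-demand mode: intel igpu by default, nvidia gpu on request" ;;
        intel)     echo "power saving mode: only the intel igpu is used" ;;
        *)         echo "unknown mode: $1" ;;
    esac
}

# Only query when the tool is actually installed.
if command -v prime-select >/dev/null 2>&1; then
    describe_prime_mode "$(prime-select query)"
fi

# In on-demand mode, a single program can be sent to the nvidia gpu explicitly:
#   __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep "OpenGL renderer"
```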
Thank you very much for the answer. Is there any action I can take to prevent my gpu from breaking? Or any recommendations?
You might check if limiting clocks makes your gpu survive a bit longer, e.g.
nvidia-smi -lgc 300,1000
Ultimately, you will have to get your notebook repaired or replaced.
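For the clock limit suggested above, it may help to have the lock and its reset side by side. A minimal sketch: the `run` helper, the `DRY_RUN` variable and the function names are hypothetical, added only so the commands can be previewed first; `-lgc` and `-rgc` are the nvidia-smi options for locking and resetting GPU clocks.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Lock the GPU clock to the range <min>,<max> in MHz (needs root).
lock_gpu_clocks() {
    if [ "$1" -gt "$2" ]; then
        echo "min clock must not exceed max clock" >&2
        return 1
    fi
    run sudo nvidia-smi -lgc "$1,$2"
}

# Remove the lock and return to the default clock behaviour.
reset_gpu_clocks() {
    run sudo nvidia-smi -rgc
}

# Preview:  DRY_RUN=1 lock_gpu_clocks 300 1000
# Check:    nvidia-smi --query-gpu=clocks.sm --format=csv
```

If YOLO survives longer with the lock in place, that supports the diagnosis, but it does not repair the hardware.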