I have Ubuntu 18.04 installed in a dual-boot setup on a machine with an NVIDIA GeForce GTX 1650 graphics card.
I installed driver 440.33.01, CUDA 10.2 and cuDNN 7.6.5 because I am running a YOLO network under ROS Melodic with the darknet_ros package (this ROS package only works with CUDA 10.2).
Before I installed the NVIDIA drivers, Ubuntu used the additional Intel graphics my computer has by default, and everything was fine. Now that the NVIDIA drivers are installed, I cannot watch videos or run the YOLO network because my computer freezes. When I reboot and try nvidia-smi, the NVIDIA driver is not recognized until I reboot once more. When I run the YOLO network and watch the nvidia-smi output, I see GPU-Util reach 99% and Temp reach 90C. Does anybody have any idea what is happening? Why is my NVIDIA card getting saturated and overheating with tasks such as playing a video? Maybe I don’t have the proper drivers installed?
Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
Hello, thanks for the answer. Attached the log file.
nvidia-bug-report.log.gz (362.6 KB)
I can’t really see any issue, apart from one occasion where the system seems to have forgotten about the nvidia driver for unknown reasons.
Overall, your system is greatly outdated. Please fully update Ubuntu, add the graphics PPA, and then use Software & Updates to install the latest nvidia driver.
Furthermore, please set kernel parameter
Hello, thanks for your answer. I set the kernel parameter and the problem persisted, so I decided to upgrade Ubuntu. I am now running Ubuntu 20.04 and the drivers were upgraded to:
NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0
and my computer still freezes. Attached you can find the new log file.
nvidia-bug-report.log.gz (388.6 KB)
I still can’t see anything in the logs that would point to a freeze. The only notable thing is that the GPU temperature is at 59°C while not doing much. Please monitor the temperature and check your heatsink for dust.
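For the temperature monitoring, a minimal Python sketch that reads temperature and utilization through nvidia-smi's CSV query mode could look like the following. The function names and the 85°C warning threshold are assumptions chosen for illustration, not values taken from the logs or from NVIDIA's specifications for this card.

```python
import subprocess

# nvidia-smi query mode: one "temperature, utilization" line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn lines like '90, 99' into (temp_c, util_percent) tuples, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        temp, util = (field.strip() for field in line.split(","))
        stats.append((int(temp), int(util)))
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed readings (needs a loaded driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def too_hot(stats, limit_c=85):
    """True if any GPU is at or above limit_c (85 is an assumed, illustrative limit)."""
    return any(temp >= limit_c for temp, _ in stats)
```

Calling `read_gpu_stats()` in a loop with a short sleep while the YOLO network runs, and printing each reading, would show how quickly the temperature climbs before a freeze.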
The last report log was obtained when everything was working fine. Attached you can find a report obtained while running the YOLO network, seconds before a complete freeze. I see 3 errors inside the log file: “Kernel configuration is invalid”, “You appear to be running an X server” and “Installation failed”.
Are these relevant?
nvidia-bug-report.log.gz (172 KB)
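To show where those three messages occur in the report, a small Python sketch along these lines could list every matching line with its line number. The function names are hypothetical and the default path assumes the archive sits in the current directory. It only locates the messages and does not decide whether they are relevant.

```python
import gzip

# The three messages quoted from the report.
MESSAGES = (
    "Kernel configuration is invalid",
    "You appear to be running an X server",
    "Installation failed",
)


def find_messages(lines, messages=MESSAGES):
    """Return (line_number, message, line_text) for every line containing a message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for message in messages:
            if message in line:
                hits.append((number, message, line.rstrip("\n")))
    return hits


def scan_report(path="nvidia-bug-report.log.gz"):
    """Scan the gzipped report as text, replacing bytes that fail to decode."""
    with gzip.open(path, "rt", errors="replace") as report:
        return find_messages(report)
```

Reading the lines around each hit should help tell whether a message belongs to the current session or to an earlier installer run.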
I also deleted a dkms folder left over from the previously installed driver. Could the problem be related to broken files from the previous driver?
The upload went wrong, the archive is empty. Please upload again.