Hi, I desperately need your help. I can't solve this problem, even after trying every method I could find online.
I ran my PyTorch code for about 360 steps, and then nvidia-smi reported this error:
unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU
The nvidia-bug-report log is attached below.
However, I can still detect my first GPU:
>>lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
I also have two 2080 Ti cards without NVLink. On the same setup and motherboard, they fail with the same error: GPU is lost.
Note that I logged nvidia-smi -l 1 > watch.txt to make sure this problem is not caused by overheating.
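A plain `nvidia-smi -l 1` dump is awkward to scan by eye. One option, sketched below, is to log in CSV form with `nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw --format=csv,noheader,nounits -l 1 > watch.csv` and then pull out the peak temperature per GPU. The file name `watch.csv`, the helper name `peak_temperatures`, and the sample rows are assumptions for illustration only, not readings from this machine.

```python
import csv
import io


def peak_temperatures(csv_text):
    """Return {gpu_index: highest temperature in C} from nvidia-smi CSV output.

    Expects rows of: timestamp, index, temperature.gpu, power.draw
    (the column order requested via --query-gpu above).
    """
    peaks = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 3:
            continue  # skip blank or truncated lines
        try:
            index, temp = int(row[1]), int(row[2])
        except ValueError:
            continue  # skip non-numeric values such as [N/A]
        peaks[index] = max(temp, peaks.get(index, temp))
    return peaks


# Hypothetical sample rows, only to show the expected shape of the log.
SAMPLE = """\
2019/09/21 15:40:01.000, 0, 61, 180.25
2019/09/21 15:40:01.000, 1, 58, 175.10
2019/09/21 15:40:02.000, 0, 83, 245.90
2019/09/21 15:40:02.000, 1, 60, 178.40
"""

if __name__ == "__main__":
    print(peak_temperatures(SAMPLE))
```

If the peak stays well below the card's throttle point right up to the moment the GPU disappears, overheating becomes a less likely explanation.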
In addition, I don't think it is a PSU problem. I have a 1200 W power supply feeding the TITAN Xp and the 1080 Ti, a 7700K, 4 × 8 GB of memory, and of course the motherboard and other components such as hard disks. I use two 8-pin-to-8-pin cables per GPU, so four 8-pin-to-8-pin cables for the TITAN Xp and the 1080 Ti (previously for the 2 × 2080 Ti).
Thank you sincerely in advance for your help.
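The PSU reasoning above can be written out as a rough budget. This is only a sketch: the GPU and CPU figures are the vendors' nominal ratings (250 W for each card, 91 W for the 7700K), the 100 W allowance for everything else is an assumption, and nominal ratings describe sustained draw, not short transient peaks.

```python
# Nominal power ratings in watts (vendor TDP / board power figures).
COMPONENT_WATTS = {
    "TITAN Xp": 250,
    "GTX 1080 Ti": 250,
    "i7-7700K": 91,
    # Assumption: rough allowance for motherboard, 4 x 8 GB RAM, disks, fans.
    "rest of system": 100,
}
PSU_WATTS = 1200


def power_headroom(components, psu_watts):
    """Return (total nominal draw, remaining PSU headroom) in watts."""
    total = sum(components.values())
    return total, psu_watts - total


if __name__ == "__main__":
    total, headroom = power_headroom(COMPONENT_WATTS, PSU_WATTS)
    print(f"nominal draw: {total} W, headroom: {headroom} W")
```

On paper this leaves plenty of margin, which supports the claim, although it cannot rule out a faulty PSU, cable, or connector.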
nvidia-bug-report.log (2.93 MB)
This is from cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Sat Sep 21 15:39:13 2019
installer version: 410.48
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--ui=none
--no-questions
--accept-license
--disable-nouveau
--no-cc-version-check
--run-nvidia-xconfig
--dkms
Using built-in stream user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> Installing NVIDIA driver version 410.48.
-> Running distribution scripts
executing: '/usr/lib/nvidia/pre-install'...
-> done.
-> The distribution-provided pre-install script failed! Are you sure you want to continue? (Answer: Continue installation)
WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf. Please be sure you have rebooted your system since these files were written. If you have rebooted, then Nouveau may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration file. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory. Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written. For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk. Please reboot your system and attempt NVIDIA driver installation again. Note if you later wish to reenable Nouveau, you will need to delete these files: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
-> Installing both new and classic TLS OpenGL libraries.
-> Installing both new and classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.410.48"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.410.48"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.410.48"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.410.48"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
Checking for libglvnd installation.
Checking libGLdispatch...
Checking libGLdispatch dispatch table
Checking call through libGLdispatch
All OK
libGLdispatch is OK
Checking for libGLX
libGLX is OK
Checking for libEGL
libEGL is OK
Checking entrypoint library libOpenGL.so.0
Checking call through libGLdispatch
Checking call through library libOpenGL.so.0
All OK
Entrypoint library libOpenGL.so.0 is OK
Checking entrypoint library libGL.so.1
Checking call through libGLdispatch
Checking call through library libGL.so.1
All OK
Entrypoint library libGL.so.1 is OK
Found libglvnd libraries: libGL.so.1 libOpenGL.so.0 libEGL.so.1 libGLX.so.0 libGLdispatch.so.0
Missing libglvnd libraries:
libglvnd appears to be installed.
Will not install libglvnd libraries.
-> Skipping GLVND file: "libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "libGLdispatch.so.0"
-> Skipping GLVND file: "libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
-> Skipping GLVND file: "./32/libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "./32/libGLdispatch.so.0"
-> Skipping GLVND file: "./32/libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "./32/libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "./32/libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "./32/libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "./32/libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (410.48):
executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 410.48 -k 5.0.0-27-generic`: Error! Your kernel headers for kernel 5.0.0-27-generic cannot be found.
Please install the linux-headers-5.0.0-27-generic package,
or use the --kernelsourcedir option to tell DKMS where it's located
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
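Note that this installer run ended in failure: DKMS could not build the module because the headers for kernel 5.0.0-27-generic were missing, and the log itself names the package to install. A small sketch for pulling that package name out of an installer log follows; the helper name `missing_header_packages` is hypothetical, and the excerpt is copied from the log above.

```python
import re

# Matches the DKMS message quoted in the installer log above.
MISSING_HEADERS = re.compile(
    r"kernel headers for kernel (?P<release>\S+) cannot be found"
)


def missing_header_packages(log_text):
    """Return the linux-headers-<release> package names the log asks for."""
    releases = []
    for match in MISSING_HEADERS.finditer(log_text):
        release = match.group("release")
        if release not in releases:  # keep first-seen order, drop repeats
            releases.append(release)
    return [f"linux-headers-{r}" for r in releases]


LOG_EXCERPT = (
    "ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 410.48 "
    "-k 5.0.0-27-generic`: Error! Your kernel headers for kernel "
    "5.0.0-27-generic cannot be found.\n"
    "Please install the linux-headers-5.0.0-27-generic package,\n"
)

if __name__ == "__main__":
    print(missing_header_packages(LOG_EXCERPT))
```

Whether this failed install is related to the "GPU is lost" error is not established by the log; it only shows that this particular run did not install a kernel module.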
For further reference, the kernel log says the GPU has fallen off the bus, but I am sure the screws holding the GPUs are tight.
>> dmesg | tail -n 10
[ 1358.210427] NVRM: GPU at PCI:0000:01:00: GPU-3fb46dc2-926f-fb33-151d-8e2a2b230625
[ 1358.210429] NVRM: GPU Board Serial Number: 0321417091177
[ 1358.210430] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 1358.210432] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[ 1358.210433] NVRM: GPU is on Board 0321417091177.
[ 1358.210437] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
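To check how often this happens without reading the whole kernel log, the Xid lines can be extracted from saved `dmesg` output. The sketch below is one way to do it, assuming the log has been captured as text; the helper name `find_xid_events` is hypothetical, and the pattern is written loosely so that extra fields between the code and the message end up in the message part.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<rest>.*)"
)


def find_xid_events(dmesg_text):
    """Return a list of (pci_bus, xid_code, message) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append(
                (match.group("bus"), int(match.group("code")),
                 match.group("rest").strip())
            )
    return events


# Lines copied from the dmesg output above.
DMESG_EXCERPT = """\
[ 1358.210427] NVRM: GPU at PCI:0000:01:00: GPU-3fb46dc2-926f-fb33-151d-8e2a2b230625
[ 1358.210430] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 1358.210432] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
"""

if __name__ == "__main__":
    for bus, code, message in find_xid_events(DMESG_EXCERPT):
        print(f"Xid {code} on {bus}: {message}")
```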
Me too. I can still detect my GPU, but nvidia-smi fails:
nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
dmesg | tail -n 10
[ 350.292610] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:624)
[ 350.292653] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 351.799000] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:624)
[ 351.799022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 353.305149] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:624)
[ 353.305169] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 354.811200] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:624)
[ 354.811247] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 356.317164] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x56:624)
[ 356.317211] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
In my case I get:
(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/Research-Paper3-Code$ sudo dmesg | tail -n 10
[sudo] password for eduardoj:
[70486.997910] start_secondary+0x12a/0x180
[70486.997911] secondary_startup_64_no_verify+0xc2/0xcb
[70486.997913] </TASK>
[70486.997914] handlers:
[70486.997914] [<000000002bdf45ca>] i801_isr [i2c_i801]
[70486.997918] Disabling IRQ #16
[83763.004041] NVRM: GPU at PCI:0000:01:00: GPU-b3f1bf21-aae0-0248-c4d9-48178c67d00c
[83763.004045] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[83763.004046] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[83763.004129] NVRM: GPU 0000:01:00.0: GPU serial number is \xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff.
(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/Research-Paper3-Code$
Have you found a solution for this?
This is the third or fourth time I have had to reinstall the driver on my machine, because I have no idea how to deal with this.
PS: Thankfully, this time a reboot mitigated the issue. However, I had to hard-reset the computer: after I issued a shutdown I lost access to the system, but it never fully powered off.
Cheers,