"Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost" on TITAN Xp and GTX 1080 Ti

Hi, I desperately need your help. I have tried every fix I could find online and none of them solves this problem.

My PyTorch code runs for about 360 steps, and then nvidia-smi reports this error:
Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU

The nvidia-bug-report log is attached below.

However, I can still detect my first GPU:

>>lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

I also have two 2080 Ti cards without NVLink. With the same setup and motherboard, they fail with the same error: GPU is lost.

Note that I logged the GPUs with nvidia-smi -l 1 > watch.txt to make sure this problem is not caused by overheating.
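
For anyone who wants to check a temperature log like this more systematically, here is a minimal Python sketch. It assumes the log was captured in CSV form with `nvidia-smi --query-gpu=timestamp,index,temperature.gpu,power.draw --format=csv,noheader,nounits -l 1 > watch.csv`, which is not the plain `-l 1` command I used above. The file name `watch.csv` and the function name `peak_temperatures` are made up for illustration.

```python
import csv
from typing import Dict, Iterable


def peak_temperatures(lines: Iterable[str]) -> Dict[int, int]:
    """Return the highest temperature (deg C) seen for each GPU index.

    Assumes each line looks like: "<timestamp>, <index>, <temp>, <power>".
    Lines that do not parse (for example the "GPU is lost" message that
    nvidia-smi prints once the card disappears) are skipped.
    """
    peaks: Dict[int, int] = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 3:
            continue
        try:
            index, temp = int(row[1]), int(row[2])
        except ValueError:
            continue  # e.g. "[N/A]" in place of a number
        peaks[index] = max(temp, peaks.get(index, temp))
    return peaks


# Hypothetical sample in the assumed CSV layout (not real measurements).
sample = [
    "2019/09/21 15:39:13.000, 0, 65, 250.12",
    "2019/09/21 15:39:13.000, 1, 70, 200.00",
    "2019/09/21 15:39:14.000, 0, 82, 251.00",
    "Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost.",
]
print(peak_temperatures(sample))
```

With a real log, `peak_temperatures(open("watch.csv"))` would give the per-GPU maximum, which can then be compared against the card's throttle temperature.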

In addition, I don't think it is a PSU problem. I have a 1200 W power supply feeding the TITAN Xp and the 1080 Ti, a 7700K, 4 × 8 GB of memory, and of course the motherboard and other components such as the hard disk. I use two 8-pin-to-8-pin cables per GPU, that is, four 8-pin-to-8-pin cables for the TITAN Xp and the 1080 Ti (previously for the 2 × 2080 Ti).

I sincerely appreciate your help in advance.
nvidia-bug-report.log (2.93 MB)

This is the output of cat /var/log/nvidia-installer.log:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Sat Sep 21 15:39:13 2019                                                                                                                                                                     
installer version: 410.48                                                                                                                                                                                   
                                                                                                                                                                                                            
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin                                                                                                                                
                                                                                                                                                                                                            
nvidia-installer command line:                                                                                                                                                                              
    ./nvidia-installer                                                                                                                                                                                      
    --ui=none                                                                                                                                                                                               
    --no-questions                                                                                                                                                                                          
    --accept-license                                                                                                                                                                                        
    --disable-nouveau                                                                                                                                                                                       
    --no-cc-version-check                                                                                                                                                                                   
    --run-nvidia-xconfig                                                                                                                                                                                    
    --dkms                                                                                                                                                                                                  
                                                                                                                                                                                                            
Using built-in stream user interface                                                                                                                                                                        
-> Detected 8 CPUs online; setting concurrency level to 8.                                                                                                                                                  
-> Installing NVIDIA driver version 410.48.                                                                                                                                                                 
-> Running distribution scripts                                                                                                                                                                             
   executing: '/usr/lib/nvidia/pre-install'...                                                                                                                                                              
-> done.                                                                                                                                                                                                    
-> The distribution-provided pre-install script failed!  Are you sure you want to continue? (Answer: Continue installation)                                                                                 
WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have rebooted your system since these files were written.  If you have rebooted, then Nouveau may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory.  Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written.  For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk.  Please reboot your system and attempt NVIDIA driver installation again.  Note if you later wish to reenable Nouveau, you will need to delete these files: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)                         
-> Installing both new and classic TLS OpenGL libraries.                                                                                                                                                    
-> Installing both new and classic TLS 32bit OpenGL libraries.                                                                                                                                              
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)                                                                                                                                           
-> Will install GLVND GLX client libraries.                                                                                                                                                                 
-> Will install GLVND EGL client libraries.                                                                                                                                                                 
-> Skipping GLX non-GLVND file: "libGL.so.410.48"                                                                                                                                                           
-> Skipping GLX non-GLVND file: "libGL.so.1"                                                                                                                                                                
-> Skipping GLX non-GLVND file: "libGL.so"                                                                                                                                                                  
-> Skipping EGL non-GLVND file: "libEGL.so.410.48"                                                                                                                                                          
-> Skipping EGL non-GLVND file: "libEGL.so"                                                                                                                                                                 
-> Skipping EGL non-GLVND file: "libEGL.so.1"                                                                                                                                                               
-> Skipping GLX non-GLVND file: "./32/libGL.so.410.48"                                                                                                                                                      
-> Skipping GLX non-GLVND file: "libGL.so.1"                                                                                                                                                                
-> Skipping GLX non-GLVND file: "libGL.so"                                                                                                                                                                  
-> Skipping EGL non-GLVND file: "./32/libEGL.so.410.48"                                                                                                                                                     
-> Skipping EGL non-GLVND file: "libEGL.so"                                                                                                                                                                 
-> Skipping EGL non-GLVND file: "libEGL.so.1"                                                                                                                                                               
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh                                                                                                                  
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...                                                                                                                             
   Checking for libglvnd installation.                                                                                                                                                                      
   Checking libGLdispatch...                                                                                                                                                                                
   Checking libGLdispatch dispatch table                                                                                                                                                                    
   Checking call through libGLdispatch                                                                                                                                                                      
   All OK                                                                                                                                                                                                   
   libGLdispatch is OK                                                                                                                                                                                      
   Checking for libGLX                                                                                                                                                                                      
   libGLX is OK                                                                                                                                                                                             
   Checking for libEGL                                                                                                                                                                                      
   libEGL is OK                                                                                                                                                                                             
   Checking entrypoint library libOpenGL.so.0                                                                                                                                                               
   Checking call through libGLdispatch
   Checking call through library libOpenGL.so.0
   All OK
   Entrypoint library libOpenGL.so.0 is OK
   Checking entrypoint library libGL.so.1
   Checking call through libGLdispatch
   Checking call through library libGL.so.1
   All OK
   Entrypoint library libGL.so.1 is OK

   Found libglvnd libraries: libGL.so.1 libOpenGL.so.0 libEGL.so.1 libGLX.so.0 libGLdispatch.so.0

   Missing libglvnd libraries:

   libglvnd appears to be installed.
Will not install libglvnd libraries.
-> Skipping GLVND file: "libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "libGLdispatch.so.0"
-> Skipping GLVND file: "libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
-> Skipping GLVND file: "./32/libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "./32/libGLdispatch.so.0"
-> Skipping GLVND file: "./32/libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "./32/libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "./32/libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "./32/libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "./32/libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (410.48):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 410.48 -k 5.0.0-27-generic`: Error! Your kernel headers for kernel 5.0.0-27-generic cannot be found.
Please install the linux-headers-5.0.0-27-generic package,
or use the --kernelsourcedir option to tell DKMS where it's located
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
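
The DKMS error at the end of this log complains about missing headers for kernel 5.0.0-27-generic. A rough way to check whether headers are present for a given kernel is sketched below in Python. It assumes the headers are exposed at `/lib/modules/<release>/build`, as on Ubuntu, and it treats the presence of an `include` directory there as a stand-in for what DKMS looks for. That is my approximation, not DKMS's actual logic, and the function name `headers_present` is made up.

```python
import os
import platform


def headers_present(release: str, modules_root: str = "/lib/modules") -> bool:
    """Guess whether kernel headers for `release` are installed.

    Assumption: headers live under <modules_root>/<release>/build and
    contain an `include` directory. This approximates the DKMS check,
    it does not reproduce it.
    """
    build_dir = os.path.join(modules_root, release, "build")
    return os.path.isdir(os.path.join(build_dir, "include"))


# Check the kernel that is running right now.
running = platform.release()
print(running, "headers found:", headers_present(running))
```

If this reports no headers for the running kernel, installing the matching `linux-headers-...` package named in the error message is the obvious next step before retrying the installer.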

For your further reference: the log says the GPU has fallen off the bus, but I am sure that the screws holding the GPUs are tightened.

>> dmesg | tail -n 10

[ 1358.210427] NVRM: GPU at PCI:0000:01:00: GPU-3fb46dc2-926f-fb33-151d-8e2a2b230625
[ 1358.210429] NVRM: GPU Board Serial Number: 0321417091177
[ 1358.210430] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 1358.210432] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[ 1358.210433] NVRM: GPU is on Board 0321417091177.
[ 1358.210437] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
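
To pull out just the Xid events from a longer kernel log, something like the following Python sketch could be used. The regular expression is written against the line format shown in the dmesg output above. The names `XidEvent` and `find_xid_events` are made up for illustration, and other driver versions may format these lines differently.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches lines such as:
# [ 1358.210430] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+), (?P<msg>.*)"
)


class XidEvent(NamedTuple):
    seconds: float  # seconds since boot, from the dmesg timestamp
    pci: str        # PCI address as printed by the driver
    code: int       # Xid number
    message: str    # text after the Xid number


def find_xid_events(lines: Iterable[str]) -> List[XidEvent]:
    """Collect every NVRM Xid line found in an iterable of log lines."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            events.append(
                XidEvent(float(m["ts"]), m["pci"], int(m["code"]), m["msg"].strip())
            )
    return events


# The Xid line from the dmesg output above, plus a non-Xid line.
log = [
    "[ 1358.210429] NVRM: GPU Board Serial Number: 0321417091177",
    "[ 1358.210430] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.",
]
for event in find_xid_events(log):
    print(event)
```

Feeding it the full output of `dmesg` after each crash makes it easy to see whether it is always the same PCI address and the same Xid number, or whether the failure moves between cards.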