Broke CUDA by upgrading driver and can't recover

Hello,

I’m trying to fix a broken CUDA/NVIDIA setup on a machine with a Titan Xp, Ubuntu 18.04, CUDA 9 and NVIDIA driver version 387.26. I installed CUDA and the driver from the runfile, and I use the onboard VGA for video output.

This was working fine until nvidia-smi indicated that a driver update was available. Someone then updated the driver, which broke CUDA support and caused a login loop.

It seems that they installed the repo drivers on top of the runfile drivers. I used the runfile because I use onboard VGA and need the option to not install the nvidia opengl libs. If I let it install those, I get a login loop with Unity desktop.

To get back to a clean state for a re-install, I tried purging the nvidia* packages that came from the repo and running the runfile's uninstaller. The runfile is now giving me errors (logs below).

I can install the repo driver and I can run nvidia-smi, but I have the login loop problem because the repo method overwrites my opengl-libs.

Any advice on how to fix this would be great.

Thank you for your time.

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Wed Apr 24 17:25:09 2019
installer version: 387.26

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --ui=none
    --no-questions
    --accept-license
    --disable-nouveau
    --no-opengl-files
    --dkms

Using built-in stream user interface
-> Detected 12 CPUs online; setting concurrency level to 12.
-> License accepted by command line option.
-> Installing NVIDIA driver version 387.26.
-> There appears to already be a driver installed on your system (version: 387.26).  As part of installing this driver (version: 387.26), the existing driver will be uninstalled.  Are you sure you want to continue? (Answer: Continue installation)
-> Running distribution scripts
   executing: '/usr/lib/nvidia/pre-install'...
-> done.
-> The distribution-provided pre-install script failed!  Are you sure you want to continue? (Answer: Continue installation)
WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have rebooted your system since these files were written.  If you have rebooted, then Nouveau may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory.  Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written.  For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk.  Please reboot your system and attempt NVIDIA driver installation again.  Note if you later wish to reenable Nouveau, you will need to delete these files: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
-> Installing both new and classic TLS OpenGL libraries.
-> Installing both new and classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Uninstalling the previous installation with /usr/bin/nvidia-uninstall.
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (387.26):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 387.26 -k 4.4.0-145-generic`: 
Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area....
'make' -j12 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.4.0-145-generic modules.....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 387.26 not found
Error! Bad return status for module build on kernel: 4.4.0-145-generic (x86_64)
Consult /var/lib/dkms/nvidia/387.26/build/make.log for more information.
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

DKMS make.log for nvidia-387.26 for kernel 4.4.0-145-generic (x86_64)
Wed Apr 24 17:25:18 EDT 2019
make[1]: Entering directory '/usr/src/linux-headers-4.4.0-145-generic'
  SYMLINK /var/lib/dkms/nvidia/387.26/build/nvidia/nv-kernel.o
  SYMLINK /var/lib/dkms/nvidia/387.26/build/nvidia-modeset/nv-modeset-kernel.o
 CONFTEST: INIT_WORK
 CONFTEST: remap_pfn_range
 CONFTEST: hash__remap_4k_pfn
 CONFTEST: follow_pfn
 CONFTEST: vmap
 CONFTEST: set_pages_uc
 CONFTEST: set_memory_uc
 CONFTEST: set_memory_array_uc
 CONFTEST: change_page_attr
 CONFTEST: pci_choose_state
 CONFTEST: pci_get_class
 CONFTEST: vm_insert_page
 CONFTEST: acpi_device_id
 CONFTEST: acquire_console_sem
 CONFTEST: console_lock
 CONFTEST: kmem_cache_create
 CONFTEST: on_each_cpu
 CONFTEST: smp_call_function
 CONFTEST: acpi_evaluate_integer
 CONFTEST: ioremap_cache
 CONFTEST: ioremap_wc
 CONFTEST: acpi_walk_namespace
 CONFTEST: pci_domain_nr
 CONFTEST: sg_alloc_table
 CONFTEST: pci_dma_mapping_error
 CONFTEST: pci_get_domain_bus_and_slot
 CONFTEST: get_num_physpages
 CONFTEST: sg_init_table
 CONFTEST: efi_enabled
 CONFTEST: proc_create_data
 CONFTEST: pde_data
 CONFTEST: proc_remove
 CONFTEST: pm_vt_switch_required
 CONFTEST: xen_ioemu_inject_msi
 CONFTEST: phys_to_dma
 CONFTEST: get_dma_ops
 CONFTEST: write_cr4
 CONFTEST: of_get_property
 CONFTEST: of_find_node_by_phandle
 CONFTEST: of_node_to_nid
 CONFTEST: pnv_pci_get_npu_dev
 CONFTEST: for_each_online_node
 CONFTEST: node_end_pfn
 CONFTEST: pci_bus_address
 CONFTEST: pci_stop_and_remove_bus_device
 CONFTEST: pci_remove_bus_device
 CONFTEST: request_threaded_irq
 CONFTEST: register_cpu_notifier
 CONFTEST: cpuhp_setup_state
 CONFTEST: backlight_device_register
 CONFTEST: remap_page_range
 CONFTEST: address_space_init_once
 CONFTEST: kbasename
 CONFTEST: fatal_signal_pending
 CONFTEST: list_cut_position
 CONFTEST: vzalloc
 CONFTEST: wait_on_bit_lock_argument_count
 CONFTEST: bitmap_clear
 CONFTEST: usleep_range
 CONFTEST: radix_tree_empty
 CONFTEST: drm_dev_unref
 CONFTEST: drm_reinit_primary_mode_group
 CONFTEST: drm_atomic_set_mode_for_crtc
 CONFTEST: drm_atomic_clean_old_fb
 CONFTEST: get_user_pages_remote
 CONFTEST: drm_gem_object_lookup
 CONFTEST: drm_driver_has_gem_prime_res_obj
 CONFTEST: drm_atomic_state_free
 CONFTEST: drm_atomic_helper_disable_all
 CONFTEST: drm_atomic_helper_set_config
 CONFTEST: drm_atomic_helper_connector_dpms
 CONFTEST: is_export_symbol_gpl_of_node_to_nid
 CONFTEST: i2c_adapter
 CONFTEST: pm_message_t
 CONFTEST: irq_handler_t
 CONFTEST: acpi_device_ops
 CONFTEST: acpi_op_remove
 CONFTEST: outer_flush_all
 CONFTEST: proc_dir_entry
 CONFTEST: scatterlist
 CONFTEST: sg_table
 CONFTEST: file_operations
 CONFTEST: vm_operations_struct
 CONFTEST: atomic_long_type
 CONFTEST: pci_save_state
 CONFTEST: file_inode
 CONFTEST: task_struct
 CONFTEST: kuid_t
 CONFTEST: dma_ops
 CONFTEST: dma_map_ops
 CONFTEST: noncoherent_swiotlb_dma_ops
 CONFTEST: vm_fault_present
 CONFTEST: vm_fault_has_address
 CONFTEST: kernel_write
 CONFTEST: strnstr
 CONFTEST: iterate_dir
 CONFTEST: kstrtoull
 CONFTEST: backlight_properties_type
 CONFTEST: fault_flags
 CONFTEST: atomic64_type
 CONFTEST: address_space
 CONFTEST: backing_dev_info
 CONFTEST: mm_context_t
 CONFTEST: pnv_npu2_init_context
 CONFTEST: vm_ops_fault_removed_vma_arg
 CONFTEST: drm_bus_present
 CONFTEST: drm_bus_has_bus_type
 CONFTEST: drm_bus_has_get_irq
 CONFTEST: drm_bus_has_get_name
 CONFTEST: drm_driver_has_legacy_dev_list
 CONFTEST: drm_driver_has_set_busid
 CONFTEST: drm_crtc_state_has_connectors_changed
 CONFTEST: drm_init_function_args
 CONFTEST: drm_mode_connector_list_update_has_merge_type_bits_arg
 CONFTEST: drm_helper_mode_fill_fb_struct
 CONFTEST: drm_master_drop_has_from_release_arg
 CONFTEST: drm_mode_config_funcs_has_atomic_state_alloc
 CONFTEST: drm_driver_unload_has_int_return_type
 CONFTEST: kref_has_refcount_of_type_refcount_t
 CONFTEST: drm_crtc_helper_funcs_has_atomic_enable
 CONFTEST: dom0_kernel_present
 CONFTEST: nvidia_vgpu_kvm_build
 CONFTEST: nvidia_grid_build
 CONFTEST: drm_available
 CONFTEST: drm_atomic_available
 CONFTEST: drm_atomic_modeset_nonblocking_commit_available
 CONFTEST: is_export_symbol_gpl_refcount_inc
 CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-gpu-numa.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-acpi.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-chrdev.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-cray.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-dma.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-gvi.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-i2c.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-mempool.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-mmap.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-p2p.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-pat.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-procfs.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-usermap.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-vm.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-vtophys.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/os-interface.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/os-pci.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/os-usermap.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/os-registry.o
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-modeset-interface.o
In file included from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-linux.h:21:0,
                 from /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:15:
/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c: In function ‘os_lock_user_pages’:
/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:120:48: warning: passing argument 6 of ‘get_user_pages’ makes pointer from integer without a cast [-Wint-conversion]
                             page_count, write, force, user_pages, NULL);
                                                ^
/var/lib/dkms/nvidia/387.26/build/common/inc/nv-mm.h:106:70: note: in definition of macro ‘NV_GET_USER_PAGES’
         get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
                                                                      ^
In file included from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-pgprot.h:17:0,
                 from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-linux.h:20,
                 from /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:15:
include/linux/mm.h:1222:6: note: expected ‘struct page **’ but argument is of type ‘NvBool {aka unsigned char}’
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
      ^
In file included from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-linux.h:21:0,
                 from /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:15:
/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:120:55: warning: passing argument 7 of ‘get_user_pages’ from incompatible pointer type [-Wincompatible-pointer-types]
                             page_count, write, force, user_pages, NULL);
                                                       ^
/var/lib/dkms/nvidia/387.26/build/common/inc/nv-mm.h:106:77: note: in definition of macro ‘NV_GET_USER_PAGES’
         get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
                                                                             ^
In file included from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-pgprot.h:17:0,
                 from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-linux.h:20,
                 from /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:15:
include/linux/mm.h:1222:6: note: expected ‘struct vm_area_struct **’ but argument is of type ‘struct page **’
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
      ^
In file included from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-linux.h:21:0,
                 from /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:15:
/var/lib/dkms/nvidia/387.26/build/common/inc/nv-mm.h:106:9: error: too many arguments to function ‘get_user_pages’
         get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
         ^
/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:119:11: note: in expansion of macro ‘NV_GET_USER_PAGES’
     ret = NV_GET_USER_PAGES((unsigned long)address,
           ^
In file included from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-pgprot.h:17:0,
                 from /var/lib/dkms/nvidia/387.26/build/common/inc/nv-linux.h:20,
                 from /var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:15:
include/linux/mm.h:1222:6: note: declared here
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
      ^
  CC [M]  /var/lib/dkms/nvidia/387.26/build/nvidia/nv-pci-table.o
scripts/Makefile.build:285: recipe for target '/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.o' failed
make[2]: *** [/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.o] Error 1
make[2]: *** Waiting for unfinished jobs....
Makefile:1454: recipe for target '_module_/var/lib/dkms/nvidia/387.26/build' failed
make[1]: *** [_module_/var/lib/dkms/nvidia/387.26/build] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.4.0-145-generic'
Makefile:84: recipe for target 'modules' failed
make: *** [modules] Error 2

Module                  Size  Used by
ipt_MASQUERADE         16384  1
nf_nat_masquerade_ipv4    16384  1 ipt_MASQUERADE
nf_conntrack_netlink    40960  0
nfnetlink              16384  2 nf_conntrack_netlink
xfrm_user              32768  1
xfrm_algo              16384  1 xfrm_user
iptable_nat            16384  1
nf_conntrack_ipv4      20480  2
nf_defrag_ipv4         16384  1 nf_conntrack_ipv4
nf_nat_ipv4            16384  1 iptable_nat
xt_addrtype            16384  2
xt_conntrack           16384  1
nf_nat                 28672  2 nf_nat_ipv4,nf_nat_masquerade_ipv4
nf_conntrack          106496  6 nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_netlink,nf_conntrack_ipv4
br_netfilter           24576  0
bridge                122880  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 stp,bridge
ip6table_filter        16384  0
ip6_tables             28672  1 ip6table_filter
iptable_filter         16384  1
ip_tables              24576  2 iptable_filter,iptable_nat
x_tables               36864  7 ip6table_filter,ip_tables,ipt_MASQUERADE,xt_conntrack,iptable_filter,ip6_tables,xt_addrtype
aufs                  217088  0
overlay                49152  0
snd_hda_codec_hdmi     53248  2
ipmi_ssif              24576  0
wl                   6447104  0
snd_hda_codec_generic    77824  1
i2c_designware_platform    16384  0
i2c_designware_core    20480  1 i2c_designware_platform
i915_bpo             1343488  0
cfg80211              565248  1 wl
intel_ips              20480  1 i915_bpo
video                  40960  1 i915_bpo
skx_edac               16384  0
edac_core              53248  1 skx_edac
x86_pkg_temp_thermal    16384  0
intel_powerclamp       16384  0
coretemp               16384  0
snd_hda_intel          40960  7
kvm_intel             176128  0
snd_hda_codec         135168  3 snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_intel
kvm                   552960  1 kvm_intel
snd_hda_core           77824  4 snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel
snd_hwdep              16384  1 snd_hda_codec
irqbypass              16384  1 kvm
snd_pcm               106496  4 snd_hda_codec_hdmi,snd_hda_codec,snd_hda_intel,snd_hda_core
snd_seq_midi           16384  0
snd_seq_midi_event     16384  1 snd_seq_midi
serio_raw              16384  0
snd_rawmidi            32768  1 snd_seq_midi
snd_seq                69632  2 snd_seq_midi_event,snd_seq_midi
input_leds             16384  0
snd_seq_device         16384  3 snd_seq,snd_rawmidi,snd_seq_midi
joydev                 20480  0
snd_timer              32768  2 snd_pcm,snd_seq
idma64                 20480  0
virt_dma               16384  1 idma64
snd                    81920  24 snd_hwdep,snd_timer,snd_hda_codec_hdmi,snd_pcm,snd_seq,snd_rawmidi,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel,snd_seq_device
soundcore              16384  1 snd
mei_me                 36864  0
shpchp                 36864  0
mei                    98304  1 mei_me
intel_lpss_pci         16384  0
ioatdma                53248  0
intel_lpss             16384  1 intel_lpss_pci
wmi                    20480  0
ipmi_si                57344  0
8250_fintek            16384  0
ipmi_msghandler        49152  2 ipmi_ssif,ipmi_si
mac_hid                16384  0
ib_iser                49152  0
rdma_cm                49152  1 ib_iser
iw_cm                  45056  1 rdma_cm
ib_cm                  49152  1 rdma_cm
ib_sa                  36864  2 rdma_cm,ib_cm
ib_mad                 49152  2 ib_cm,ib_sa
ib_core               106496  6 rdma_cm,ib_cm,ib_sa,iw_cm,ib_mad,ib_iser
ib_addr                20480  2 rdma_cm,ib_core
iscsi_tcp              20480  0
libiscsi_tcp           24576  1 iscsi_tcp
libiscsi               53248  3 libiscsi_tcp,iscsi_tcp,ib_iser
scsi_transport_iscsi   102400  4 iscsi_tcp,ib_iser,libiscsi
parport_pc             32768  0
ppdev                  20480  0
lp                     20480  0
parport                49152  3 lp,ppdev,parport_pc
autofs4                40960  2
btrfs                 999424  0
raid10                 49152  0
raid1                  40960  0
raid0                  20480  0
multipath              16384  0
linear                 16384  0
raid456               106496  1
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,raid456,async_xor,async_memcpy,async_raid6_recov
xor                    24576  2 btrfs,async_xor
raid6_pq              102400  4 async_pq,raid456,btrfs,async_raid6_recov
libcrc32c              16384  1 raid456
hid_generic            16384  0
usbhid                 53248  0
hid                   118784  2 hid_generic,usbhid
crct10dif_pclmul       16384  0
crc32_pclmul           16384  0
ghash_clmulni_intel    16384  0
aesni_intel           167936  0
aes_x86_64             20480  1 aesni_intel
lrw                    16384  1 aesni_intel
gf128mul               16384  1 lrw
glue_helper            16384  1 aesni_intel
ablk_helper            16384  1 aesni_intel
cryptd                 20480  3 ghash_clmulni_intel,aesni_intel,ablk_helper
ast                    57344  2
ttm                    98304  1 ast
igb                   200704  0
drm_kms_helper        155648  2 ast,i915_bpo
psmouse               131072  0
syscopyarea            16384  1 drm_kms_helper
dca                    16384  2 igb,ioatdma
sysfillrect            16384  1 drm_kms_helper
ptp                    20480  1 igb
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
pps_core               20480  1 ptp
drm                   364544  6 ast,ttm,i915_bpo,drm_kms_helper
i2c_algo_bit           16384  3 ast,igb,i915_bpo
ahci                   40960  6
libahci                32768  1 ahci
fjes                   28672  0

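The 387.26 driver is too old for the kernel in use (4.4.0-145-generic). The DKMS build stops in os-mlock.c with "too many arguments to function ‘get_user_pages’", meaning the driver passes more arguments than this kernel's get_user_pages accepts. Use v418.56: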
https://http.download.nvidia.com/XFree86/Linux-x86_64/
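
If you want to confirm this kind of diagnosis yourself, the useful line in a long DKMS make.log is usually the first compiler "error:" line, not the warnings around it. Below is a minimal Python sketch that pulls it out. The function name `first_compile_error` and the regular expression are my own illustration, not part of any NVIDIA or DKMS tooling, and the sample text is just three lines copied from the log above.

```python
import re
from typing import Optional

# Matches GCC diagnostics of the form "path/file.c:LINE:COL: error: message".
# Warnings and make's own "Error 1" lines are deliberately not matched.
GCC_ERROR = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$"
)


def first_compile_error(log_text: str) -> Optional[dict]:
    """Return the first GCC 'error:' diagnostic in a build log, or None."""
    for raw in log_text.splitlines():
        match = GCC_ERROR.match(raw.strip())
        if match:
            return {
                "file": match.group("file"),
                "line": int(match.group("line")),
                "msg": match.group("msg"),
            }
    return None


# Three lines taken from the make.log posted above.
SAMPLE = """\
/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.c:120:48: warning: passing argument 6 of ‘get_user_pages’ makes pointer from integer without a cast [-Wint-conversion]
/var/lib/dkms/nvidia/387.26/build/common/inc/nv-mm.h:106:9: error: too many arguments to function ‘get_user_pages’
make[2]: *** [/var/lib/dkms/nvidia/387.26/build/nvidia/os-mlock.o] Error 1
"""

if __name__ == "__main__":
    print(first_compile_error(SAMPLE))
```

To run it against the real file, read /var/lib/dkms/nvidia/387.26/build/make.log (the path the installer log points to) and pass its contents to `first_compile_error`.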