Ubuntu box crashing after re-install

I’ve got an Ubuntu box that has been happily running Deep Learning programs for the last 8 months. Yesterday, I rebuilt it (Ubuntu 18.04, a GTX 650 for the monitor, a GTX 1080 Ti for the Deep Learning, driver version 410.48, with the workloads running inside an NVIDIA Docker container), following the NVIDIA Technical Blog article “Enabling GPUs in the Container Runtime Ecosystem”.

It all seemed to go well, and I can successfully run some modest Deep Learning programs using the PyTorch or Keras/Tensorflow frameworks from within Docker. The new configuration seems to run much faster than before: nvidia-smi shows it using 50% of the processing power and about 8GB of memory, and drawing up to 175W. Previously, I could not get it to use more than 30% of the card.

However, I periodically lose the monitor (I can still ssh into the box). There is no warning and nothing is displayed on the screen; it just goes blank. Since it all worked before, this suggests I’ve installed something badly.

How can I start narrowing things down? Is there a logical fault-diagnosis process I should follow? It could be overheating or an overloaded power supply, but I don’t think so, as the fault occurs after 30 seconds and the power supply is rated at 1300W.

I’ve looked at /var/log/kern.log, and the following appears to have been logged at the time of the first crash (see the log capture below):

[Incidentally, “The NVIDIA probe routine was not called for 2 device(s)” fills up the log file with thousands of entries for several hours before the problem occurs, which may be a big clue ;-)
I am not running bumblebee, which is the only other context in which I’ve seen this raised as a support question.]
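
To put a number on "thousands of entries", here is a minimal Python sketch that counts how often that message appears per hour. It assumes the syslog line format shown in the capture below; the function name count_probe_failures is my own and purely illustrative.

```python
import re
from collections import Counter

# Syslog prefix as it appears in kern.log, e.g.
# "Jan 8 19:47:32 thinkstation kernel: [ 6662.213365] NVRM: ..."
LINE_RE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:")

# The first line of the repeated NVRM message block.
MARKER = "NVRM: The NVIDIA probe routine was not called"


def count_probe_failures(lines):
    """Return a Counter mapping 'Mon D HH:00' to the number of probe-failure lines."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            month, day, hour = match.groups()
            counts[f"{month} {day} {hour}:00"] += 1
    return counts
```

Passing it an open file, for example count_probe_failures(open("/var/log/kern.log", errors="replace")), and printing the result should show when the messages started and whether their rate changes before the screen goes blank.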

Perhaps I should just start the installation again, and make sure I follow the steps more carefully. Does anyone have advice on what I need to sort out?

Regards,

-Steve Simpson

Jan 8 19:47:32 thinkstation kernel: [ 6662.213365] NVRM: The NVIDIA probe routine was not called for 2 device(s).
Jan 8 19:47:32 thinkstation kernel: [ 6662.213366] NVRM: This can occur when a driver such as:
Jan 8 19:47:32 thinkstation kernel: [ 6662.213366] NVRM: nouveau, rivafb, nvidiafb or rivatv
Jan 8 19:47:32 thinkstation kernel: [ 6662.213366] NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: Try unloading the conflicting kernel module (and/or
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: reconfigure your kernel without the conflicting
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: driver(s)), then try loading the NVIDIA kernel module
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: again.
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: No NVIDIA graphics adapter probed!
Jan 8 19:47:32 thinkstation kernel: [ 6662.213487] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
Jan 8 19:47:32 thinkstation kernel: [ 6662.309610] nvidia_drm: Unknown symbol nvKmsKapiGetFunctionsTable (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336205] nvidia_uvm: Unknown symbol nvUvmInterfaceDisableAccessCntr (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336242] nvidia_uvm: Unknown symbol nvUvmInterfaceChannelDestroy (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336279] nvidia_uvm: Unknown symbol nvUvmInterfaceQueryCaps (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336330] nvidia_uvm: Unknown symbol nvUvmInterfaceUnsetPageDirectory (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336364] nvidia_uvm: Unknown symbol nvUvmInterfaceInitAccessCntrInfo (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336397] nvidia_uvm: Unknown symbol nv_kthread_q_flush (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336430] nvidia_uvm: Unknown symbol nvUvmInterfaceReleaseChannel (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336464] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryAllocSys (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336503] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryCpuMap (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336559] nvidia_uvm: Unknown symbol nvUvmInterfaceRetainChannelResources (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336592] nvidia_uvm: Unknown symbol nvUvmInterfacePmaFreePages (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336632] nvidia_uvm: Unknown symbol nvUvmInterfaceSetPageDirectory (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336665] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryCpuUnMap (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336699] nvidia_uvm: Unknown symbol nv_kthread_q_schedule_q_item (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336734] nvidia_uvm: Unknown symbol nvUvmInterfaceOwnPageFaultIntr (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336782] nvidia_uvm: Unknown symbol nvUvmInterfaceDupAddressSpace (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336820] nvidia_uvm: Unknown symbol nvUvmInterfaceGetExternalAllocPtes (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336861] nvidia_uvm: Unknown symbol nvUvmInterfaceRegisterGpu (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336896] nvidia_uvm: Unknown symbol nvUvmInterfaceP2pObjectDestroy (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336933] nvidia_uvm: Unknown symbol nvUvmInterfaceGetNonReplayableFaults (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336966] nvidia_uvm: Unknown symbol nvUvmInterfaceGetFbInfo (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337016] nvidia_uvm: Unknown symbol nvUvmInterfaceRetainChannel (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337063] nvidia_uvm: Unknown symbol nvUvmInterfaceHasPendingNonReplayableFaults (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337098] nvidia_uvm: Unknown symbol nvUvmInterfaceDestroyAccessCntrInfo (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337150] nvidia_uvm: Unknown symbol nvUvmInterfaceStopChannel (err 0)

etc…
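
Since the NVRM message in the capture above suggests that another driver (such as nouveau) may have taken ownership of the devices, one possible first check is to see which kernel driver is actually bound to each NVIDIA PCI device. The following is a minimal sketch, assuming the standard Linux sysfs layout under /sys/bus/pci/devices; the function name nvidia_device_drivers is hypothetical.

```python
import os

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_device_drivers(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA PCI address to its bound kernel driver name, or None."""
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        try:
            with open(os.path.join(dev, "vendor")) as fh:
                vendor = fh.read().strip()
        except OSError:
            continue  # not a readable PCI device entry
        if vendor != NVIDIA_VENDOR_ID:
            continue
        # "driver" is a symlink to the bound driver's directory;
        # it is absent when no driver has claimed the device.
        link = os.path.join(dev, "driver")
        if os.path.islink(link):
            result[addr] = os.path.basename(os.readlink(link))
        else:
            result[addr] = None
    return result
```

Calling nvidia_device_drivers() on the affected machine lists every PCI function with NVIDIA's vendor ID. Note that the HDMI audio function of a graphics card also carries this vendor ID and is normally bound to a sound driver, so only the display functions are relevant here. An entry showing nouveau, or None, for either card would be consistent with the log message.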