Ubuntu box crashing after re-install

I’ve got an Ubuntu box that has happily been running deep learning programs for the last 8 months. Yesterday I rebuilt it following the documentation at https://devblogs.nvidia.com/gpu-containers-runtime/. The new setup is Ubuntu 18.04 with a GTX 650 driving the monitor, a GTX 1080 Ti for the deep learning, and driver version 410.48, with the deep learning workloads running inside an NVIDIA Docker container.

It all seemed to go well, and I can run some modest deep learning programs successfully using the PyTorch or Keras/TensorFlow frameworks from within Docker. The new configuration seems to run much faster than before: nvidia-smi shows the card using 50% of its processing power and about 8 GB of memory, and drawing up to 175 W. Previously, I could not get it to use more than 30% of the card.

However, I periodically lose the monitor, although I can still ssh into the box. There is no warning; the screen simply goes blank. Since it all worked before the rebuild, this suggests I’ve installed something badly.

How can I start narrowing things down? Is there a logical fault-diagnosis process I should follow? (It could be overheating or an overloaded power supply, but I don’t think so, as the fault occurs after 30 seconds and the power supply is rated at 1300 W.)
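
One possible first check, given that the kern.log capture later in this post names nouveau, rivafb, nvidiafb and rivatv as drivers that may have taken ownership of the cards, is to see which of them are actually loaded. The following is only a minimal sketch: it assumes the usual /proc/modules layout (module name as the first whitespace-separated field), and the function names are made up for illustration.

```python
from pathlib import Path

# Drivers the NVRM message names as possible owners of the NVIDIA devices.
CONFLICTING = {"nouveau", "rivafb", "nvidiafb", "rivatv"}


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # assumption: first field is the module name
    return names


def report(proc_modules_text):
    """Return (conflicting drivers loaded, NVIDIA modules loaded), both sorted."""
    loaded = loaded_modules(proc_modules_text)
    conflicting = sorted(loaded & CONFLICTING)
    nvidia = sorted(m for m in loaded if m == "nvidia" or m.startswith("nvidia_"))
    return conflicting, nvidia


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        conflicting, nvidia = report(path.read_text())
        print("conflicting drivers loaded:", conflicting or "none")
        print("NVIDIA modules loaded:", nvidia or "none")
    else:
        print("/proc/modules not found (not a Linux host?)")
```

If a conflicting driver shows up alongside (or instead of) the nvidia modules, that would be consistent with the NVRM message; if none does, the cause probably lies elsewhere.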

I’ve looked at /var/log/kern.log, and the entries captured at the end of this post appear to have been logged at the time of the first crash.

[Incidentally, “The NVIDIA probe routine was not called for 2 device(s)” fills up the log file with thousands of entries for several hours before the problem occurs, which may be a big clue ;-)
I am not running Bumblebee, which is the only other context in which I’ve seen this raised as a support question.]
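
To put numbers on how often that message repeats and when it starts, the entries could be counted per hour. This is a rough sketch only: it assumes every line carries the syslog prefix seen in the capture at the end of this post ("Jan 8 19:47:32 thinkstation kernel: ..."), and the helper name is hypothetical.

```python
import re
from collections import Counter

# Assumed syslog prefix: "<Mon> <day> <HH>:<MM>:<SS> <host> kernel:"
LINE_RE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:")
MARKER = "The NVIDIA probe routine was not called"
KERN_LOG = "/var/log/kern.log"


def probe_failures_per_hour(lines):
    """Count lines containing MARKER, keyed by (month, day, hour)."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        m = LINE_RE.match(line)
        if m:
            counts[(m.group(1), int(m.group(2)), int(m.group(3)))] += 1
    return counts


if __name__ == "__main__":
    try:
        with open(KERN_LOG, errors="replace") as fh:
            counts = probe_failures_per_hour(fh)
    except OSError as exc:
        print(f"cannot read {KERN_LOG}: {exc}")
    else:
        # Counter keeps insertion order, so rows appear in log order.
        for (month, day, hour), n in counts.items():
            print(f"{month} {day:2d} {hour:02d}:00  {n}")
```

A sudden jump in the hourly count, or a first occurrence that lines up with a particular job or container starting, would help pin down what keeps trying to reload the nvidia module.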

Perhaps I should just start the installation again, and make sure I follow the steps more carefully. Does anyone have advice on what I need to sort out?

Regards,

-Steve Simpson

Jan 8 19:47:32 thinkstation kernel: [ 6662.213365] NVRM: The NVIDIA probe routine was not called for 2 device(s).
Jan 8 19:47:32 thinkstation kernel: [ 6662.213366] NVRM: This can occur when a driver such as:
Jan 8 19:47:32 thinkstation kernel: [ 6662.213366] NVRM: nouveau, rivafb, nvidiafb or rivatv
Jan 8 19:47:32 thinkstation kernel: [ 6662.213366] NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: Try unloading the conflicting kernel module (and/or
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: reconfigure your kernel without the conflicting
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: driver(s)), then try loading the NVIDIA kernel module
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: again.
Jan 8 19:47:32 thinkstation kernel: [ 6662.213367] NVRM: No NVIDIA graphics adapter probed!
Jan 8 19:47:32 thinkstation kernel: [ 6662.213487] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
Jan 8 19:47:32 thinkstation kernel: [ 6662.309610] nvidia_drm: Unknown symbol nvKmsKapiGetFunctionsTable (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336205] nvidia_uvm: Unknown symbol nvUvmInterfaceDisableAccessCntr (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336242] nvidia_uvm: Unknown symbol nvUvmInterfaceChannelDestroy (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336279] nvidia_uvm: Unknown symbol nvUvmInterfaceQueryCaps (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336330] nvidia_uvm: Unknown symbol nvUvmInterfaceUnsetPageDirectory (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336364] nvidia_uvm: Unknown symbol nvUvmInterfaceInitAccessCntrInfo (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336397] nvidia_uvm: Unknown symbol nv_kthread_q_flush (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336430] nvidia_uvm: Unknown symbol nvUvmInterfaceReleaseChannel (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336464] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryAllocSys (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336503] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryCpuMap (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336559] nvidia_uvm: Unknown symbol nvUvmInterfaceRetainChannelResources (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336592] nvidia_uvm: Unknown symbol nvUvmInterfacePmaFreePages (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336632] nvidia_uvm: Unknown symbol nvUvmInterfaceSetPageDirectory (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336665] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryCpuUnMap (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336699] nvidia_uvm: Unknown symbol nv_kthread_q_schedule_q_item (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336734] nvidia_uvm: Unknown symbol nvUvmInterfaceOwnPageFaultIntr (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336782] nvidia_uvm: Unknown symbol nvUvmInterfaceDupAddressSpace (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336820] nvidia_uvm: Unknown symbol nvUvmInterfaceGetExternalAllocPtes (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336861] nvidia_uvm: Unknown symbol nvUvmInterfaceRegisterGpu (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336896] nvidia_uvm: Unknown symbol nvUvmInterfaceP2pObjectDestroy (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336933] nvidia_uvm: Unknown symbol nvUvmInterfaceGetNonReplayableFaults (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.336966] nvidia_uvm: Unknown symbol nvUvmInterfaceGetFbInfo (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337016] nvidia_uvm: Unknown symbol nvUvmInterfaceRetainChannel (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337063] nvidia_uvm: Unknown symbol nvUvmInterfaceHasPendingNonReplayableFaults (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337098] nvidia_uvm: Unknown symbol nvUvmInterfaceDestroyAccessCntrInfo (err 0)
Jan 8 19:47:32 thinkstation kernel: [ 6662.337150] nvidia_uvm: Unknown symbol nvUvmInterfaceStopChannel (err 0)

etc…