Since the apt packages are not tied to the versions that are actually compatible with the installed kernel module package, I’m trying to write a script that tells you how to fix a ‘Failed to initialize NVML: Driver/library version mismatch’ problem. Unfortunately, I haven’t found the relevant source or any other information on how a “mismatch” is determined. E.g. the nvidia kernel module has version x.y.z. Which version of libnvidia-compute is compatible? x.a.b? x.y.c? Only x.y.z?
The versions of the kernel module and libcuda have to match exactly.
E.g. nvidia.ko 430.14 needs exactly libcuda.so.430.14.
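To make the exact-match rule concrete, here is a minimal sketch of such a check. It assumes the loaded module reports its version on the "NVRM version:" line of /proc/driver/nvidia/version and that the library carries its version as the file name suffix (libcuda.so.430.14). The function names and the library directory are my own assumptions for illustration, not anything NVML itself does.

```python
import re
from pathlib import Path

PROC_VERSION = Path("/proc/driver/nvidia/version")
# Assumption: Debian/Ubuntu multiarch library directory; adjust for other distros.
LIB_DIR = Path("/usr/lib/x86_64-linux-gnu")


def module_version_from_proc(text):
    """Return the driver version from the 'NVRM version:' line, or None."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # First dotted number on the line, e.g. 430.14 or 418.87.01
            m = re.search(r"\b\d+(?:\.\d+)+\b", line)
            return m.group(0) if m else None
    return None


def library_version_from_name(name):
    """'libcuda.so.430.14' -> '430.14'; SONAME links like 'libcuda.so.1' -> None."""
    m = re.search(r"\.so\.(\d+(?:\.\d+)+)$", name)
    return m.group(1) if m else None


def versions_match(module_version, library_version):
    """Exact string equality; an unknown version never counts as a match."""
    return module_version is not None and module_version == library_version


def loaded_module_version(path=PROC_VERSION):
    """Version of the currently loaded module, or None if it is not loaded."""
    try:
        return module_version_from_proc(path.read_text())
    except OSError:
        return None


def installed_library_versions(lib_dir=LIB_DIR, stem="libcuda.so"):
    """Set of versions found in file names such as libcuda.so.<version>."""
    versions = set()
    for path in lib_dir.glob(stem + ".*"):
        version = library_version_from_name(path.name)
        if version:
            versions.add(version)
    return versions
```

On a machine with the driver loaded, calling versions_match(loaded_module_version(), v) for each v in installed_library_versions() should show whether the two sides agree.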
There are only two sources of mismatch I can think of, apart from hard disk failures and the like:
- The driver was updated while the old kernel module was still loaded.
  - Fix: reboot
- A different driver version was installed over another one.
  - Fix: uninstall the driver and CUDA, then reinstall the driver and cuda-toolkit-10-1 (for CUDA 10.1)
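The two cases can be told apart by comparing three versions: the module that is currently loaded, the module file on disk, and the user-space library. A rough sketch, assuming that `modinfo -F version nvidia` reports the on-disk module and that the module is named `nvidia`; the function names and return strings are hypothetical:

```python
import subprocess


def on_disk_module_version(module="nvidia"):
    """Version of the module file for the running kernel, or None if unknown.

    Assumption: the module is called 'nvidia' and modinfo is available.
    """
    try:
        result = subprocess.run(
            ["modinfo", "-F", "version", module],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


def diagnose(loaded, on_disk, library):
    """Map the three versions to one of the fixes named above.

    loaded:  version of the running kernel module (None if not loaded)
    on_disk: version of the module that would load on the next boot
    library: version of the installed user-space library (e.g. libcuda)
    """
    if loaded is not None and loaded == library:
        return "ok"
    if on_disk is not None and on_disk == library:
        # Installed files agree with each other; only the running module is stale.
        return "reboot"
    # Module file and library disagree on disk; a reboot alone cannot help.
    return "reinstall"
```

For example, diagnose("418.67", "430.14", "430.14") gives "reboot" (the update case), while diagnose("418.67", "418.67", "430.14") gives "reinstall" (the mixed-install case).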
OK, thank you very much!
Wrt. sources of mismatch: there are many more. E.g. once you have installed nvidia-dkms-418 and later install nvidia-utils, apt-get installs the latest release of nvidia-utils, which does not necessarily match the version of the installed nvidia-dkms-418.
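A mixed install like this can be spotted by grouping the installed driver packages by their upstream version. The following is only a sketch: it assumes Ubuntu-style package names ending in a branch number (nvidia-dkms-418, libnvidia-compute-430, ...), the function names are made up, and the versions in the usage note are illustrative sample input, not real output.

```python
import re
import subprocess

# Assumption: driver packages look like '[lib]nvidia-<something>-<branch>'.
DRIVER_PKG_RE = re.compile(r"^(?:lib)?nvidia-.+-\d+$")


def upstream_version(deb_version):
    """'1:430.14-0ubuntu1' -> '430.14' (drop the epoch and the Debian revision)."""
    version = deb_version.split(":", 1)[-1]
    return version.rsplit("-", 1)[0] if "-" in version else version


def group_driver_packages(dpkg_output):
    """Map upstream version -> sorted package names for fully installed packages.

    Expects lines of the form '<status> <package> <version>'.
    """
    groups = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] != "ii":
            continue  # skip removed or half-installed packages
        _, package, version = fields
        if DRIVER_PKG_RE.match(package):
            groups.setdefault(upstream_version(version), []).append(package)
    return {version: sorted(names) for version, names in groups.items()}


def query_dpkg():
    """Run dpkg-query and return its output in the format expected above."""
    fmt = "${db:Status-Abbrev} ${Package} ${Version}\n"
    result = subprocess.run(
        ["dpkg-query", "-W", "-f", fmt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

If group_driver_packages(query_dpkg()) returns more than one key, packages from different driver releases are installed side by side, e.g. nvidia-dkms-418 at 418.67 next to nvidia-utils-430 at 430.14.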
Wrt. the fix: uninstalling and reinstalling the monster meta-packages may fix the problem, yep. But that’s a big hammer, too coarse-grained and, IIUIC, not needed on bare metal, a.k.a. global zones, which just drive the non-global zones, alias containers (e.g. nvidia-docker-zones), where the actual work is done.
BTW: Finished the script so far: http://iks.cs.ovgu.de/~elkner/nv/nvidia-fixPackages.sh ;-)