Nvml error: driver/library version mismatch

Hi,
I am getting the following error while running Docker for cuOpt:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.


Please guide me through this

hello @nida.bijapure

This can happen when the kernel is using a different version of the nvidia driver than the client program. You can try the command below; there may be an error message in its output:

dmesg | grep NVRM

Here are a couple of things to try:

  1. Do you get the same error from “nvidia-smi” (assuming it is installed)?
     nvidia-smi
  2. Have you tried simply rebooting? If the driver version has been updated, a reboot is necessary.

Let’s start there.

Hi,
I didn’t understand what exactly you are trying to say. Can you please elaborate?

Hi @nida.bijapure

These steps may help diagnose the issue. The error can happen when there is a mismatch between a client program or packages on the system and the version of the nvidia driver that is being used by the kernel.

The following command on a Linux system might give us extra information from the system logs:

$ dmesg | grep NVRM

If you have the nvidia-smi executable installed on your system, that might give us a clue too (if nvidia-smi returns a result, but docker has errors, then we know it’s something specific to the docker setup). Run it like this:

$ nvidia-smi

Lastly, if there has been an nvidia driver update, but the system has not been rebooted since the update, rebooting the machine may clear the issue.
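To make those checks concrete, here is a rough shell sketch that reads the kernel module version from /proc/driver/nvidia/version and asks nvidia-smi for the userspace driver version. This is just a sketch assuming a standard Linux driver layout; the `nvrm_version` helper name is my own.

```shell
#!/bin/sh
# Sketch: compare the kernel-side driver version with the userspace
# version that nvidia-smi reports. A mismatch reproduces the NVML error.

# Extract a version string like "535.104.05" from a driver text line.
nvrm_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

if [ -r /proc/driver/nvidia/version ]; then
    kernel_line=$(head -n1 /proc/driver/nvidia/version)
    echo "kernel module: $(nvrm_version "$kernel_line")"
else
    echo "no NVIDIA kernel module loaded (is the driver installed?)"
fi

if command -v nvidia-smi >/dev/null 2>&1; then
    # Prints just the userspace driver version, e.g. "535.104.05".
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
else
    echo "nvidia-smi not found on PATH"
fi
```

If the two versions printed are different, you have found the mismatch.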

After following the steps you mentioned, I got the following error.

please guide me through the next steps.

@nida.bijapure

Okay, that gives some clarity. The driver version is older than the installed CUDA version, most likely because of mixed install methods.

Your best course of action is to follow this page from NVIDIA,

specifically this section:

Incidentally, if you can create a fresh Ubuntu 22.04 machine, you can use this script to install everything you need for cuOpt. It is super-simple and it works well. Are you able to create a new Ubuntu 22.04 machine?
https://ngc.nvidia.com/resources/ea-reopt-member-zone:setup_ubuntu_for_cuopt
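If you end up cleaning out the mixed install yourself on Ubuntu, something like the sketch below is the usual shape. It is deliberately a dry run: it only prints the commands unless you set RUN=1, and the package name `nvidia-driver-535` is only an example; check what is actually installed on your system first.

```shell
#!/bin/sh
# Sketch: remove mixed NVIDIA installs on Ubuntu and reinstall a single,
# matched driver stack. Dry-run by default: commands are only printed.
# Set RUN=1 to actually execute them.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Remove all driver packages so only one install method remains.
run sudo apt-get purge -y 'nvidia-*' 'libnvidia-*'
run sudo apt-get autoremove -y

# Reinstall one matched driver stack (example package name).
run sudo apt-get update
run sudo apt-get install -y nvidia-driver-535

# Reboot so the matching kernel module gets loaded.
run sudo reboot
```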

Same error here.

 ❯ sudo dmesg | grep NVRM

[3705011.768121] NVRM: API mismatch: the client has the version 535.129.03, but
                 NVRM: this kernel module has the version 535.104.05.  Please
                 NVRM: make sure that this kernel module and all NVIDIA driver
                 NVRM: components have the same version.

which section and which command should I run to downgrade the client version?
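For what it's worth, the dmesg output above already names both versions. Here is a small sketch (helper name is my own) that pulls them out, which makes it clear which side has to change: either downgrade the client packages to match the kernel module, or update the kernel module to match the client.

```shell
#!/bin/sh
# Sketch: extract the two versions from the NVRM mismatch message.

msg='NVRM: API mismatch: the client has the version 535.129.03, but
NVRM: this kernel module has the version 535.104.05.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.'

first_version() { grep -oE '[0-9]+(\.[0-9]+)+' | head -n1; }

client=$(printf '%s\n' "$msg" | grep 'client has' | first_version)
kernel=$(printf '%s\n' "$msg" | grep 'kernel module has' | first_version)

echo "client libraries: $client"
echo "kernel module:    $kernel"
echo "-> make the client packages match $kernel, or the kernel module match $client"
```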

This worked for me.

Rebooting works, but only temporarily. Even without updating any drivers, the system refuses to start new containers after some time. This could be hours, days, or weeks, but it does happen without any apparent reason. It’s driving our operations team nuts, as a hard reboot (of a production system) is the only option. It would be really valuable if anyone has a suggestion on how to debug this.
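One way to catch the moment the versions drift apart (for example, an unattended package upgrade replacing the userspace libraries while the old kernel module stays loaded) is to log both versions periodically and compare. A rough sketch, assuming the standard Linux driver layout; the script name and cron schedule are just examples:

```shell
#!/bin/sh
# Sketch: log a warning whenever the userspace driver version stops
# matching the loaded kernel module. Example cron entry:
#   */10 * * * * /usr/local/bin/check-nvidia-versions.sh >> /var/log/nvidia-drift.log

first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

kernel=$( [ -r /proc/driver/nvidia/version ] && head -n1 /proc/driver/nvidia/version | first_version )
client=$( command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1 )

now=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
if [ -n "$kernel" ] && [ -n "$client" ] && [ "$kernel" != "$client" ]; then
    echo "$now MISMATCH kernel=$kernel client=$client"
elif [ -z "$kernel" ] || [ -z "$client" ]; then
    echo "$now could not read both versions (kernel='$kernel' client='$client')"
else
    echo "$now OK version=$kernel"
fi
```

Correlating the first MISMATCH line with your package manager logs (e.g. /var/log/apt/history.log) should reveal what updated the driver. Note that once the mismatch occurs, nvidia-smi itself may fail; in that case the client version can also be read from the libnvidia-ml.so.* filename on disk.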