These steps may help diagnose the issue. The error can occur when the client programs or packages on the system (the user-space NVIDIA libraries) do not match the version of the NVIDIA driver module that the kernel currently has loaded.
The following command on a Linux system might give us extra information from the system logs:
$ dmesg | grep NVRM
If you have the nvidia-smi executable installed on your system, that might give us a clue too: if nvidia-smi returns a result but docker has errors, then we know it’s something specific to the docker setup. Run it like this:
$ nvidia-smi
Lastly, if there has been an nvidia driver update, but the system has not been rebooted since the update, rebooting the machine may clear the issue.
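To check whether a driver update is still waiting on a reboot, it can help to compare the version of the kernel module that is loaded right now with the one installed on disk. The following is a minimal sketch, not a definitive check: it assumes the module is named nvidia, that /proc/driver/nvidia/version exists while the driver is loaded, and that modinfo is available. extract_version is a helper name made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the first dotted version number
# (for example 535.104.05) found in the text on stdin.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Version of the kernel module that is loaded right now.
# Assumption: the driver exposes /proc/driver/nvidia/version when loaded.
if [ -r /proc/driver/nvidia/version ]; then
    loaded=$(grep 'NVRM version' /proc/driver/nvidia/version | extract_version)
else
    loaded=""
fi

# Version of the module file installed on disk, i.e. what a reboot would load.
# Assumption: the module is named "nvidia" on this system.
installed=$(modinfo -F version nvidia 2>/dev/null || true)

echo "loaded kernel module:    ${loaded:-not loaded}"
echo "installed kernel module: ${installed:-not found}"

if [ -n "$loaded" ] && [ -n "$installed" ] && [ "$loaded" != "$installed" ]; then
    echo "Versions differ: a reboot (or module reload) is probably pending."
fi
```

If the two versions differ, the packages on disk were most likely updated after the running module was loaded, which would be consistent with the mismatch described above.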
❯ sudo dmesg | grep NVRM
[3705011.768121] NVRM: API mismatch: the client has the version 535.129.03, but
NVRM: this kernel module has the version 535.104.05. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
Which section and which command should I run to downgrade the client version?
Rebooting works, but only temporarily. Even without updating any drivers, the system refuses to start new containers after some time. This could be hours, days or weeks, but it does happen without any apparent reason. It’s driving our operations team nuts, as a hard reboot (of a production system) is the only option. It would be really valuable if anyone has a suggestion on how to debug this.
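One way to narrow this down might be to record when the mismatch first shows up in the kernel log and which two versions it names, then compare that time against whatever else changed on the machine. The sketch below only parses the "API mismatch" message in the form quoted earlier in this thread. parse_nvrm_mismatch is a hypothetical helper name, and the idea that something updates the user-space libraries in the background is an assumption, not something confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and, if it contains an
# NVRM "API mismatch" report, print "client=<version> kernel=<version>".
# Returns non-zero when no such report is found.
parse_nvrm_mismatch() {
    # The kernel splits the sentence over several log lines, so join the
    # NVRM lines into a single line before matching.
    joined=$(grep 'NVRM' | tr '\n' ' ')
    client=$(printf '%s' "$joined" |
        sed -nE 's/.*client has the version ([0-9.]+[0-9]).*/\1/p')
    kernel=$(printf '%s' "$joined" |
        sed -nE 's/.*kernel module has the version ([0-9.]+[0-9]).*/\1/p')
    [ -n "$client" ] && [ -n "$kernel" ] || return 1
    echo "client=$client kernel=$kernel"
}

# Example using the log text posted above.
parse_nvrm_mismatch <<'EOF'
[3705011.768121] NVRM: API mismatch: the client has the version 535.129.03, but
NVRM: this kernel module has the version 535.104.05. Please
EOF
# prints: client=535.129.03 kernel=535.104.05
```

On a live system the function could be fed from sudo dmesg and its output logged together with a timestamp, for example from a periodic job, so the first occurrence can be lined up with the package manager's history.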