Nvidia-smi giving error "failed to initialize NVML"

Over the weekend my test machine with CUDA 5.0 configured hit a bit of a snag. I was previously able to run nvidia-smi without any issues, but today I get the error:

# nvidia-smi
Failed to initialize NVML: Function not found.


You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).

The only helpful text it gives is that I should always run with the libnvidia-ml.so in /usr/lib and /usr/lib64, and the files are present in both locations. I'm concerned that a user may have performed an update that is causing the hangup, but I'm not sure where to begin searching for a solution. I can still compile and execute CUDA code, so the problem seems to be limited to nvidia-smi.

Any input on where to search for problematic files would be appreciated.
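
One possible starting point, as a rough sketch only: list every copy of libnvidia-ml.so in the directories the dynamic loader is likely to consult. Directories named in LD_LIBRARY_PATH are normally searched before /usr/lib and /usr/lib64, so a stub copy sitting in one of them could be picked up ahead of the driver's copy. The helper names find_library_copies and default_search_dirs are made up for this illustration, and the directory list is an assumption that may not match a given system's loader configuration.

```python
import os

def find_library_copies(search_dirs, prefix="libnvidia-ml.so"):
    """Return paths of files whose names start with `prefix` in `search_dirs`."""
    found = []
    for directory in search_dirs:
        if not os.path.isdir(directory):
            continue  # skip entries that do not exist on this machine
        for name in sorted(os.listdir(directory)):
            if name.startswith(prefix):
                found.append(os.path.join(directory, name))
    return found

def default_search_dirs():
    """LD_LIBRARY_PATH entries first, then the two locations named in the warning."""
    env_dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    return env_dirs + ["/usr/lib", "/usr/lib64"]

if __name__ == "__main__":
    # Print each copy together with the file it finally resolves to.
    for path in find_library_copies(default_search_dirs()):
        print(path, "->", os.path.realpath(path))
```

Any copy that shows up outside /usr/lib and /usr/lib64, or that does not resolve to the file installed by the driver, would be a candidate for the "problematic file".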

I thought I had solved the problem by reverting the driver for my K20m to version 304.54 (the version bundled with CUDA 5.0), which fixed nvidia-smi. However, I still get the above error (Failed to initialize NVML) when I run any code that includes nvml.h and links against NVML. The wording of the warning seems to imply that my libnvidia-ml.so needs to be the version supplied by the driver I have installed (304.54). When I check the shared objects in /usr/lib and /usr/lib64, they both point to the 304.54 versions (see below), so I am a little confused.

$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 17 11:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 17 11:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 17 11:08 /usr/lib64/libnvidia-ml.so.304.54

Am I misunderstanding the wording of the warning? I am trying to avoid completely reinstalling CUDA 5 but I fear I may have no choice.
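
To separate "the wrong library is being loaded" from "the right library is loaded but the call fails", a small check like the following might help. It is a sketch under assumptions: it loads libnvidia-ml.so.1 by name through ctypes, reports which file the process actually mapped (read from /proc/self/maps, so Linux only), and calls nvmlInit directly. Because it bypasses nvml.h entirely, it only exercises the runtime library, not whatever the header does at compile time. The function names mapped_path and try_nvml_init are hypothetical.

```python
import ctypes

def mapped_path(fragment):
    """Return the path of the first mapped file containing `fragment` (Linux only)."""
    try:
        with open("/proc/self/maps") as maps:
            for line in maps:
                fields = line.split()
                # Lines for file-backed mappings end with the file path.
                if len(fields) >= 6 and fragment in fields[-1]:
                    return fields[-1]
    except OSError:
        pass  # /proc is not available on non-Linux systems
    return None

def try_nvml_init(lib_name="libnvidia-ml.so.1"):
    """Load the NVML library and call nvmlInit(); return (ok, message)."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, f"could not load {lib_name}: {exc}"

    location = mapped_path("libnvidia-ml") or "unknown location"
    init = getattr(lib, "nvmlInit", None)
    if init is None:
        return False, f"{location} does not export nvmlInit"

    init.restype = ctypes.c_int
    rc = init()
    if rc != 0:  # NVML_SUCCESS is 0
        return False, f"nvmlInit from {location} failed with code {rc}"

    lib.nvmlShutdown()
    return True, f"nvmlInit from {location} succeeded"

if __name__ == "__main__":
    print(try_nvml_init())
```

If this succeeds against /usr/lib64/libnvidia-ml.so.304.54 while a compiled program still fails, the difference presumably lies in how that program was built or linked rather than in the installed driver library.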

This question was answered successfully on Stack Overflow, in case anyone else has the same issue.