VMD cannot detect CUDA properly

I recently installed CUDA 7.5 using the 1.5 GB file cuda_7.5.18_linux.run. I needed CUDA so that two programs, VMD and GROMACS, could use the installed Teslas. During my first install attempt I remember having to install the CUDA drivers, the NVIDIA drivers, and something called the GPU Deployment Kit (GDK), though I don't recall the order in which I installed them. Regardless, the installation completed, but the software always had issues detecting the GPU cards, or complained about conflicting versions of the CUDA runtime and GPU drivers. I list some errors at the end of this message.

Now I want to uninstall CUDA, the GDK, and the NVIDIA drivers and reinstall them properly. My question is: is there a preferred order for installing CUDA, the GDK, and the NVIDIA drivers? These are the files I have collected so far:

cuda_7.0.28_linux.run
gdk_linux_amd64_352_55_release.run
NVIDIA-Linux-x86_64-352.63.run

Thank you,
M

Previous errors:

Info) Multithreading available, 72 CPUs detected.
Info) Free system memory: 63648MB (98%)
FATAL: Error inserting nvidia_uvm
(/lib/modules/2.6.32-504.el6.x86_64/weak-updates/nvidia/nvidia-uvm.ko): No such
device
Warning) Detected a mismatch between CUDA runtime and GPU driver
Warning) Check to make sure that GPU drivers are up to date.
Info) No CUDA accelerator devices available.
Info) Detected 1 available TachyonL/OptiX ray tracing accelerator
Info) Dynamically loaded 2 plugins in directory:

Info) Multithreading available, 72 CPUs detected.
Info) Free system memory: 63648MB (98%)
Info) Creating CUDA device pool and initializing hardware…
Info) Detected 1 available CUDA accelerator:
Info) [0] Tesla K20c 13 SM_3.5 @ 0.71 GHz, 4.7GB RAM, AE2, ZCP
Info) Detected 1 available TachyonL/OptiX ray tracing accelerator
Info) Dynamically loaded 2 plugins in directory:

-sh-4.1$ nvidia-smi
Unable to determine the device handle for GPU 0000:04:00.0: The NVIDIA kernel
module detected an issue with GPU interrupts. Consult the “Common Problems”
chapter of the NVIDIA Driver README for details and steps that can be taken
to resolve this issue.

-sh-4.1$ uname -a
Linux protein.x.com 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

Here is some more info about my system and software:

-sh-4.1$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
-sh-4.1$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.6 (Santiago)

-sh-4.1$ nvidia-smi
Failed to initialize NVML: Unknown Error

…and immediately following that…

-sh-4.1$ nvidia-smi
Tue Jan 12 15:47:59 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:04:00.0     Off |                    0 |
| 30%   32C    P0    49W / 225W |     12MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:84:00.0     Off |                    0 |
| 30%   35C    P0    53W / 225W |     12MiB /  4799MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
-sh-4.1$ ./cuda-tests/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery
./cuda-tests/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

The Linux Getting Started Guide gives instructions for both proper installation and proper removal. The GDK is a separate entity; it's generally not necessary for a proper CUDA install. I'm not aware that it would be needed for VMD or GROMACS.

Hi txbob

Before I uninstall CUDA and Nvidia drivers I wanted to check one thing. When I run Gromacs-5.1 with:

gmx mdrun -ntmpi 1 -ntomp 18 -gpu_id 0 -deffnm minim

I see this in the output:

Hardware detected:
  CPU info:
    Vendor: GenuineIntel
    Brand:  Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
    SIMD instructions most likely to fit this hardware: AVX2_256
    SIMD instructions selected at GROMACS compile time: AVX_256
  GPU info:
    Number of GPUs detected: 2
    #0: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible
    #1: NVIDIA Tesla K20c, compute cap.: 3.5, ECC: yes, stat: compatible

Compiled SIMD instructions: AVX_256, GROMACS could use AVX2_256 on this machine, which is better
Reading file minim.tpr, VERSION 5.1 (single precision)
Using 1 MPI thread
Using 18 OpenMP threads
1 GPU user-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0
Note: NVML support was not found (CUDA runtime 7.50, driver 7.50), so your
      Tesla K20c GPU cannot use application clock support to improve performance.

From the last line I assume that the CUDA runtime and the driver versions are the same. But when I run VMD, I get this:

-sh-4.1$ /scratch2/software/vmd-1.9.2/vmd-1.9.2beta1-localinstall/bin/vmd -dispdev text
Info) VMD for LINUXAMD64, version 1.9.2beta1 (September 12, 2014)
Info) http://www.ks.uiuc.edu/Research/vmd/
Info) Email questions and bug reports to vmd@ks.uiuc.edu
Info) Please include this reference in published work using VMD:
Info)    Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
Info)    Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
Info) -------------------------------------------------------------
Info) Multithreading available, 72 CPUs detected.
Info) Free system memory: 63541MB (98%)
Warning) Detected a mismatch between CUDA runtime and GPU driver
Warning) Check to make sure that GPU drivers are up to date.
Info) No CUDA accelerator devices available.
Info) Dynamically loaded 2 plugins in directory:

So what is going on here? GROMACS says the CUDA runtime and driver versions match, while VMD disagrees.
What do you recommend I do?
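For reference, the driver version can be pulled out of /proc/driver/nvidia/version with a small shell sketch. Here it parses a sample line copied from the output above (on a live system you would read the file itself); the awk field pattern is an assumption about the line format:

```shell
# Sketch: extract the kernel-module driver version from a
# /proc/driver/nvidia/version line. We parse a sample copied from above,
# since the real path only exists while the driver is loaded.
sample='NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015'
# Print the first field that looks like a version number (digits.digits).
driver_ver=$(echo "$sample" | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+\.[0-9]+$/) { print $i; exit }}')
echo "kernel-module driver version: $driver_ver"
```

Comparing this number against the runtime version that deviceQuery reports shows at a glance whether the two actually agree.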

If deviceQuery is reporting what you say it is, then your CUDA install is broken. The VMD output seems to be consistent with that. I can’t really explain what is going on if GROMACS is working.

I also should correct a previous statement. It appears that GROMACS is linking against NVML, so the GDK may be needed for GROMACS. However the GDK is typically not necessary to have a functional CUDA install, and the order of installation of the GDK should not matter. I would install the GDK last, after going through the installation guide including the verification steps to verify that the CUDA install is done correctly.
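Based on the advice above, a runfile-based sequence might look like the following sketch. The filenames are the ones listed in the first post; decline the driver bundled inside the CUDA runfile when prompted, since the standalone driver being installed is newer:

```shell
# Sketch only -- run from a text console with X stopped.
# 1. Install the standalone driver first.
sudo sh NVIDIA-Linux-x86_64-352.63.run
# 2. Install the CUDA toolkit; answer "no" when the runfile offers its
#    own (older) bundled driver.
sudo sh cuda_7.0.28_linux.run
# 3. Install the GDK last, after verifying the CUDA install works.
sudo sh gdk_linux_amd64_352_55_release.run
```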

deviceQuery now gives me proper output, though the results are really inconsistent. I am ready to remove CUDA. Should I also remove the NVIDIA drivers? I cannot find /usr/bin/nvidia-uninstall, only these:

-sh-4.1$ ls /usr/bin/nvidia-
nvidia-bug-report.sh     nvidia-debugdump         nvidia-settings
nvidia-cuda-mps-control  nvidia-healthmon-tests/  nvidia-smi
nvidia-cuda-mps-server   nvidia-modprobe          nvidia-xconfig
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20c"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
 ...
 Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K20c"
 ...
 Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K20c (GPU0) -> Tesla K20c (GPU1) : No
> Peer access from Tesla K20c (GPU1) -> Tesla K20c (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K20c, Device1 = Tesla K20c
Result = PASS

I also remember installing these packages:

Installed:
  xorg-x11-drv-nvidia.x86_64 1:352.39-1.el6

Dependency Installed:
  xorg-x11-drv-nvidia-libs.x86_64 1:352.39-1.el6

Should these also be removed?

You seem to have mixed runfile installation with installation of packages (I guess). That is a recipe for trouble.

I would follow the removal steps listed in the installation guide for both any relevant packages you may have installed via the package manager method, as well as any components (driver, CUDA) that you may have installed via the runfile installer method.

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation

Then pick one method or the other (runfile installer, or package manager) and use that method to install both the driver and the CUDA toolkit.

(/usr/bin/nvidia-uninstall would only be present if you had previously installed the driver via the runfile installer method. If it’s not there, presumably you installed the driver using the package manager method.)
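That check can be scripted as a quick sanity test of which method installed the driver (a sketch; the uninstaller path is the standard one the runfile installer uses):

```shell
# The runfile driver installer leaves /usr/bin/nvidia-uninstall behind;
# a package-manager install does not.
if [ -x /usr/bin/nvidia-uninstall ]; then
    echo "driver installed via runfile: run nvidia-uninstall to remove it"
else
    echo "no runfile uninstaller found: remove the driver packages instead"
fi
```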

I followed the compatibility matrix in the installation guide and decided to install CUDA 7.0 instead of 7.5. Since I used the runfile method the first time, I did the same this time, without removing 7.5 first. That went well and, surprisingly, very quickly.

I also wanted to downgrade the NVIDIA drivers. You are right about my previous attempts: I do remember trying to install via the runfile method, which somehow failed, and then trying the RPM method, which pulled in a bunch of dependencies that I tried to fix. I remember a major headache with libX11 and libX11-common in particular; I first installed libX11-1.5 but then had to replace it with libX11-1.6. In the end I somehow ended up with a working nvidia-smi executable, at which point I gave up trying to clean up the mess.

Now I have CUDA 7.0 and NVIDIA driver 346.46.

VMD has stopped giving the errors about
a) driver mismatch
b) cannot find CUDA
c) FATAL: Error inserting nvidia_uvm nvidia-uvm.ko No such device

Gromacs-5.1 currently dislikes CUDA 7.0 since it was compiled against CUDA 7.5. I will report back after recompiling it properly.
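For reference, rebuilding GROMACS against the 7.0 toolkit is mostly a matter of reconfiguring. A sketch, where /usr/local/cuda-7.0 is an assumption based on the default runfile install prefix:

```shell
# Sketch: reconfigure and rebuild GROMACS against the CUDA 7.0 toolkit.
# Adjust CUDA_TOOLKIT_ROOT_DIR if the toolkit was installed elsewhere.
cd gromacs-5.1/build
cmake .. -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-7.0
make -j
```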

-sh-4.1$ /scratch2/software/cuda7.0/CUDAsamples/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release/deviceQuery
/scratch2/software/cuda7.0/CUDAsamples/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20c"
  CUDA Driver Version / Runtime Version          7.0 / 7.0
  CUDA Capability Major/Minor version number:    3.5
...

-sh-4.1$ nvidia-smi
Thu Jan 14 14:45:51 2016
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |

Dear top500, how did your GROMACS reinstallation go?

I'm currently trying to install GROMACS 5.1.4 with CUDA support on Lubuntu 15.10 (VMD 1.9.2 already detects CUDA), but the compilation gave me some errors when linking libgromacs.so.1.4.0.

I used the following compilation options:
cmake -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCMAKE_C_COMPILER=gcc-4.9 -DCMAKE_CXX_COMPILER=gcc-4.9 …

what is your suggestion?

lam@lam-K53SV:~/Downloads/gromacs-5.1.4/build$ nvidia-smi
Fri Oct 7 15:34:43 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 540M     Off  | 0000:01:00.0     N/A |                  N/A |
| N/A   57C   P12    N/A /  N/A |     73MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015
GCC version: gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2)

Sorry for my seemingly stupid question. I used the following compilation options and it worked:

cmake -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda -DCMAKE_C_COMPILER=gcc-4.9 -DCMAKE_CXX_COMPILER=g++-4.9 …

The problem lay in the CXX compiler setting, I think.

solved