Linux server unable to recognize GPU

We are having repository issues on our Linux servers, so we could not get you the output. Based on the server versions provided below, can you let us know the correct packages that need to be installed on our servers:

  1. SUSE Linux Enterprise Server 12 SP2 (x86_64) - Kernel \r (\l)

  2. SUSE Linux Enterprise Server 12 SP2 (x86_64) - Kernel \r (\l)

On the server you provided the log for, the driver was running.
The problem is that there are two ways to install the driver:

  • runfile installer
  • repo package

Mixing both can cause major breakage, so it’s crucial to know which method was used initially. Without knowing that, I can’t give you any advice on what package/repo/installer to use. One way to tell the two methods apart is sketched below.
What kind of “repository issues” are you running into?
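
A quick way to tell which method was used (a sketch; exact paths can vary by driver version):

# the runfile installer leaves its own uninstaller behind
ls -l /usr/bin/nvidia-uninstall
# a repo install shows up as installed packages instead
rpm -qa | grep -i nvidia
zypper search -i nvidia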

cdcvillx279:/var/log # zypper search nvidia
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...

S  | Name                                          | Summary                                                                | Type
---+-----------------------------------------------+------------------------------------------------------------------------+--------
i  | nvidia-computeG04                             | NVIDIA driver for computing with GPGPU                                 | package
i+ | nvidia-diag-driver-local-repo-sles12-tr5      | nvidia-diag-driver-local repository configuration files                | package
i+ | nvidia-diag-driver-local-repo-sles122-384.145 | nvidia-diag-driver-local repository configuration files                | package
   | nvidia-diagnosticG04                          | Diagnostic utilities for the NVIDIA driver                             | package
i  | nvidia-gfxG04-kmp-default                     | NVIDIA graphics driver kernel module for GeForce 400 series and newer  | package
i  | nvidia-glG04                                  | NVIDIA OpenGL libraries for OpenGL acceleration                        | package
i  | x11-video-nvidiaG04                           | NVIDIA graphics driver for GeForce 400 series and newer                | package
cdcvillx279:/var/log # zypper search cuda
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...

S  | Name                        | Summary                                             | Type
---+-----------------------------+-----------------------------------------------------+--------
   | cuda                        | CUDA meta-package                                   | package
   | cuda-9-0                    | CUDA 9.0 meta-package                               | package
i  | cuda-command-line-tools-9-0 | CUDA command-line tools                             | package
i  | cuda-core-9-0               | CUDA core tools                                     | package
i  | cuda-cublas-9-0             | CUBLAS native runtime libraries                     | package
i  | cuda-cublas-dev-9-0         | CUBLAS native dev links, headers                    | package
i  | cuda-cudart-9-0             | CUDA Runtime native runtime libraries               | package
i  | cuda-cudart-dev-9-0         | CUDA Runtime native dev links, headers              | package
i  | cuda-cufft-9-0              | CUFFT native runtime libraries                      | package
i  | cuda-cufft-dev-9-0          | CUFFT native dev links, headers                     | package
i  | cuda-curand-9-0             | CURAND native runtime libraries                     | package
i  | cuda-curand-dev-9-0         | CURAND native dev links, headers                    | package
i  | cuda-cusolver-9-0           | CUSOLVER native runtime libraries                   | package
i  | cuda-cusolver-dev-9-0       | CUSOLVER native dev links, headers                  | package
i  | cuda-cusparse-9-0           | CUSPARSE native runtime libraries                   | package
i  | cuda-cusparse-dev-9-0       | CUSPARSE native dev links, headers                  | package
   | cuda-demo-suite-9-0         | Set of pre-built demos using CUDA                   | package
i  | cuda-documentation-9-0      | CUDA documentation                                  | package
i  | cuda-driver-dev-9-0         | CUDA Driver native dev stub library                 | package
i+ | cuda-drivers                | CUDA Driver meta-package                            | package
   | cuda-drivers-diagnostic     | CUDA Driver diagnostic meta-package                 | package
   | cuda-gdb-src-9-0            | Contains the source code for cuda-gdb               | package
i  | cuda-libraries-9-0          | CUDA Libraries 9.0 meta-package                     | package
i  | cuda-libraries-dev-9-0      | CUDA Libraries 9.0 development meta-package         | package
i  | cuda-license-9-0            | CUDA licenses                                       | package
   | cuda-minimal-build-9-0      | Minimal CUDA 9.0 toolkit build packages.            | package
i  | cuda-misc-headers-9-0       | CUDA miscellaneous headers                          | package
i  | cuda-npp-9-0                | NPP native runtime libraries                        | package
i  | cuda-npp-dev-9-0            | NPP native dev links, headers                       | package
i  | cuda-nvgraph-9-0            | NVGRAPH native runtime libraries                    | package
i  | cuda-nvgraph-dev-9-0        | NVGRAPH native dev links, headers                   | package
i  | cuda-nvml-dev-9-0           | NVML native dev links, headers.                     | package
i  | cuda-nvrtc-9-0              | NVRTC native runtime libraries                      | package
i  | cuda-nvrtc-dev-9-0          | NVRTC native dev links, headers                     | package
i+ | cuda-repo-sles122-9-0-local | cuda repository configuration files                 | package
   | cuda-runtime-9-0            | CUDA Runtime 9.0 meta-package                       | package
i  | cuda-samples-9-0            | Contains an extensive set of example CUDA programs  | package
i  | cuda-toolkit-9-0            | CUDA Toolkit 9.0 meta-package                       | package
i  | cuda-visual-tools-9-0       | CUDA visual tools                                   | package

For our other server, we are getting the output below:
mphpcadmin@cdcvillx141:~> sudo zypper search "nvidia*"
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.

mphpcadmin@cdcvillx141:~> zypper search cuda
Loading repository data...
Reading installed packages...
No matching items found.

Users are still not able to submit jobs using GPUs through these servers. The error log states: “Error initializing the CUDA Driver NO_DEVICE WARNING: GPUAcceleration disabled”

On the server with cuda installed, please run
/usr/local/cuda-9.0/extras/demo_suite/deviceQuery
and post its output.
On the other server, please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

For the server with CUDA installed, we do not have a demo_suite directory under extras; we only have the following directories:
cdcvillx279:/usr/local/cuda-9.0/extras # ls -l
total 0
drwxr-xr-x 5 root root 66 Apr 29 2019 CUPTI
drwxr-xr-x 4 root root 52 Apr 29 2019 Debugger

For the other server, please find the bug report attached: nvidia-bug-report.log.gz (84.1 KB)

On the server with cuda installed, please install the demo suite:
zypper install cuda-demo-suite-9-0
then try to run /usr/local/cuda-9.0/extras/demo_suite/deviceQuery again.

On both servers, please check whether dkms is installed:
zypper search "dkms*"
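
For reference, if the NVIDIA kernel module were managed by dkms, dkms status would print a line roughly like the one below (illustrative only; the exact kernel string and fields vary with the dkms and driver versions):

# hypothetical dkms status output for a dkms-managed driver
nvidia, 418.87.01, 4.4.180-94.100-default, x86_64: installed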

I installed cuda-demo-suite-9-0. The output from deviceQuery is listed below.

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: “Tesla K80”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “Tesla K80”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 133 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
Result = PASS

mphpcadmin@cdcvillx279:~> zypper search "dkms*"
Loading repository data...
Reading installed packages...
No matching items found.
mphpcadmin@cdcvillx279:~> sudo zypper search "dkms*"
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.

mphpcadmin@cdcvillx141:~> sudo zypper search "dkms*"
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.
mphpcadmin@cdcvillx141:~> ^C
mphpcadmin@cdcvillx141:~> sudo zypper search dkms
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.
mphpcadmin@cdcvillx141:~> zypper search dkms
Loading repository data...
Reading installed packages...
No matching items found.

The server without cuda installed is defunct, with some driver leftovers of unknown origin; it has to be cleaned and a new driver installed. Let’s put that aside for later (a rough cleanup sketch follows for reference).
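
A cleanup along these lines should work when we get to it (a sketch, assuming the leftovers came from a runfile install; adjust if packages turn out to be involved):

# the runfile installer ships its own uninstaller; run it if present
sudo nvidia-uninstall
# remove any stray repo packages as well (zypper accepts wildcards here)
sudo zypper remove 'nvidia*'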

The server with cuda installed (server A) seems to be fully functional. What kind of jobs/applications are started there that fail with “CUDA Driver NO_DEVICE WARNING”?

We use Abaqus/Standard, a finite element solver, to run jobs.

Which abaqus version do you have installed? On older versions, it might be necessary to set the GPUs to exclusive mode:
https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/abaqus/
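
Setting exclusive mode would look like this, should it turn out to be needed (a sketch; GPU indices 0 and 1 taken from the deviceQuery output above, and the setting does not survive a reboot unless made persistent):

# put both Tesla K80s into exclusive-process compute mode
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-smi -i 1 -c EXCLUSIVE_PROCESS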

Please post the output of
dkms status
and
ls -l /usr/lib64/libGL*
on both servers.

We are using the Abaqus 2017 and 2019 versions on these servers. Please find the requested output below:
cdcvillx279:~ # dkms status
If 'dkms' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf dkms
cdcvillx279:~ # ls -l /usr/lib64/libGL*
lrwxrwxrwx 1 root root 18 May 8 2020 /usr/lib64/libGLESv2.so.2 -> libGLESv2.so.2.0.0
-rwxr-xr-x 1 root root 30544 Oct 14 2016 /usr/lib64/libGLESv2.so.2.0.0
lrwxrwxrwx 1 root root 14 May 8 2020 /usr/lib64/libGL.so.1 -> libGL.so.1.2.0
lrwxrwxrwx 1 root root 14 May 8 2020 /usr/lib64/libGL.so.1.2 -> libGL.so.1.2.0
-rwxr-xr-x 1 root root 430760 Oct 14 2016 /usr/lib64/libGL.so.1.2.0
lrwxrwxrwx 1 root root 15 Mar 29 2019 /usr/lib64/libGLU.so.1 -> libGLU.so.1.3.1
-rwxr-xr-x 1 root root 449360 Oct 14 2016 /usr/lib64/libGLU.so.1.3.1
cdcvillx279:~ # cnf dkms
dkms: command not found

The other server, without CUDA installed:

cdcvillx141:~ # dkms status
If 'dkms' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf dkms
cdcvillx141:~ # ls -l /usr/lib64/libGL*
-rwxr-xr-x 1 root root 732400 Oct 30 2019 /usr/lib64/libGLdispatch.so.0
lrwxrwxrwx 1 root root 32 Oct 30 2019 /usr/lib64/libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.418.87.01
-rwxr-xr-x 1 root root 60936 Oct 30 2019 /usr/lib64/libGLESv1_CM_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 17 Oct 30 2019 /usr/lib64/libGLESv1_CM.so -> libGLESv1_CM.so.1
lrwxrwxrwx 1 root root 21 Oct 30 2019 /usr/lib64/libGLESv1_CM.so.1 -> libGLESv1_CM.so.1.2.0
-rwxr-xr-x 1 root root 43696 Oct 30 2019 /usr/lib64/libGLESv1_CM.so.1.2.0
lrwxrwxrwx 1 root root 29 Oct 30 2019 /usr/lib64/libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.418.87.01
-rwxr-xr-x 1 root root 110808 Oct 30 2019 /usr/lib64/libGLESv2_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 14 Oct 30 2019 /usr/lib64/libGLESv2.so -> libGLESv2.so.2
lrwxrwxrwx 1 root root 18 Apr 20 2020 /usr/lib64/libGLESv2.so.2 -> libGLESv2.so.2.1.0
-rwxr-xr-x 1 root root 30544 Aug 7 2017 /usr/lib64/libGLESv2.so.2.0.0
-rwxr-xr-x 1 root root 83280 Oct 30 2019 /usr/lib64/libGLESv2.so.2.1.0
-rw-r--r-- 1 root root 665 Oct 30 2019 /usr/lib64/libGL.la
lrwxrwxrwx 1 root root 10 Oct 30 2019 /usr/lib64/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 14 Apr 20 2020 /usr/lib64/libGL.so.1 -> libGL.so.1.7.0
lrwxrwxrwx 1 root root 14 Apr 20 2020 /usr/lib64/libGL.so.1.2 -> libGL.so.1.2.0
-rwxr-xr-x 1 root root 430760 Aug 7 2017 /usr/lib64/libGL.so.1.2.0
-rwxr-xr-x 1 root root 685848 Oct 30 2019 /usr/lib64/libGL.so.1.7.0
lrwxrwxrwx 1 root root 15 Oct 29 2019 /usr/lib64/libGLU.so.1 -> libGLU.so.1.3.1
-rwxr-xr-x 1 root root 449360 Oct 14 2016 /usr/lib64/libGLU.so.1.3.1
lrwxrwxrwx 1 root root 26 Oct 30 2019 /usr/lib64/libGLX_indirect.so.0 -> libGLX_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 26 Oct 30 2019 /usr/lib64/libGLX_nvidia.so.0 -> libGLX_nvidia.so.418.87.01
-rwxr-xr-x 1 root root 1275664 Oct 30 2019 /usr/lib64/libGLX_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 11 Oct 30 2019 /usr/lib64/libGLX.so -> libGLX.so.0
-rwxr-xr-x 1 root root 65096 Oct 30 2019 /usr/lib64/libGLX.so.0
cdcvillx141:~ # cnf dkms
dkms: command not found

We have one server where the GPUs are working fine. So do you think we are missing some driver installation?

On server A, driver and cuda are running fine according to deviceQuery. Abaqus has a minimum cuda driver requirement of 7.5; I don’t know about abaqus 2019, as there are no publicly available docs.
Please create an nvidia-bug-report.log on the server where abaqus works, and also run
export >export.txt
on it and attach that. Please also note down the exact abaqus version that’s running on that server. Maybe this will shed some light on the situation.
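
In the meantime, either of the following shows which driver is actually loaded there (assuming nvidia-smi is on the PATH):

# driver version as reported by the loaded kernel module
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version,name --format=csv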

Hi Generix, please find attached the nvidia bug log from the server where the GPUs are working. Most of the users are on the Abaqus 2017 version. nvidia-bug-report.log.gz (1.5 MB)

That’s a very simple runfile install of the 418 driver. Nothing special, only a newer driver. Maybe try this on the defunct server without cuda installed: download
https://http.download.nvidia.com/XFree86/Linux-x86_64/460.67/NVIDIA-Linux-x86_64-460.67.run
and run it with the option

--no-opengl-files

If it asks to run nvidia-xconfig, choose ‘no’.
Then reboot and check if abaqus works.
Don’t do this on the server with cuda installed. The full command sequence is sketched below.
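
Roughly, the full sequence on the defunct server (a sketch; run it from a text console, and answer ‘no’ to the nvidia-xconfig prompt):

# fetch and run the 460.67 installer, skipping its OpenGL libraries
wget https://http.download.nvidia.com/XFree86/Linux-x86_64/460.67/NVIDIA-Linux-x86_64-460.67.run
sudo sh NVIDIA-Linux-x86_64-460.67.run --no-opengl-files
sudo reboot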

Hi, thanks, this solution worked on the server where CUDA was not installed; the GPU has started working. So what should we do for the server where CUDA is installed?