Linux server unable to recognize GPU

We are having repository issues on our Linux servers, so we could not get you the output. Based on the server versions provided below, can you let us know the correct packages that need to be installed on our servers:

  1. SUSE Linux Enterprise Server 12 SP2 (x86_64) - Kernel \r (\l)

  2. SUSE Linux Enterprise Server 12 SP2 (x86_64) - Kernel \r (\l)

On the server you provided the log for, the driver was running.
The problem is that there are two ways to install the driver:

  • runfile installer
  • repo package

Mixing both can cause major breakage, so it’s crucial to know which method was used initially. Without knowing that, I can’t give you any advice on what package/repo/installer to use. One way to tell the two methods apart is sketched below.
What kind of “repository issues” are you running into?
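
A quick way to tell which method was used (a sketch; exact paths can vary by driver version):

# the runfile installer leaves its own uninstaller behind
ls -l /usr/bin/nvidia-uninstall
# a repo install shows up as installed packages instead
rpm -qa | grep -i nvidia
zypper search -i nvidia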

cdcvillx279:/var/log # zypper search nvidia
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...

S  | Name                                          | Summary                                                                | Type
---+-----------------------------------------------+------------------------------------------------------------------------+--------
i  | nvidia-computeG04                             | NVIDIA driver for computing with GPGPU                                 | package
i+ | nvidia-diag-driver-local-repo-sles12-tr5      | nvidia-diag-driver-local repository configuration files                | package
i+ | nvidia-diag-driver-local-repo-sles122-384.145 | nvidia-diag-driver-local repository configuration files                | package
   | nvidia-diagnosticG04                          | Diagnostic utilities for the NVIDIA driver                             | package
i  | nvidia-gfxG04-kmp-default                     | NVIDIA graphics driver kernel module for GeForce 400 series and newer  | package
i  | nvidia-glG04                                  | NVIDIA OpenGL libraries for OpenGL acceleration                        | package
i  | x11-video-nvidiaG04                           | NVIDIA graphics driver for GeForce 400 series and newer                | package
cdcvillx279:/var/log # zypper search cuda
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...

S  | Name                        | Summary                                             | Type
---+-----------------------------+-----------------------------------------------------+--------
   | cuda                        | CUDA meta-package                                   | package
   | cuda-9-0                    | CUDA 9.0 meta-package                               | package
i  | cuda-command-line-tools-9-0 | CUDA command-line tools                             | package
i  | cuda-core-9-0               | CUDA core tools                                     | package
i  | cuda-cublas-9-0             | CUBLAS native runtime libraries                     | package
i  | cuda-cublas-dev-9-0         | CUBLAS native dev links, headers                    | package
i  | cuda-cudart-9-0             | CUDA Runtime native runtime libraries               | package
i  | cuda-cudart-dev-9-0         | CUDA Runtime native dev links, headers              | package
i  | cuda-cufft-9-0              | CUFFT native runtime libraries                      | package
i  | cuda-cufft-dev-9-0          | CUFFT native dev links, headers                     | package
i  | cuda-curand-9-0             | CURAND native runtime libraries                     | package
i  | cuda-curand-dev-9-0         | CURAND native dev links, headers                    | package
i  | cuda-cusolver-9-0           | CUSOLVER native runtime libraries                   | package
i  | cuda-cusolver-dev-9-0       | CUSOLVER native dev links, headers                  | package
i  | cuda-cusparse-9-0           | CUSPARSE native runtime libraries                   | package
i  | cuda-cusparse-dev-9-0       | CUSPARSE native dev links, headers                  | package
   | cuda-demo-suite-9-0         | Set of pre-built demos using CUDA                   | package
i  | cuda-documentation-9-0      | CUDA documentation                                  | package
i  | cuda-driver-dev-9-0         | CUDA Driver native dev stub library                 | package
i+ | cuda-drivers                | CUDA Driver meta-package                            | package
   | cuda-drivers-diagnostic     | CUDA Driver diagnostic meta-package                 | package
   | cuda-gdb-src-9-0            | Contains the source code for cuda-gdb               | package
i  | cuda-libraries-9-0          | CUDA Libraries 9.0 meta-package                     | package
i  | cuda-libraries-dev-9-0      | CUDA Libraries 9.0 development meta-package         | package
i  | cuda-license-9-0            | CUDA licenses                                       | package
   | cuda-minimal-build-9-0      | Minimal CUDA 9.0 toolkit build packages.            | package
i  | cuda-misc-headers-9-0       | CUDA miscellaneous headers                          | package
i  | cuda-npp-9-0                | NPP native runtime libraries                        | package
i  | cuda-npp-dev-9-0            | NPP native dev links, headers                       | package
i  | cuda-nvgraph-9-0            | NVGRAPH native runtime libraries                    | package
i  | cuda-nvgraph-dev-9-0        | NVGRAPH native dev links, headers                   | package
i  | cuda-nvml-dev-9-0           | NVML native dev links, headers.                     | package
i  | cuda-nvrtc-9-0              | NVRTC native runtime libraries                      | package
i  | cuda-nvrtc-dev-9-0          | NVRTC native dev links, headers                     | package
i+ | cuda-repo-sles122-9-0-local | cuda repository configuration files                 | package
   | cuda-runtime-9-0            | CUDA Runtime 9.0 meta-package                       | package
i  | cuda-samples-9-0            | Contains an extensive set of example CUDA programs  | package
i  | cuda-toolkit-9-0            | CUDA Toolkit 9.0 meta-package                       | package
i  | cuda-visual-tools-9-0       | CUDA visual tools                                   | package

For our other server, we are getting the output below:
mphpcadmin@cdcvillx141:~> sudo zypper search "nvidia*"
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.

mphpcadmin@cdcvillx141:~> zypper search cuda
Loading repository data...
Reading installed packages...
No matching items found.

Users are still not able to submit jobs using GPUs through these servers. The error log states: “Error initializing the CUDA Driver NO_DEVICE WARNING: GPUAcceleration disabled”

On the server with cuda installed, please run
/usr/local/cuda-9.0/extras/demo_suite/deviceQuery
and post its output.
On the other server, please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

For the server with CUDA installed, we do not have a demo_suite directory under extras; we only have the following directories:
cdcvillx279:/usr/local/cuda-9.0/extras # ls -l
total 0
drwxr-xr-x 5 root root 66 Apr 29 2019 CUPTI
drwxr-xr-x 4 root root 52 Apr 29 2019 Debugger

For the other server, please find the bug report attached: nvidia-bug-report.log.gz (84.1 KB)

On the server with cuda installed, please install the demo suite:
zypper install cuda-demo-suite-9-0
then try to run /usr/local/cuda-9.0/extras/demo_suite/deviceQuery again.

On both servers, please check whether dkms is installed:
zypper search "dkms*"
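
For reference, if the NVIDIA kernel module were managed by dkms, dkms status would print a line roughly like the one below (illustrative only; the exact kernel string and fields vary with the dkms and driver versions):

# hypothetical dkms status output for a dkms-managed driver
nvidia, 418.87.01, 4.4.180-94.100-default, x86_64: installed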

I installed cuda-demo-suite-9-0. The output from deviceQuery is listed below.

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: “Tesla K80”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “Tesla K80”
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 133 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2, Device0 = Tesla K80, Device1 = Tesla K80
Result = PASS

mphpcadmin@cdcvillx279:~> zypper search "dkms*"
Loading repository data...
Reading installed packages...
No matching items found.
mphpcadmin@cdcvillx279:~> sudo zypper search "dkms*"
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.

mphpcadmin@cdcvillx141:~> sudo zypper search "dkms*"
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.
mphpcadmin@cdcvillx141:~> ^C
mphpcadmin@cdcvillx141:~> sudo zypper search dkms
Refreshing service 'SUSE_Linux_Enterprise_Server_12_SP2_x86_64'.
Loading repository data...
Reading installed packages...
No matching items found.
mphpcadmin@cdcvillx141:~> zypper search dkms
Loading repository data...
Reading installed packages...
No matching items found.

The server without cuda installed is defunct, with some driver leftovers of unknown origin; it has to be cleaned and a new driver installed. Let’s put that aside for later (a rough cleanup sketch follows for reference).
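
A cleanup along these lines should work when we get to it (a sketch, assuming the leftovers came from a runfile install; adjust if packages turn out to be involved):

# the runfile installer ships its own uninstaller; run it if present
sudo nvidia-uninstall
# remove any stray repo packages as well (zypper accepts wildcards here)
sudo zypper remove 'nvidia*'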

The server with cuda installed (server A) seems to be fully functional. What kind of jobs/applications are started there that fail with “CUDA Driver NO_DEVICE WARNING”?

We use Abaqus/Standard, a finite element solver, to run jobs.

Which abaqus version do you have installed? On older versions, it might be necessary to set the GPUs to exclusive mode:
https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/abaqus/
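
Setting exclusive mode would look like this, should it turn out to be needed (a sketch; GPU indices 0 and 1 taken from the deviceQuery output above, and the setting does not survive a reboot unless made persistent):

# put both Tesla K80s into exclusive-process compute mode
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-smi -i 1 -c EXCLUSIVE_PROCESS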

Please post the output of
dkms status
and
ls -l /usr/lib64/libGL*
on both servers.

We are using the Abaqus 2017 and 2019 versions on these servers. Please find the requested output below:
cdcvillx279:~ # dkms status
If 'dkms' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf dkms
cdcvillx279:~ # ls -l /usr/lib64/libGL*
lrwxrwxrwx 1 root root 18 May 8 2020 /usr/lib64/libGLESv2.so.2 -> libGLESv2.so.2.0.0
-rwxr-xr-x 1 root root 30544 Oct 14 2016 /usr/lib64/libGLESv2.so.2.0.0
lrwxrwxrwx 1 root root 14 May 8 2020 /usr/lib64/libGL.so.1 -> libGL.so.1.2.0
lrwxrwxrwx 1 root root 14 May 8 2020 /usr/lib64/libGL.so.1.2 -> libGL.so.1.2.0
-rwxr-xr-x 1 root root 430760 Oct 14 2016 /usr/lib64/libGL.so.1.2.0
lrwxrwxrwx 1 root root 15 Mar 29 2019 /usr/lib64/libGLU.so.1 -> libGLU.so.1.3.1
-rwxr-xr-x 1 root root 449360 Oct 14 2016 /usr/lib64/libGLU.so.1.3.1
cdcvillx279:~ # cnf dkms
dkms: command not found

The other server, without CUDA installed:

cdcvillx141:~ # dkms status
If 'dkms' is not a typo you can use command-not-found to lookup the package that contains it, like this:
cnf dkms
cdcvillx141:~ # ls -l /usr/lib64/libGL*
-rwxr-xr-x 1 root root 732400 Oct 30 2019 /usr/lib64/libGLdispatch.so.0
lrwxrwxrwx 1 root root 32 Oct 30 2019 /usr/lib64/libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.418.87.01
-rwxr-xr-x 1 root root 60936 Oct 30 2019 /usr/lib64/libGLESv1_CM_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 17 Oct 30 2019 /usr/lib64/libGLESv1_CM.so -> libGLESv1_CM.so.1
lrwxrwxrwx 1 root root 21 Oct 30 2019 /usr/lib64/libGLESv1_CM.so.1 -> libGLESv1_CM.so.1.2.0
-rwxr-xr-x 1 root root 43696 Oct 30 2019 /usr/lib64/libGLESv1_CM.so.1.2.0
lrwxrwxrwx 1 root root 29 Oct 30 2019 /usr/lib64/libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.418.87.01
-rwxr-xr-x 1 root root 110808 Oct 30 2019 /usr/lib64/libGLESv2_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 14 Oct 30 2019 /usr/lib64/libGLESv2.so -> libGLESv2.so.2
lrwxrwxrwx 1 root root 18 Apr 20 2020 /usr/lib64/libGLESv2.so.2 -> libGLESv2.so.2.1.0
-rwxr-xr-x 1 root root 30544 Aug 7 2017 /usr/lib64/libGLESv2.so.2.0.0
-rwxr-xr-x 1 root root 83280 Oct 30 2019 /usr/lib64/libGLESv2.so.2.1.0
-rw-r--r-- 1 root root 665 Oct 30 2019 /usr/lib64/libGL.la
lrwxrwxrwx 1 root root 10 Oct 30 2019 /usr/lib64/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 14 Apr 20 2020 /usr/lib64/libGL.so.1 -> libGL.so.1.7.0
lrwxrwxrwx 1 root root 14 Apr 20 2020 /usr/lib64/libGL.so.1.2 -> libGL.so.1.2.0
-rwxr-xr-x 1 root root 430760 Aug 7 2017 /usr/lib64/libGL.so.1.2.0
-rwxr-xr-x 1 root root 685848 Oct 30 2019 /usr/lib64/libGL.so.1.7.0
lrwxrwxrwx 1 root root 15 Oct 29 2019 /usr/lib64/libGLU.so.1 -> libGLU.so.1.3.1
-rwxr-xr-x 1 root root 449360 Oct 14 2016 /usr/lib64/libGLU.so.1.3.1
lrwxrwxrwx 1 root root 26 Oct 30 2019 /usr/lib64/libGLX_indirect.so.0 -> libGLX_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 26 Oct 30 2019 /usr/lib64/libGLX_nvidia.so.0 -> libGLX_nvidia.so.418.87.01
-rwxr-xr-x 1 root root 1275664 Oct 30 2019 /usr/lib64/libGLX_nvidia.so.418.87.01
lrwxrwxrwx 1 root root 11 Oct 30 2019 /usr/lib64/libGLX.so -> libGLX.so.0
-rwxr-xr-x 1 root root 65096 Oct 30 2019 /usr/lib64/libGLX.so.0
cdcvillx141:~ # cnf dkms
dkms: command not found

We have one server where the GPUs are working fine. So do you think we are missing some driver installation?

On server A, driver and cuda are running fine according to deviceQuery. Abaqus has a minimum cuda driver requirement of 7.5; I don’t know about abaqus 2019, as there are no publicly available docs.
Please create an nvidia-bug-report.log on the server where abaqus works, and also run
export >export.txt
on it and attach that. Please also note down the exact abaqus version that’s running on that server. Maybe this will shed some light on the situation.
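
In the meantime, either of the following shows which driver is actually loaded there (assuming nvidia-smi is on the PATH):

# driver version as reported by the loaded kernel module
cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version,name --format=csv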

Hi Generix, please find attached the nvidia bug log from the server where the GPUs are working. Most of the users are on the Abaqus 2017 version. nvidia-bug-report.log.gz (1.5 MB)

That’s a very simple runfile install of the 418 driver. Nothing special, only a newer driver. Maybe try this on the defunct server without cuda installed: download
https://http.download.nvidia.com/XFree86/Linux-x86_64/460.67/NVIDIA-Linux-x86_64-460.67.run
and run it with the option

--no-opengl-files

If it asks to run nvidia-xconfig, choose ‘no’.
Then reboot and check if abaqus works.
Don’t do this on the server with cuda installed. The full command sequence is sketched below.
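
Roughly, the full sequence on the defunct server (a sketch; run it from a text console, and answer ‘no’ to the nvidia-xconfig prompt):

# fetch and run the 460.67 installer, skipping its OpenGL libraries
wget https://http.download.nvidia.com/XFree86/Linux-x86_64/460.67/NVIDIA-Linux-x86_64-460.67.run
sudo sh NVIDIA-Linux-x86_64-460.67.run --no-opengl-files
sudo reboot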

Hi, thanks, this solution worked on the server where CUDA was not installed; the GPU has started working. So what should we do for the server where CUDA is installed?