Runtime problem with PGFORTRAN

We have an HPE server equipped with two NVIDIA K40 accelerators, used for parallelization with OpenACC and Fortran.
Compilation succeeds: pgfortran -acc -Minfo -ta=nvidia -fast vecAdd.f90 -o vec.out, but when I try to execute ./vec.out I get this error message:

Current file: /home/instm/ALI/INSTMCOTRHD/IntissarP/exemples/vecAdd.f90
function: main
line: 23
This file was compiled: -ta=tesla:cc30,cc35,cc50,cc60,cc70

If I remove the -ta=nvidia option, compilation and execution work, but the execution time is longer than with gfortran (i.e. without using the accelerators).
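
For context, vecAdd.f90 is essentially the standard OpenACC vector-add example; a minimal sketch of it follows (the array size and variable names here are illustrative, not the exact source):

program vecadd
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real :: a(n), b(n), c(n)

  a = 1.0
  b = 2.0

  ! this loop is offloaded to the GPU when built with -acc -ta=...
  !$acc kernels
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$acc end kernels

  print *, 'c(1) =', c(1)
end program vecadd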

Can you please help me?
Thanks

Please post the output of
pgaccelinfo

Does it work after running
sudo nvidia-smi
once?

[root@localhost ~]# pgaccelinfo

CUDA Driver Version: 10010
No accelerators found.
Try pgaccelinfo -v for more information

[root@localhost ~]# sudo nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Ok, looks like the nvidia driver went missing.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

The requested report is attached.

nvidia-bug-report.log.gz (125 KB)

The driver is correctly installed but doesn't load. Unfortunately I can't see why, because the dmesg log is flooded with ACPI messages, so I can only guess.
Please reboot the server and check whether secure boot is enabled in the BIOS. If so, disable it. Right after the reboot, please create a new nvidia-bug-report.log.
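
If mokutil is installed, the secure boot state can also be checked from the running system before rebooting (just a shortcut; the BIOS setting is the authoritative place to change it):

mokutil --sb-state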

A new report is attached

nvidia-bug-report.log.gz (88 KB)

The driver is now loading but there’s a library mismatch:

/bin/nvidia-smi --query

Failed to initialize NVML: Driver/library version mismatch

This often happens when two different drivers are installed, one over another; a quick way to check for that is shown after the next command.
Please post the output of the CUDA deviceQuery demo:
sudo /pathtodemos/deviceQuery
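
To check for a double driver install, something like the following should show the installed driver packages and the kernel module that is actually loaded (the package query assumes an RPM-based install, which matches the yum usage further below; adjust for a runfile install):

rpm -qa | grep -i nvidia
cat /proc/driver/nvidia/version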

[root@localhost ~]# /pathtodemos/deviceQuery
-bash: /pathtodemos/deviceQuery: Aucun fichier ou dossier de ce type (No such file or directory)

Of course you have to replace “pathtodemos” with the real path to the cuda demos. It depends on the cuda toolkit version you installed. Might be /usr/local/cuda-10.1/extras/demo_suite/ or similar

cd /usr/local/cuda-10.1/extras/demo_suite/
[root@localhost demo_suite]# ll
total 84
-rw-r--r--. 1 root root 83682 24 Jun 17:38 nvidia-bug-report.log.gz

Try using locate to find it:
locate deviceQuery

Maybe also check which CUDA version is installed:
yum list installed "cuda*"

[root@localhost demo_suite]# locate deviceQuery
/home/instm/deviceQuery.cuf
/home/instm/deviceQuery.out
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/Makefile
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery/Makefile
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery/deviceQuery.out
/opt/pgi/linux86-64/2018/examples/OpenACC/SDK/src/deviceQuery
/opt/pgi/linux86-64/2018/examples/OpenACC/SDK/src/deviceQuery/Makefile
/opt/pgi/linux86-64/2018/examples/OpenACC/SDK/src/deviceQuery/deviceQuery.c
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/Makefile
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/SDK/deviceQuery
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/SDK/deviceQuery/Makefile
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/SDK/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64-llvm/2018/examples/OpenACC/SDK/src/deviceQuery
/opt/pgi/linux86-64-llvm/2018/examples/OpenACC/SDK/src/deviceQuery/Makefile
/opt/pgi/linux86-64-llvm/2018/examples/OpenACC/SDK/src/deviceQuery/deviceQuery.c
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/Makefile
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/NsightEclipse.xml
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery.cpp
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery.o
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/readme.txt
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/Makefile
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/NsightEclipse.xml
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/deviceQueryDrv.cpp
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/readme.txt
/usr/local/cuda-10.1/samples/bin/x86_64/linux/release/deviceQuery

[root@localhost demo_suite]# cd /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
[root@localhost deviceQuery]# ll
total 688
-rwxr-xr-x. 1 root root 647424 24 juin 16:47 deviceQuery
-rw-r--r--. 1 root root 12473 25 Apr 03:25 deviceQuery.cpp
-rw-r--r--. 1 root root 15248 24 Jun 13:38 deviceQuery.o
-rw-r--r--. 1 root root 10964 7 May 00:19 Makefile
-rw-r--r--. 1 root root 1789 25 Apr 03:25 NsightEclipse.xml
-rw-r--r--. 1 root root 168 25 Apr 03:25 readme.txt

Please post the output of running
sudo /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery

./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: “Tesla K40m”
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11441 MBytes (11996954624 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:

Device 1: “Tesla K40m”
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11441 MBytes (11996954624 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 135 / 0
  Compute Mode:

Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes
Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS

Looks good, so pgaccelinfo should now display the correct info and your application should work, but you should look into why nvidia-smi reports a library mismatch for libnvidia-ml.so.
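
As a side note on the original problem: the vec.out binary was built for cc30 through cc70 (see the error message at the top), which covers the K40m's compute capability 3.5 shown in the deviceQuery output, so with the driver loaded it should now find the devices at run time. If rebuilding, the target can also be given explicitly, for example:

pgfortran -acc -Minfo -ta=tesla:cc35 -fast vecAdd.f90 -o vec.out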

[root@localhost ~]# pgaccelinfo

CUDA Driver Version: 10010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.67 Sat Apr 6 03:07:24 CDT 2019

Device Number: 0
Device Name: Tesla K40m
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35

Device Number: 1
Device Name: Tesla K40m
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35
[root@localhost ~]# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Please post the output of
locate libnvidia-ml
to check which version you have installed.

locate libnvidia-ml

/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/lib/libnvidia-ml.so.418.40.04
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.418.40.04
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
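
Note the version visible here: the installed libnvidia-ml.so is 418.40.04, while the loaded kernel module reported by pgaccelinfo above is 418.67, which would explain the "Driver/library version mismatch" from nvidia-smi. A quick way to compare the two (assuming the driver is loaded) is:

cat /proc/driver/nvidia/version
ls -l /usr/lib64/libnvidia-ml.so*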