Runtime problem with PGFORTRAN

We have an HPE server equipped with two NVIDIA K40 accelerators, used for parallelization with OpenACC and Fortran.
Compilation succeeds: pgfortran -acc -Minfo -ta=nvidia -fast vecAdd.f90 -o vec.out, but when I try to execute ./vec.out I get this error message:

Current file: /home/instm/ALI/INSTMCOTRHD/IntissarP/exemples/vecAdd.f90
function: main
line: 23
This file was compiled: -ta=tesla:cc30,cc35,cc50,cc60,cc70

If I remove the -ta=nvidia option, compilation and execution work, but the execution time is longer than with gfortran (i.e. without using the accelerators).
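
For context, vecAdd.f90 is essentially the standard OpenACC vector-add example; a minimal sketch of it follows (the array size and variable names here are illustrative, not the exact source):

program vecadd
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real :: a(n), b(n), c(n)

  a = 1.0
  b = 2.0

  ! this loop is offloaded to the GPU when built with -acc -ta=...
  !$acc kernels
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$acc end kernels

  print *, 'c(1) =', c(1)
end program vecadd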

Can you please help me?
Thanks

Please post the output of
pgaccelinfo

Does it work after running
sudo nvidia-smi
once?

[root@localhost ~]# pgaccelinfo

CUDA Driver Version: 10010
No accelerators found.
Try pgaccelinfo -v for more information

[root@localhost ~]# sudo nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Ok, looks like the nvidia driver went missing.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

The requested report is attached.

nvidia-bug-report.log.gz (125 KB)

The driver is correctly installed but doesn't load. Unfortunately I can't see why, because the dmesg log is flooded with ACPI messages, so I can only guess.
Please reboot the server and check whether secure boot is enabled in the BIOS. If so, disable it. Right after the reboot, please create a new nvidia-bug-report.log.
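
If mokutil is installed, the secure boot state can also be checked from the running system before rebooting (just a shortcut; the BIOS setting is the authoritative place to change it):

mokutil --sb-state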

A new report is attached

nvidia-bug-report.log.gz (88 KB)

The driver is now loading but there’s a library mismatch:

/bin/nvidia-smi --query

Failed to initialize NVML: Driver/library version mismatch

This often happens when two different drivers are installed, one over another; a quick way to check for that is shown after the next command.
Please post the output of the CUDA deviceQuery demo:
sudo /pathtodemos/deviceQuery
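
To check for a double driver install, something like the following should show the installed driver packages and the kernel module that is actually loaded (the package query assumes an RPM-based install, which matches the yum usage further below; adjust for a runfile install):

rpm -qa | grep -i nvidia
cat /proc/driver/nvidia/version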

[root@localhost ~]# /pathtodemos/deviceQuery
-bash: /pathtodemos/deviceQuery: Aucun fichier ou dossier de ce type (No such file or directory)

Of course you have to replace “pathtodemos” with the real path to the cuda demos. It depends on the cuda toolkit version you installed. Might be /usr/local/cuda-10.1/extras/demo_suite/ or similar

cd /usr/local/cuda-10.1/extras/demo_suite/
[root@localhost demo_suite]# ll
total 84
-rw-r--r--. 1 root root 83682 24 Jun 17:38 nvidia-bug-report.log.gz

Try using locate to find it:
locate deviceQuery

Maybe also check which CUDA version is installed:
yum list installed "cuda*"

[root@localhost demo_suite]# locate deviceQuery
/home/instm/deviceQuery.cuf
/home/instm/deviceQuery.out
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/Makefile
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery/Makefile
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64/2018/examples/CUDA-Fortran/SDK/deviceQuery/deviceQuery.out
/opt/pgi/linux86-64/2018/examples/OpenACC/SDK/src/deviceQuery
/opt/pgi/linux86-64/2018/examples/OpenACC/SDK/src/deviceQuery/Makefile
/opt/pgi/linux86-64/2018/examples/OpenACC/SDK/src/deviceQuery/deviceQuery.c
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/Makefile
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/CUDA-Fortran-Book/chapter1/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/SDK/deviceQuery
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/SDK/deviceQuery/Makefile
/opt/pgi/linux86-64-llvm/2018/examples/CUDA-Fortran/SDK/deviceQuery/deviceQuery.cuf
/opt/pgi/linux86-64-llvm/2018/examples/OpenACC/SDK/src/deviceQuery
/opt/pgi/linux86-64-llvm/2018/examples/OpenACC/SDK/src/deviceQuery/Makefile
/opt/pgi/linux86-64-llvm/2018/examples/OpenACC/SDK/src/deviceQuery/deviceQuery.c
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/Makefile
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/NsightEclipse.xml
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery.cpp
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery.o
/usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/readme.txt
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/Makefile
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/NsightEclipse.xml
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/deviceQueryDrv.cpp
/usr/local/cuda-10.1/samples/1_Utilities/deviceQueryDrv/readme.txt
/usr/local/cuda-10.1/samples/bin/x86_64/linux/release/deviceQuery

[root@localhost demo_suite]# cd /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery
[root@localhost deviceQuery]# ll
total 688
-rwxr-xr-x. 1 root root 647424 24 juin 16:47 deviceQuery
-rw-r--r--. 1 root root 12473 25 Apr 03:25 deviceQuery.cpp
-rw-r--r--. 1 root root 15248 24 Jun 13:38 deviceQuery.o
-rw-r--r--. 1 root root 10964 7 May 00:19 Makefile
-rw-r--r--. 1 root root 1789 25 Apr 03:25 NsightEclipse.xml
-rw-r--r--. 1 root root 168 25 Apr 03:25 readme.txt

Please post the output of running
sudo /usr/local/cuda-10.1/samples/1_Utilities/deviceQuery/deviceQuery

./deviceQuery
./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: “Tesla K40m”
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11441 MBytes (11996954624 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:

Device 1: “Tesla K40m”
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11441 MBytes (11996954624 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 135 / 0
  Compute Mode:

Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes
Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 2
Result = PASS

Looks good, so pgaccelinfo should now display the correct info and your application should work, but you should look into why nvidia-smi reports a library mismatch for libnvidia-ml.so.
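
As a side note on the original problem: the vec.out binary was built for cc30 through cc70 (see the error message at the top), which covers the K40m's compute capability 3.5 shown in the deviceQuery output, so with the driver loaded it should now find the devices at run time. If rebuilding, the target can also be given explicitly, for example:

pgfortran -acc -Minfo -ta=tesla:cc35 -fast vecAdd.f90 -o vec.out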

[root@localhost ~]# pgaccelinfo

CUDA Driver Version: 10010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.67 Sat Apr 6 03:07:24 CDT 2019

Device Number: 0
Device Name: Tesla K40m
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35

Device Number: 1
Device Name: Tesla K40m
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35
[root@localhost ~]# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Please post the output of
locate libnvidia-ml
to check which version you have installed.

locate libnvidia-ml

/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/lib/libnvidia-ml.so.418.40.04
/usr/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.418.40.04
/usr/local/cuda-10.1/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
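
Note the version visible here: the installed libnvidia-ml.so is 418.40.04, while the loaded kernel module reported by pgaccelinfo above is 418.67, which would explain the "Driver/library version mismatch" from nvidia-smi. A quick way to compare the two (assuming the driver is loaded) is:

cat /proc/driver/nvidia/version
ls -l /usr/lib64/libnvidia-ml.so*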