Runtime problem with PGFORTRAN

A driver installed from a .run installer was not cleanly uninstalled. Please remove these files:
sudo rm /usr/lib/libnvidia-ml.so* /usr/lib64/libnvidia-ml.so*
and then reinstall the correct libraries:
sudo yum reinstall nvidia-driver-NVML

Please find the attached image, which should make things clearer.

Ok, please try
cd /var/cuda-repo-10-1-local-10.1.168-418.67/
sudo rpm -ivh --force nvidia-driver-NVML-418.67-4.el7.x86_64.rpm

[root@localhost ~]# cd /var/cuda-repo-10-1-local-10.1.168-418.67/
[root@localhost cuda-repo-10-1-local-10.1.168-418.67]# sudo rpm -ivh --force nvidia-driver-NVML-418.67-4.el7.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:nvidia-driver-NVML-3:418.67-4.el7################################# [100%]
[root@localhost cuda-repo-10-1-local-10.1.168-418.67]#

Does nvidia-smi work now?

yes

[root@localhost cuda-repo-10-1-local-10.1.168-418.67]# nvidia-smi
Tue Sep 17 10:58:33 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:81:00.0 Off |                    0 |
| N/A   27C    P8    21W / 235W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8    22W / 235W |     17MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     11072      G   /usr/bin/X                                     9MiB |
|    1     14364      G   /usr/bin/gnome-shell                           6MiB |
+-----------------------------------------------------------------------------+

Everything looks fine now; your application should work.

Thank you. Yes, my application works correctly:
pgfortran -acc -ta=nvidia -fast -Minfo vecAdd.f90 -o vec.out
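
(For reference, a vecAdd.f90 used for this kind of test typically looks like the sketch below. The actual source was not posted, so the array size and variable names here are assumptions.)

program vecadd
   implicit none
   integer, parameter :: n = 1000000
   real, allocatable  :: a(:), b(:), c(:)
   integer :: i

   allocate(a(n), b(n), c(n))
   a = 1.0
   b = 2.0

   ! With -acc -ta=nvidia the compiler offloads this loop to the GPU
   !$acc kernels
   do i = 1, n
      c(i) = a(i) + b(i)
   end do
   !$acc end kernels

   print *, 'c(1) =', c(1), '  c(n) =', c(n)
   deallocate(a, b, c)
end program vecadd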

I'm getting back to you, sorry for taking so much time.
1 - When I ran nvidia-smi I didn't pay attention to Persistence-M: OFF (see attachment).
2 - Following a tutorial on the internet, I executed the code once with the !$acc directives and once without them, and I'm surprised: the execution time is the same.
Maybe the accelerators are not functional!

You compiled those with -ta=multicore, so CUDA and the GPUs are not used at all; everything is done on the CPU.
Persistence mode is deprecated; instead, nvidia-persistenced should be configured and started on boot. Please check whether it is installed:
sudo systemctl status nvidia-persistenced
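
If you want to double-check from inside a program which devices the OpenACC runtime actually sees, a minimal query like the sketch below should do (check_acc.f90 is just an illustrative name; it uses the standard OpenACC API module):

program check_acc
   use openacc
   implicit none
   integer :: ndev
   ! Ask the OpenACC runtime how many NVIDIA devices it can see
   ndev = acc_get_num_devices(acc_device_nvidia)
   print *, 'OpenACC NVIDIA devices visible:', ndev
end program check_acc

Compile it with something like pgfortran -acc -ta=tesla:cc35 check_acc.f90 -o check_acc; if it prints 0, the runtime is not seeing the GPUs.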

[instm@localhost step1]$ sudo systemctl status nvidia-persistenced
[sudo] Password for instm:
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[instm@localhost step1]$

I changed it to -ta=nvidia (see attachment).

Just run
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
to have it running.
Please post the output of
pgaccelinfo

[instm@localhost step1]$ sudo systemctl enable nvidia-persistenced
[sudo] Password for instm:
Created symlink from /etc/systemd/system/multi-user.target.wants/nvidia-persistenced.service to /usr/lib/systemd/system/nvidia-persistenced.service.
[instm@localhost step1]$
[instm@localhost step1]$
[instm@localhost step1]$ sudo systemctl start nvidia-persistenced
[instm@localhost step1]$
[instm@localhost step1]$
[instm@localhost step1]$ pgaccelinfo

CUDA Driver Version: 10010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.67 Sat Apr 6 03:07:24 CDT 2019

Device Number: 0
Device Name: Tesla K40m
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35

Device Number: 1
Device Name: Tesla K40m
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35
[instm@localhost step1]$

Everything looks correct and functional.
Do you get better results when omitting the -fast optimizations and using -Minfo=accel to get more information about the acceleration that is used? i.e.
pgfortran -acc -ta=tesla:cc35 -Minfo=accel laplace2d.f90 -o lpc
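
For reference, the accelerated part of laplace2d.f90 in the usual OpenACC Jacobi tutorial looks roughly like the sketch below (an assumption, since your source was not posted); the compiler messages about lines 75-92 refer to these two loop nests, and each !$acc kernels region is what triggers the implicit copyin/copyout messages:

program laplace
   implicit none
   integer, parameter :: n = 4096, m = 4096, iter_max = 1000
   real, parameter    :: tol = 1.0e-6
   real, allocatable  :: a(:,:), anew(:,:)
   real    :: error
   integer :: i, j, iter

   allocate(a(0:n-1,0:m-1), anew(0:n-1,0:m-1))
   a = 0.0
   a(0,:) = 1.0                 ! fixed boundary values on one edge
   anew  = a
   error = 1.0
   iter  = 0

   print '(a,i0,a,i0,a)', ' Jacobi relaxation Calculation: ', n, ' x ', m, ' mesh'

   do while (error > tol .and. iter < iter_max)
      error = 0.0

      ! First kernels region: Jacobi update plus max-norm reduction.
      ! This nest produces the implicit copyin(a)/copyout(anew) and
      ! reduction(max:error) messages from -Minfo=accel.
      !$acc kernels
      do j = 1, m-2
         do i = 1, n-2
            anew(i,j) = 0.25 * (a(i+1,j) + a(i-1,j) + a(i,j-1) + a(i,j+1))
            error = max(error, abs(anew(i,j) - a(i,j)))
         end do
      end do
      !$acc end kernels

      ! Second kernels region: copy the interior back into a.
      !$acc kernels
      do j = 1, m-2
         do i = 1, n-2
            a(i,j) = anew(i,j)
         end do
      end do
      !$acc end kernels

      if (mod(iter,100) == 0) print '(i5,f10.6)', iter, error
      iter = iter + 1
   end do
end program laplace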

With !$acc directives:

[instm@localhost step1]$ pgfortran -acc -fast -ta=tesla:cc35 -Minfo=accel laplace2d.f90 -o lpc
laplace:
75, Generating implicit copyout(anew(1:4094,1:4094))
Generating implicit copyin(a(0:4095,0:4095))
76, Loop is parallelizable
77, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
76, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
77, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
80, Generating implicit reduction(max:error)
90, Generating implicit copyin(anew(1:4094,1:4094))
Generating implicit copyout(a(1:4094,1:4094))
91, Loop is parallelizable
92, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
91, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
92, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
[instm@localhost step1]$ time ./lpc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
completed in 84.123 seconds

real 1m24.263s
user 1m17.431s
sys 0m6.841s
[instm@localhost step1]$

Without !$acc directives:

[instm@localhost step1]$ pgfortran -acc -fast -ta=tesla:cc35 -Minfo=accel laplace2d.f90 -o lpc
[instm@localhost step1]$ time ./lpc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
completed in 35.714 seconds

real 0m35.747s
user 0m35.699s
sys 0m0.041s
[instm@localhost step1]$

I told you to omit -fast; that flag enables CPU optimizations.

Without !$acc directives:

[instm@localhost step1]$ pgfortran -acc -ta=nvidia -Minfo=accel laplace2d.f90 -o lpc
[instm@localhost step1]$ time ./lpc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
completed in 53.059 seconds

real 0m53.094s
user 0m53.035s
sys 0m0.053s

With !$acc directives:

[instm@localhost step1]$ pgfortran -acc -ta=nvidia -Minfo=accel laplace2d.f90 -o lpc
laplace:
75, Generating implicit copyout(anew(1:4094,1:4094))
Generating implicit copyin(a(0:4095,0:4095))
76, Loop is parallelizable
77, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
76, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
77, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
80, Generating implicit reduction(max:error)
90, Generating implicit copyin(anew(1:4094,1:4094))
Generating implicit copyout(a(1:4094,1:4094))
91, Loop is parallelizable
92, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
91, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
92, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
[instm@localhost step1]$ time ./lpc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
completed in 84.195 seconds

real 1m24.346s
user 1m17.734s
sys 0m6.616s
[instm@localhost step1]$

Do you have an idea what the problem is, please?

No idea; maybe ask at the PGI forums. On the CUDA side, everything looks correct.

Maybe also ask here:
CUDA Programming and Performance - NVIDIA Developer Forums