Hi,
I am currently working through Tutorial 2 from this site: http://www.pgroup.com/resources/articles.htm (PGI Accelerator tutorial examples).
It is interesting to see how the various accelerator options can reduce the execution time.
But I was surprised to find that when I build the same program without the accelerator option (that is to say, without -ta=nvidia), the execution time is much shorter.
See for yourself:
Platform: Linux CentOS 5.5 x86_64
Host processor: Intel Xeon E5420 2.5 GHz
GPUs: NVIDIA Quadro FX 1700 + NVIDIA Quadro FX 1700
$ cat /...l/pgi/linux86-64/10.5/bin/sitenvrc
#!/bin/sh
export NVOPEN64DIR=/.../Nvidia/cuda/3.0/open64/lib;
export CUDADIR=/.../Nvidia/cuda/3.0/bin;
export CUDALIB=/.../Nvidia/cuda/3.0/lib;
and
$ cat /.../Nvidia/cuda/3.0/Env_cuda.sh
export PATH=${PATH}:/.../Nvidia/cuda/3.0/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/.../Nvidia/cuda/3.0/lib64:/appl/Nvidia/cuda/3.0/lib
$ make clean
rm -f a.out *.exe *.o *.obj *.gpu *.bin *.ptx *.s *.mod *.g *.emu *.time *.uni
$ make J1.exe
pgfortran -ta=nvidia -fast -c Jmain.f90 -Minfo=accel
pgfortran -ta=nvidia -fast -c J1.f90 -Minfo=accel
jacobi:
18, Generating copyin(a(1:m,1:n))
Generating copyout(a(2:m-1,2:n-1))
Generating copyout(newa(2:m-1,2:n-1))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
19, Loop is parallelizable
20, Loop is parallelizable
Accelerator kernel generated
19, !$acc do parallel, vector(16)
20, !$acc do parallel, vector(16)
Cached references to size [18x18] block of 'a'
CC 1.0 : 17 registers; 1328 shared, 132 constant, 0 local memory bytes; 33% occupancy
CC 1.3 : 17 registers; 1328 shared, 132 constant, 0 local memory bytes; 75% occupancy
27, Loop is parallelizable
Accelerator kernel generated
24, Max reduction generated for change
27, !$acc do parallel, vector(16)
CC 1.0 : 9 registers; 24 shared, 116 constant, 0 local memory bytes; 100% occupancy
CC 1.3 : 9 registers; 24 shared, 116 constant, 0 local memory bytes; 100% occupancy
pgfortran -o J1.exe -ta=nvidia Jmain.o J1.o
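For readers who do not have the tutorial sources at hand, the accelerated region in J1.f90 looks roughly like the simplified sketch below (the exact stencil weights and variable names may differ; this is just to show what the -Minfo messages above refer to):

subroutine jacobi( a, newa, m, n, change )
  integer :: m, n, i, j
  real, dimension(m,n) :: a, newa
  real :: change
  change = 0.0
  !$acc region                 ! compiler generates copyin(a) and copyout(a, newa) here
  do j = 2, n-1                ! both loops are mapped to parallel, vector(16) kernels
    do i = 2, m-1
      newa(i,j) = 0.25 * ( a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1) )
      change = max( change, abs( newa(i,j) - a(i,j) ) )   ! max reduction for change
    enddo
  enddo
  a(2:m-1,2:n-1) = newa(2:m-1,2:n-1)   ! second kernel: copy the result back into a
  !$acc end region
end subroutine jacobi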
$ ./J1.exe 500
reached delta= 0.09991 in 1624 iterations for 500 x 500 array
time=12.9260 seconds
$ ./J1.exe 1000
reached delta= 0.09998 in 3347 iterations for 1000 x 1000 array
time=91.2280 seconds
============================================================
$ make clean
rm -f a.out *.exe *.o *.obj *.gpu *.bin *.ptx *.s *.mod *.g *.emu *.time *.uni
$ pgfortran -c J1.f90
$ pgfortran -c Jmain.f90
$ pgfortran -o J1.exe Jmain.o J1.o
$ ./J1.exe 500
reached delta= 0.09995 in 1624 iterations for 500 x 500 array
time= 5.8620 seconds
$ ./J1.exe 1000
reached delta= 0.09998 in 3347 iterations for 1000 x 1000 array
time=49.7940 seconds
Conclusion:
500 x 500: 12.9 seconds with -ta=nvidia vs. 5.9 seconds without
1000 x 1000: 91.2 seconds with -ta=nvidia vs. 49.8 seconds without
So my obvious question is:
why does the performance move in the wrong direction when I enable the accelerator?
I suppose I am doing something wrong, but I don't know what.
I have also tried running larger problems, but I quickly reach the memory limit of my graphics cards.
Thank you for answering.
Have a nice day.