Problem: Fortran code with OpenACC doesn't gain any speedup

rzou1 · February 6, 2014, 4:52pm

I am new to OpenACC and have run into a puzzling problem. I was learning OpenACC using a code from the PGI Accelerator™ Compilers OpenACC Getting Started Guide at: PGI Compilers & Tools.
The code is on pages 13 and 14 of the document.

I used PGI Visual Fortran with the Visual Studio shell, and compiled the code both with “enable open ACC directive” and with no ACC directive. However, after running both versions, to my surprise the run time of the two versions is almost identical (I changed n to a much larger value of 500000000 to provide a sufficient workload). In other words, OpenACC didn’t make the program run faster in this case.

The accelerator of my computer is: NVIDIA Quadro K2000M. It is a Win 7 Quad Core laptop.

Any advice and suggestions will be highly appreciated.

MatColgrove · February 6, 2014, 9:39pm

Hi rzou1,

Assuming that you’re running the vecadd example, when using n=500000000 the total memory usage will be ~5.5GB. Since your card only has 2GB, I would expect the run to get an out-of-memory error if it ran on the device. Does it seem to work if you change n back to its original value?

If not, then you may not have the appropriate flags set to generate the OpenACC code, or your device may not be configured, so the code is run on the host (by default both a host and a GPU version of the code are created).
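The memory estimate above can be reproduced with a short calculation. This is only a sketch: it assumes the device holds three default-real (4-byte) arrays (a, b and r, as in the compiler's copyin/copyout messages), and the module and function names are made up for illustration.

```fortran
module footprint
  use iso_fortran_env, only: int64
  implicit none
contains
  ! Bytes needed for `narrays` arrays of `n` elements, `elem_bytes` bytes each.
  ! (Hypothetical helper, not part of the Getting Started Guide.)
  pure function device_bytes(n, narrays, elem_bytes) result(total)
    integer(int64), intent(in) :: n
    integer, intent(in) :: narrays, elem_bytes
    integer(int64) :: total
    total = n * int(narrays, int64) * int(elem_bytes, int64)
  end function device_bytes
end module footprint

program estimate_footprint
  use iso_fortran_env, only: int64
  use footprint
  implicit none
  integer(int64), parameter :: n = 500000000_int64
  integer(int64) :: total

  ! Assumption: three 4-byte real arrays (a, b, r) are resident on the device.
  total = device_bytes(n, 3, 4)
  print '(a,i0,a,f5.2,a)', 'device arrays: ', total, ' bytes (', &
        real(total) / 2.0**30, ' GiB)'
end program estimate_footprint
```

For n = 500000000 this gives 6,000,000,000 bytes, or roughly 5.6 GiB, well beyond a 2GB card. With n = 50000000 the same arithmetic gives about 0.56 GiB, which should fit.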
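One way to see whether the runtime can find a GPU at all is to ask the OpenACC API directly. The sketch below uses `acc_get_num_devices` with the `acc_device_nvidia` device type from the `openacc` module; the module and function names wrapped around it are hypothetical, and it has to be built with OpenACC enabled (for example with `-acc`).

```fortran
module device_check
  use openacc
  implicit none
contains
  ! Number of NVIDIA devices the OpenACC runtime can see.
  ! (Hypothetical wrapper around the standard API call.)
  function nvidia_device_count() result(ndev)
    integer :: ndev
    ndev = acc_get_num_devices(acc_device_nvidia)
  end function nvidia_device_count
end module device_check

program report_device
  use device_check
  implicit none
  integer :: ndev

  ndev = nvidia_device_count()
  print '(a,i0)', 'NVIDIA devices visible to the OpenACC runtime: ', ndev
  if (ndev == 0) then
     ! With a host+GPU binary, the host version of the code would be used.
     print '(a)', 'No device found; expect the host version to run.'
  end if
end program report_device
```

If this reports zero devices, a binary that contains both host and GPU code would most likely run the host version, which would explain identical timings.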

What does your build log look like? What is the output of the “pgaccelinfo” utility (run from a PGI DOS command shell)?

Mat

rzou1 · February 6, 2014, 10:19pm

Hi Mat, thanks for your quick response. Here is how I did the compiling in the IDE of the Visual Studio:

1) Properties–>Fortran–>Preprocess–>Preprocess source file
2) Properties–>Fortran–>Language–>Enable Open ACC Directives
3) Properties–>Fortran–>Target Accelerators–>Target NVIDIA TESLA
4) Properties–>Fortran–>Target Host

I reduced n to n = 50000000, and the code takes 2.4 seconds to run.

Then I removed the “target accelerators” option, i.e., compiled with only 1), 2), and 4) above, and the code takes about 2.4 seconds to run.

I then turned off the Open ACC directive, i.e., compiled with only 1) above, and the code again takes about 2.4 seconds to run.

BTW, when I activate all of 1) to 4), the compiler output is:
vecaddgpu:

17, Generating copyin(a(:n))
Generating copyin(b(:n))
Generating copyout(r(:n))
Generating NVIDIA code
18, Loop is parallelizable
Accelerator kernel generated
18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Linking…
vector_add build succeeded.

Thanks a lot!

rzou1 · February 6, 2014, 10:30pm

Hi Mat, the Build.log looks like this:

Compiling Project …

…..\vector_addition_from_the_startup_guide.f90

c:\program files\pgi\win64\14.1\bin\pgfortran.exe -Hx,123,8 -Hx,123,0x40000 -Hx,0,0x40000000 -Mx,0,0x40000000 -Hx,0,0x20000000 -Mpreprocess -g -Bstatic -Mbackslash -acc -Mfree -I"c:\program files\pgi\win64\14.1\include" -I"C:\Program Files\PGI\Microsoft Open Tools 12\include" -I"C:\Program Files (x86)\Windows Kits\8.1\Include\shared" -I"C:\Program Files (x86)\Windows Kits\8.1\Include\um" -ta=tesla,host -Minform=warn -module “x64\Debug” -Minfo=accel -o “x64\Debug\vector_addition_from_the_startup_guide.obj” -c “C:\Allstuff\EFDC\CUDA_open_ACC_related\Open_ACC_examples\Vector_addition\vector_addition_from_the_startup_guide.f90”

Command exit code: 0

Command output: [NOTE: your trial license will expire in 12 days, 6.56 hours. NOTE: your trial license will expire in 12 days, 6.56 hours. vecaddgpu: 17, Generating copyin(a(:n)) Generating copyin(b(:n)) Generating copyout(r(:n)) Generating NVIDIA code 18, Loop is parallelizable Accelerator kernel generated 18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x ]

Linking…

c:\program files\pgi\win64\14.1\bin\pgfortran.exe -Wl,/libpath:“c:\program files\pgi\win64\14.1\lib” -Wl,/libpath:“C:\Program Files\PGI\Microsoft Open Tools 12\lib\amd64” -Wl,/libpath:“C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x64” -Yl,“C:\Program Files\PGI\Microsoft Open Tools 12\bin\amd64” -g -Bstatic -acc -ta=tesla,host -o “C:\Allstuff\EFDC\CUDA_open_ACC_related\Open_ACC_examples\Vector_addition\vector_add\vector_add\x64\Debug\vector_add.exe” “x64\Debug\vector_addition_from_the_startup_guide.obj”

Command exit code: 0

vector_add build succeeded.

MatColgrove · February 6, 2014, 10:57pm

It looks like it’s building fine. Try adding “-ta=tesla:time” to the “Command Line” options in the property pages, or set the environment variable “PGI_ACC_TIME=1” in the DOS cmd window and run your program from there. The program should then print out profiling information if it ran on the GPU.

One thing to keep in mind is that vecadd is a very simple example with very little computation. Hence, you may not see much speed-up. Instead, you might want to try the Matmul example in: C:\Program Files (x86)\Microsoft Visual Studio 11.0\PGI Visual Fortran\Samples\gpu\AccelPM_Matmul

Note that your path to the Matmul example may be different. Also, this example uses the PGI Accelerator Model, which is the precursor to OpenACC’s “kernels” construct.
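For comparison with vecadd, here is a rough sketch of a matrix multiply written with the OpenACC “kernels” construct. It is not the AccelPM_Matmul sample itself: the routine name, the matrix size and the data clauses are assumptions chosen for illustration. Each output element needs n multiply-adds, so there is far more computation per byte transferred than in a vector add.

```fortran
module matmul_sketch
  implicit none
contains
  ! Hypothetical routine: c = a * b for n-by-n matrices using OpenACC kernels.
  subroutine matmul_acc(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in) :: a(n,n), b(n,n)
    real, intent(out) :: c(n,n)
    integer :: i, j, k
    real :: s

    ! Copy inputs to the device, compute there, copy the result back.
    !$acc kernels copyin(a, b) copyout(c)
    do j = 1, n
       do i = 1, n
          s = 0.0
          do k = 1, n
             s = s + a(i,k) * b(k,j)
          end do
          c(i,j) = s
       end do
    end do
    !$acc end kernels
  end subroutine matmul_acc
end module matmul_sketch

program time_matmul
  use matmul_sketch
  implicit none
  integer, parameter :: n = 512     ! assumed size; the thread used 1024 and 2048
  real, allocatable :: a(:,:), b(:,:), c(:,:)
  integer(kind=8) :: t0, t1, rate

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0
  b = 2.0

  call system_clock(t0, rate)
  call matmul_acc(a, b, c, n)
  call system_clock(t1)

  print '(a,f8.3,a)', 'elapsed: ', real(t1 - t0) / real(rate), ' s'
  print '(a,f10.1)', 'c(1,1) = ', c(1,1)   ! every entry should equal 2*n
end program time_matmul
```

Without OpenACC enabled, the directives are treated as comments and the same source runs on the host, which makes it easy to time both builds of the same file.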

Mat

rzou1 · February 7, 2014, 1:50am

Hi Mat,
I tried “-ta=tesla:time”, but it still doesn’t produce any speedup for this case. I tried to profile as you advised, but got an error that “libaccprof.dll” was not found. I copied the .dll to the folder where I have the executable, but it still doesn’t work.
I then turned to the AccelPM_Matmul case you suggested. For this case, I compiled with the following options:

1) no ACC
2) enable ACC, but no target accelerator
3) enable ACC and target accelerator set to NVIDIA Tesla

Also, I set the size to 2048 (from 1024) to make the problem large enough. The time taken for 1) is about 8 seconds, and for both 2) and 3) it is 6.2 seconds, which is a speedup of about 30%. However, when I run the executables built with 2) and 3), I frequently encounter the error: cudastreamsynchronize returned error 702: launch timeout

When this error occurs, the computer's display blanks out, with the message: “Display driver stop working…”. Does this always happen when the GPU is used for computation? Is there any way to avoid this problem?

This problem occurs about 50% of the time when the size of the problem is 2048, but when I increase the size to 3072, it occurs 100% of the time.

Thanks very much!

MatColgrove · February 7, 2014, 10:59pm

The Windows Display Driver Model (WDDM) will time out all but the shortest of jobs. Your options are to increase the timeout value (in the registry) or, since you have a Quadro, use the nvidia-smi utility to switch to Tesla Compute Cluster (TCC) mode.

Mat

rzou1 · February 12, 2014, 3:52am

Thanks Mat. I made the computation more intensive, and now I can see the speedup from ACC.
BTW, I have a question about accelerator profiling. I noticed that code compiled with PGI Fortran doesn’t behave consistently with regard to accelerator profiling, i.e., sometimes it generates accelerator profiling information, but sometimes it generates nothing about the accelerator’s performance.
I used the Visual Fortran IDE, and enabled accelerator profiling at “Properties–>Debug–>accelerator profiling (yes)”.

I truly appreciate your help.

MatColgrove · February 12, 2014, 4:39pm

i.e., sometimes it can generate accelerator profiling information, but sometimes it just generate nothing about the accelerator’s performance.

Is there anything different between when the profile is printed versus when it’s not? It’s odd behavior that I don’t have an answer for.

“Properties–>Debug–>accelerator profiling” sets the environment variable PGI_ACC_TIME=1. So it could be not getting set in some cases, though I don’t know why. Another possibility is that it’s not running on GPU (by default a unified binary is created which will create a host and GPU version of the code, i.e. -ta=host,tesla). Finally, maybe the output is getting lost somehow?

One thing to try is setting “-ta=tesla:time” (so profiling is always on). This will also remove the host version.

Mat
