Problem: Fortran code with OpenACC doesn't gain any speedup

I am new to OpenACC and have encountered a puzzling problem. I was learning OpenACC using a code example from the PGI Accelerator™ Compilers OpenACC Getting Started Guide at: PGI Compilers & Tools.
The code is on pages 13 and 14 of the document.

I used PGI Visual Fortran with the Visual Studio shell, and compiled the code both with “Enable Open ACC Directives” turned on and with no ACC directives. However, after running both versions, to my surprise the run times were almost identical (I changed n to a much larger value, 500000000, to provide a sufficient workload). In other words, OpenACC didn’t make the program run faster in this case.

The accelerator of my computer is: NVIDIA Quadro K2000M. It is a Win 7 Quad Core laptop.

Any advice and suggestions will be highly appreciated.

Hi rzou1,

Assuming that you’re running the vecadd example, when using n=500000000 the total memory usage will be ~5.5GB. Since your card only has 2GB, I would expect the run to get an out-of-memory error if it ran on the device. Does it seem to work if you change n back to its original value?
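As a rough check on that figure, here is a minimal sketch of the arithmetic. It assumes three arrays are held on the device and that each element is a 4-byte real; both the element size and the helper name `device_bytes` are assumptions for illustration, not taken from the guide.

```python
# Rough device-memory estimate for a vector-add style kernel.
# Assumptions (not from the guide): three arrays live on the device
# and each element is a 4-byte real.

GIB = 1024 ** 3


def device_bytes(n, num_arrays=3, bytes_per_element=4):
    """Total bytes needed to hold num_arrays arrays of n elements each."""
    return n * num_arrays * bytes_per_element


# The large n from the question, and a ten-times smaller n for comparison.
for n in (500_000_000, 50_000_000):
    total = device_bytes(n)
    print(f"n = {n:>11,d}: {total / GIB:.2f} GiB")
```

Under those assumptions the large case needs roughly 5.6 GiB, well beyond a 2GB card, while the smaller case needs under 1 GiB.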

If not, then you may not have the appropriate flags set to generate the OpenACC code, or your device may not be configured, in which case the code runs on the host (by default both a host and a GPU version of the code are created).

What does your build log look like? What is the output of the “pgaccelinfo” utility (run from a PGI DOS command shell)?

  • Mat

Hi Mat, thanks for your quick response. Here is how I did the compiling in the IDE of the Visual Studio:

  1. Properties–>Fortran–>Preprocess–>Preprocess source file
  2. Properties–>Fortran–>Language–>Enable Open ACC Directives
  3. Properties–>Fortran–>Target Accelerators–>Target NVIDIA TESLA
  4. Properties–>Fortran–>Target Host

I reduced n to 50000000, and the code takes 2.4 seconds to run.

Then I removed the “target accelerators” setting, i.e., compiled with only 1), 2), and 4) above, and the code takes about 2.4 seconds to run.

I then turned off the Open ACC directives, i.e., compiled with only 1) above, and the code still takes about 2.4 seconds to run.

BTW, when I enable all of 1) to 4), the compiler output is:
vecaddgpu:

17, Generating copyin(a(:n))
Generating copyin(b(:n))
Generating copyout(r(:n))
Generating NVIDIA code
18, Loop is parallelizable
Accelerator kernel generated
18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Linking…
vector_add build succeeded.

Thanks a lot!

Hi Mat, here is the build log:


Compiling Project …

…..\vector_addition_from_the_startup_guide.f90

c:\program files\pgi\win64\14.1\bin\pgfortran.exe -Hx,123,8 -Hx,123,0x40000 -Hx,0,0x40000000 -Mx,0,0x40000000 -Hx,0,0x20000000 -Mpreprocess -g -Bstatic -Mbackslash -acc -Mfree -I"c:\program files\pgi\win64\14.1\include" -I"C:\Program Files\PGI\Microsoft Open Tools 12\include" -I"C:\Program Files (x86)\Windows Kits\8.1\Include\shared" -I"C:\Program Files (x86)\Windows Kits\8.1\Include\um" -ta=tesla,host -Minform=warn -module “x64\Debug” -Minfo=accel -o “x64\Debug\vector_addition_from_the_startup_guide.obj” -c “C:\Allstuff\EFDC\CUDA_open_ACC_related\Open_ACC_examples\Vector_addition\vector_addition_from_the_startup_guide.f90”

Command exit code: 0

Command output: [NOTE: your trial license will expire in 12 days, 6.56 hours. NOTE: your trial license will expire in 12 days, 6.56 hours. vecaddgpu: 17, Generating copyin(a(:n)) Generating copyin(b(:n)) Generating copyout(r(:n)) Generating NVIDIA code 18, Loop is parallelizable Accelerator kernel generated 18, !$acc loop gang, vector(128) ! blockidx%x threadidx%x ]

Linking…

c:\program files\pgi\win64\14.1\bin\pgfortran.exe -Wl,/libpath:“c:\program files\pgi\win64\14.1\lib” -Wl,/libpath:“C:\Program Files\PGI\Microsoft Open Tools 12\lib\amd64” -Wl,/libpath:“C:\Program Files (x86)\Windows Kits\8.1\Lib\winv6.3\um\x64” -Yl,“C:\Program Files\PGI\Microsoft Open Tools 12\bin\amd64” -g -Bstatic -acc -ta=tesla,host -o “C:\Allstuff\EFDC\CUDA_open_ACC_related\Open_ACC_examples\Vector_addition\vector_add\vector_add\x64\Debug\vector_add.exe” “x64\Debug\vector_addition_from_the_startup_guide.obj”

Command exit code: 0

vector_add build succeeded.

It looks like it’s building fine. Try adding “-ta=tesla:time” to the “Command Line” options in the property pages, or set the environment variable “PGI_ACC_TIME=1” in the DOS cmd window and run your program from there. The program should then print out profiling information if it ran on the GPU.

One thing to keep in mind is that vecadd is a very simple example with very little computation. Hence, you may not see much speedup. Instead, you might want to try the Matmul example in: C:\Program Files (x86)\Microsoft Visual Studio 11.0\PGI Visual Fortran\Samples\gpu\AccelPM_Matmul

Note that your path to the Matmul example may be different. Also, this example uses the PGI Accelerator Model, which is the precursor to OpenACC’s “kernels” construct.
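To put a number on "very little computation", here is a rough sketch comparing floating-point operations per byte moved between host and device for the two examples. It assumes 4-byte reals, a naive matrix multiply of 2n³ flops, and that each array crosses the bus exactly once; the function names and the 1024 x 1024 size are hypothetical choices for illustration.

```python
# Flops per byte transferred, assuming 4-byte reals and that every
# array is copied across the bus exactly once (both are assumptions).


def vecadd_intensity(n, bytes_per_element=4):
    """Vector add: one addition per element, three arrays of n elements moved."""
    flops = n
    bytes_moved = 3 * n * bytes_per_element  # two inputs in, one result out
    return flops / bytes_moved


def matmul_intensity(n, bytes_per_element=4):
    """Naive n x n matmul: 2*n**3 flops, three n x n matrices moved."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_element  # two inputs in, one result out
    return flops / bytes_moved


print(f"vecadd: {vecadd_intensity(50_000_000):.3f} flops/byte")
print(f"matmul: {matmul_intensity(1024):.1f} flops/byte")
```

The vector add does a fixed 1/12 flop per byte no matter how large n gets, so the data transfer dominates, while the matmul ratio grows linearly with n.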

  • Mat

Hi Mat,
I tried “-ta=tesla:time”, but it still doesn’t produce any speedup for this case. I tried to profile as you advised, but got an error that “libaccprof.dll” was not found. I copied the .dll to the folder containing the executable, but it still doesn’t work.
I then turned to the AccelPM_Matmul case you suggested. For this case, I compiled with the following options:

  1. no ACC
  2. enable ACC, but no target accelerator
  3. enable ACC and target accelerator to NVIDIA Tesla

Also, I set the size to 2048 (from 1024) to make the problem large enough. The time taken for 1) is about 8 seconds, and for both 2) and 3) it is 6.2 seconds, which is about a 30% speedup. However, when running the executables built with 2) and 3), I frequently encounter this error: cudastreamsynchronize returned error 702: launch timeout

When this error occurs, the computer’s display blanks out with the message: “Display driver stop working…”. Does this always happen when using the GPU for computation? Is there any way to avoid this problem?

This problem occurs about 50% of the time when the problem size is 2048, but when I increase the size to 3072, it occurs 100% of the time.

Thanks very much!

The Windows display driver model (WDDM) will time out all but the shortest of jobs. Your options are to increase the timeout value (in the registry) or, since you have a Quadro, use the nvidia-smi utility to switch to Tesla Compute Cluster (TCC) mode.
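As a rough illustration of why the larger size fails every time, here is a small sketch of how the work grows with the matrix size. It assumes a naive matmul whose kernel time scales with the 2n³ flop count; the sample may split the work differently, so treat the ratio as an estimate only. The helper name `work_ratio` is hypothetical.

```python
# Relative work of a naive n x n matmul at two sizes.
# Assumption: kernel time is proportional to the 2*n**3 flop count.


def work_ratio(n_new, n_old):
    """How many times more flops the n_new case needs than the n_old case."""
    return (n_new / n_old) ** 3


ratio = work_ratio(3072, 2048)
print(f"3072 vs 2048: {ratio:.3f}x the work")
```

If the 2048 case is already close to the driver's timeout limit, a kernel that takes roughly three times as long will exceed it on every run.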

  • Mat

Thanks Mat. I made the computation more intensive, and now I can see the speedup from ACC.
BTW, I have a question about accelerator profiling. I noticed that code compiled with PGI Fortran doesn’t behave consistently with regard to accelerator profiling, i.e., sometimes it generates accelerator profiling information, but sometimes it generates nothing about the accelerator’s performance.
I used the Visual Fortran IDE, and enabled accelerator profiling at “Properties–>Debug–>accelerator profiling (yes)”.

Truly appreciate your help.

i.e., sometimes it generates accelerator profiling information, but sometimes it generates nothing about the accelerator’s performance.

Is there anything different between when the profile is printed versus when it’s not? It’s odd behavior that I don’t have an answer for.

“Properties–>Debug–>accelerator profiling” sets the environment variable PGI_ACC_TIME=1. So it could be that the variable is not getting set in some cases, though I don’t know why. Another possibility is that the code is not running on the GPU (by default a unified binary is created which contains both a host and a GPU version of the code, i.e. -ta=host,tesla). Finally, maybe the output is getting lost somehow?

One thing to try is setting “-ta=tesla:time” (so profiling is always on). This will also remove the host version.
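To rule out the IDE failing to set the variable, or the output getting lost, one option is to launch the executable from a small script that sets PGI_ACC_TIME itself and captures both output streams. This is only a sketch: the helper name `run_with_acc_time` and the executable path in the comment are hypothetical, and I'm not certain which stream the profile is written to, which is why both are captured.

```python
import os
import subprocess


def run_with_acc_time(cmd):
    """Run cmd with PGI_ACC_TIME=1 in its environment, capturing stdout and stderr."""
    env = dict(os.environ, PGI_ACC_TIME="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Hypothetical usage; adjust the path to wherever your executable lives:
# result = run_with_acc_time([r"x64\Debug\vector_add.exe"])
# print(result.stdout)
# print(result.stderr)
```

If the profile shows up reliably this way but not from the IDE, that points at the environment variable rather than at the program.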

  • Mat