I am new to OpenACC and have encountered a puzzling problem. I was learning OpenACC from a code example in the PGI Accelerator Compilers OpenACC Getting Started Guide, available at PGI Compilers & Tools.
The code is on pages 13 and 14 of the document.
I used PGI Visual Fortran with the Visual Studio shell, and compiled the code both with “enable OpenACC directives” and without. However, after running both versions, to my surprise the time used was almost identical (I changed n to a much larger value, 500000000, to provide a sufficient workload). In other words, OpenACC didn’t make the program run faster in this case.
The accelerator in my computer is an NVIDIA Quadro K2000M. It is a Windows 7 quad-core laptop.
Any advice and suggestions will be highly appreciated.
Assuming that you’re running the vecadd example, with n=500000000 the total memory usage will be ~5.5GB. Since your card only has 2GB, I would expect the run to get an out-of-memory error if it ran on the device. Does it seem to work if you change n back to its original value?
If not, then you may not have the appropriate flags set to generate the OpenACC code, or your device may not be configured, so the code is run on the host (by default both a host and a GPU version of the code are created).
What does your build log look like? What is the output of the “pgaccelinfo” utility (run from a PGI DOS command shell)?
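For what it’s worth, the ~5.5GB figure can be sanity-checked with some quick arithmetic (assuming vecadd keeps three single-precision arrays of n elements; adjust the array count if the example actually uses more):

```python
# Back-of-the-envelope memory estimate for vecadd at n = 500000000.
# Assumption: three real(4) arrays (e.g. two inputs and one result).
n = 500_000_000
bytes_per_element = 4   # single precision
num_arrays = 3          # assumed; adjust to match the actual example
total_bytes = n * bytes_per_element * num_arrays
print(f"{total_bytes / 2**30:.2f} GiB")  # 5.59 GiB, well over the K2000M's 2GB
```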
It looks like it’s building fine. Try adding “-ta=tesla:time” to the “Command Line” options in the property pages, or set the environment variable “PGI_ACC_TIME=1” in the DOS cmd window and run your program from there. The program should then print out profiling information if it ran on the GPU.
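For example, from a PGI DOS command shell (the executable name here is just a placeholder for your build output):

```
rem Enable OpenACC profiling output for this shell session
set PGI_ACC_TIME=1
rem Run your program; timing data is printed at exit if kernels ran on the GPU
vecadd.exe
```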
One thing to keep in mind is that vecadd is a very simple example with very little computation, hence you may not see much speed-up. Instead, you might want to try the Matmul example in: C:\Program Files (x86)\Microsoft Visual Studio 11.0\PGI Visual Fortran\Samples\gpu\AccelPM_Matmul
Note that your path to the Matmul example may be different. Also, this example uses the PGI Accelerator Model, which is the precursor to OpenACC’s “kernels” construct.
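For reference, the OpenACC version of a simple vector add looks roughly like this (a minimal sketch, not the guide’s exact listing):

```fortran
program vecadd
  implicit none
  integer, parameter :: n = 100000
  real, dimension(n) :: a, b, r
  integer :: i

  a = 1.0
  b = 2.0

  ! The kernels construct asks the compiler to generate GPU code for the
  ! enclosed loop; data movement is handled automatically by the compiler.
  !$acc kernels
  do i = 1, n
     r(i) = a(i) + b(i)
  end do
  !$acc end kernels

  print *, 'r(1) = ', r(1)
end program vecadd
```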
Hi Mat,
I tried “-ta=tesla:time”, but it still doesn’t produce a speed-up for this case. I tried to profile as you advised, but got an error that “libaccprof.dll” was not found. I copied the .dll to the folder where I have the executable, but it still doesn’t work.
I then turned to test the AccelPM_Matmul case you suggested. For this case, I compiled with the following options:
1) no ACC
2) enable ACC, but no target accelerator
3) enable ACC and target accelerator set to NVIDIA Tesla
Also, I increased the size from 1024 to 2048 to make the problem large enough. The time taken for 1) is about 8 seconds, and for both 2) and 3) it is 6.2 seconds, which shows about a 30% speedup. However, when running the executables from 2) and 3), I frequently encounter the problem: cudaStreamSynchronize returned error 702: launch timeout
And when this error occurs, the computer’s display blanks out with the message: “Display driver stopped working…”. I wonder, does this always happen when using the GPU for computation? Is there any way to avoid this problem?
The problem occurs about 50% of the time when the problem size is 2048, but when I increase the size to 3072, it occurs 100% of the time.
The Windows display driver model (WDDM) will time out all but the shortest of jobs. Your options are to increase the timeout value (in the registry) or, since you have a Quadro, use the nvidia-smi utility to switch to Tesla Compute Cluster (TCC) mode.
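A sketch of both options, run from an administrator cmd window (the TdrDelay key follows Microsoft’s TDR registry documentation; the registry change needs a reboot, and note that a GPU in TCC mode can no longer drive a display):

```
rem Option 1: raise the WDDM timeout to 60 seconds (the default is 2)
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f

rem Option 2: switch the Quadro to TCC mode (GPU 0 here)
nvidia-smi -i 0 -dm 1
```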
Thanks, Mat. I made the computation task more intensive, and then I can see the speed-up from OpenACC.
BTW, I have a question about accelerator profiling. I noticed that the code I compiled with PGI Fortran doesn’t show consistent behavior regarding accelerator profiling, i.e., sometimes it generates accelerator profiling information, but sometimes it generates nothing about the accelerator’s performance.
I used the Visual Fortran IDE and specified accelerator profiling under “Properties–>Debug–>Accelerator Profiling (Yes)”.
Is there anything different between when the profile is printed versus when it’s not? It’s odd behavior that I don’t have an answer for.
“Properties–>Debug–>Accelerator Profiling” sets the environment variable PGI_ACC_TIME=1, so it could be that the variable is not getting set in some cases, though I don’t know why. Another possibility is that the code is not running on the GPU (by default a unified binary is created, which contains both a host and a GPU version of the code, i.e. -ta=host,tesla). Finally, maybe the output is getting lost somehow?
One thing to try is setting “-ta=tesla:time” so profiling is always on. This will also remove the host version.
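Equivalently, from a PGI command shell (the source file name here is a placeholder), something like:

```
pgfortran -acc -ta=tesla:time -Minfo=accel matmul.f90 -o matmul.exe
matmul.exe
```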