Application not accelerating with OpenACC

Hi all,

I have a 3D fluid dynamics application that I have been trying to accelerate.
I am using:
pgcc 16.4-0 64-bit target on x86-64 Linux -tp haswell

I wanted to create a serial version first, so initially I compiled the application with the -fast -Mnovect options to have a baseline to compare against. Then I used -fast -Mvect=simd and got a speedup of 1.56.

My next step was to use OpenMP. I got a speedup of about 4 with 4 threads relative to the serial version without vectorization.

Now I am trying to see how much speedup I can get using my GPUs, which are a pair of GeForce GT 750M cards with 2048 MB of memory each.

However, the best I can get is about 1.5x slower than the serial version without vectorization, which is very disappointing. Profiling the application shows that data traffic is kept to a minimum. I am beginning to think that my GPUs are not powerful enough for this application, despite the fact that they have performed quite acceptably for other applications. Unfortunately, I do not have access to other GPUs + PGI compilers to try.

I wonder if someone would like to give it a try. I have the code files, a makefile and a data file.

Hi efblack2,

Sure, I can take a look. Please either post a link to your source or send the package to PGI Customer Service and ask them to forward it to me. Please include all versions (OpenACC, OpenMP, etc.) and workloads.

  • Mat

Hi Mat,

Thanks for the quick response,

I just sent a file (numFlDyn3d.tar.gz) to

Inside, there is a data file called lowRes.txt. To run the program, just provide the name of this data file as a command-line parameter.

Also inside the tar file you will find 3 makefiles (Makefile1, Makefile2, Makefile3) and a soft link to a Makefile.

Makefile1 will create the serial version w/o any vectorization.
Makefile2 will create the serial version with vectorization.
Makefile3 will create the OpenACC version. This is the version I am having problems with. I am providing the makefiles for the serial versions only for reference.

I hope you have all the required information to run the programs.

Thanks again,

Edgar Black

Hi Edgar,

I looked over the code. The biggest issue I see is that you’re calling “malloc” in device code. A device-side malloc is very slow. To fix this, I malloc’d these temp arrays on the host and then put them into a “private” clause. This improved overall performance by about 25%.

Also, the workload seems fairly small. My guess is as you increase the size of the workload, you’ll also see much better relative performance.

  • Mat

Hi Mat,

Thanks again for the advice about malloc.

I have not gotten any improvement in the last week.

I wonder if memory alignment could be an issue. All my arrays are allocated with malloc on the host and copied with pcopy when entering the data region.

Could memory alignment be a factor here? If yes, how can I enforce alignment in the kernels?


Edgar Black

Hi Edgar,

Memory is aligned by default, so I doubt that would be a problem. However, you do have a couple of spots with memory divergence, since you’re accessing memory non-coalesced across a vector loop. For example:

                    #pragma acc loop vector
                    for (int level=k1; level<=k2; ++level) {
                        fld[level] = 0.5*(w[level][row][col] + w[level+1][row][col]);
                    }

Your options are to add “const” to “w”’s declaration to say it’s read-only, so the compiler can place the array in texture memory, or to re-order the dimensions so that “level” is the stride-1 dimension. However, given that “w” is written to later, “const” won’t work. Also, re-ordering the array means re-ordering it throughout your program, which may be cumbersome.

  • Mat