Hi all,
I have a 3D fluid dynamics application that I have been trying to accelerate.
I am using:
pgcc 16.4-0 64-bit target on x86-64 Linux -tp haswell
I wanted to create a serial version first.
Initially I compiled the application with the -fast -Mnovect options to have a baseline to compare against. Then I used -fast -Mvect=simd and got a speedup of 1.56.
My next step was to use OpenMP. I got a speedup of about 4 with 4 threads with respect to the serial version without vectorization.
Now I am trying to see how much speedup I can get using my GPUs, which are a pair of GeForce GT 750M cards with 2048 MB of memory each.
However, the best I can get is about 1.5x slower than the serial version without vectorization, which is very disappointing. Profiling the application shows that data traffic is kept to a minimum. I am beginning to think that my GPUs are not powerful enough for this application, even though they have performed quite acceptably for other applications. Unfortunately, I do not have access to other GPUs + PGI compilers to try.
I wonder if someone would like to give it a try. I have the code files, a makefile and a data file.
Hi efblack2,
Sure, I can take a look. Please either post a link to your source or send the package to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. Please include all versions (OpenACC, OpenMP, etc.) and workloads.
Hi Mat,
Thanks for the quick response,
I just sent a file (numFlDyn3d.tar.gz) to trs@pgroup.com.
Inside, there is a data file called lowRes.txt. To run the program, just provide the name of this data file as a command-line argument.
Also inside the tar file you will find 3 makefiles (Makefile1, Makefile2, Makefile3) and a soft link named Makefile.
Makefile1 will create the serial version w/o any vectorization.
Makefile2 will create the serial version with vectorization.
Makefile3 will create the OpenACC version. This is the version I am having problems with. I am providing the makefiles for the serial versions only for reference.
I hope you have all the required information to run the programs.
Thanks again,
Edgar Black
Hi Edgar,
I looked over the code. The biggest issue I see is that you’re calling “malloc” in device code. A device-side malloc is very slow. To fix this, I malloc’d these temp arrays on the host and then put them into a “private” clause. This improved overall performance by about 25%.
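Schematically, the change looks like this (a sketch with placeholder names and sizes, not code from your application):

#include <stdlib.h>

/* "compute", "out", "n", and "m" are placeholders for illustration only */
void compute(double *out, int n, int m)
{
    /* Before: the temp array was malloc'd inside the parallel region,
       which triggers a slow device-side allocation. */

    /* After: allocate the scratch array once on the host... */
    double *tmp = (double *)malloc(m * sizeof(double));

    /* ...and let "private" give each gang its own device copy, so no
       malloc ever runs in device code. */
    #pragma acc parallel loop private(tmp[0:m]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j)
            tmp[j] = (double)(i + j);   /* fill the private scratch array */
        out[i] = tmp[m - 1];            /* use the result */
    }

    free(tmp);
}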
Also, the workload seems fairly small. My guess is that as you increase the size of the workload, you’ll also see much better relative performance.
Hi Mat,
Thanks again for the advice about malloc.
I have not gotten any improvement in the last week.
I wonder if memory alignment could be an issue. All my arrays are allocated using malloc on the host and copied with pcopy on entering the data region.
Could memory alignment be a factor here? If so, how can I enforce alignment in the kernels?
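For reference, the pattern I use throughout looks roughly like this (simplified, with placeholder names):

#include <stdlib.h>

/* "advance", "u", and "n" are placeholder names for illustration */
void advance(int n)
{
    double *u = (double *)malloc(n * sizeof(double));   /* plain host malloc */
    for (int i = 0; i < n; ++i)
        u[i] = 1.0;                                     /* initialize on the host */

    /* pcopy transfers u only if it is not already present on the device */
    #pragma acc data pcopy(u[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            u[i] = 2.0 * u[i];
    }

    free(u);
}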
Thanks,
Edgar Black
Hi Edgar,
Memory is aligned by default, so I doubt that would be the problem. Though you do have a couple of spots with memory divergence, since you’re accessing memory non-coalesced across a vector loop. For example:
#pragma acc loop vector
for (int level=k1; level<=k2; ++level) {
    fld[level] = 0.5*(w[level][row][col] + w[level+1][row][col]);
}
Your options are to add “const” to “w”’s declaration to say it’s read-only, so the compiler can place the array in texture memory, or to re-order the dimensions so that “level” is used in the stride-1 dimension. However, given that “w” is written to later, “const” won’t work. Also, re-ordering the array will mean reordering it throughout your program, which may be cumbersome.
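For what it’s worth, here is a sketch of what the re-ordered access would look like; with “level” moved to the last (stride-1) dimension, consecutive vector lanes read consecutive addresses, so the loads coalesce:

/* assumes w has been re-declared/re-allocated as w[row][col][level] */
#pragma acc loop vector
for (int level=k1; level<=k2; ++level) {
    fld[level] = 0.5*(w[row][col][level] + w[row][col][level+1]);
}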