OpenMP and Accelerator directives

RobertsGroup · December 10, 2009, 5:06pm

Hi all,

I have a loop that is parallelizable. When I use OpenMP directives, keeping all the variables private except the result, I obtain the same results as when I run without the OpenMP flag. However, when I change to the !$acc directives, keeping the same variables private, the results are completely different. How is that possible?

Here is the parallelizable loop with both sets of directives:

!$omp parallel
!$omp do private(Vbond,Vangle,r1x,r2x), &
!$omp private(r1y,r2y,r1z,r2z), &
!$omp private(r,a,r_1,r_2,th,costh,i)
do 101 i=1,resids
Vbond=0.0D0
Vangle=0.0D0
Vdieh=0.0D0

r1x=(r_n(i,1)-r_ca(i,1))
r1y=(r_n(i,2)-r_ca(i,2))
r1z=(r_n(i,3)-r_ca(i,3))
r=(r1x**2+r1y**2+r1z**2)**0.50D0
r_1=r
Vbond=Vbond+0.50D0*kbond*(r-ro_nca)**2

r2x=(r_c(i,1)-r_ca(i,1))
r2y=(r_c(i,2)-r_ca(i,2))
r2z=(r_c(i,3)-r_ca(i,3))
r=(r2x**2+r2y**2+r2z**2)**0.50D0
r_2=r
Vbond=Vbond+0.50D0*kbond*(r-ro_cac)**2

a=r1x*r2x+r1y*r2y+r1z*r2z
costh=a/(r_1*r_2)
th=acos(costh)
Vangle=Vangle+0.50D0*kangle*(th-tho_ncac)**2

!$omp critical
E(i)=Vangle+Vbond
!$omp end critical

101 continue

!$omp end do
!$omp end parallel

!$acc region do copyin(r_n,r_ca,r_c), copy(E), &
!$acc private(Vbond,Vangle,r1x,r2x,r1y,r2y,r1z,r2z), &
!$acc private(r,a,r_1,r_2,th,costh,i)
do 101 i=1,resids
Vbond=0.0D0
Vangle=0.0D0
Vdieh=0.0D0

rx=(r_n(i,1)-r_ca(i,1))
ry=(r_n(i,2)-r_ca(i,2))
rz=(r_n(i,3)-r_ca(i,3))
r1x=rx
r1y=ry
r1z=rz
r=(rx**2+ry**2+rz**2)**0.50D0
r_1=r
Vbond=Vbond+0.50D0*kbond*(r-ro_nca)**2

rx=(r_c(i,1)-r_ca(i,1))
ry=(r_c(i,2)-r_ca(i,2))
rz=(r_c(i,3)-r_ca(i,3))
r2x=rx
r2y=ry
r2z=rz
r=(rx**2+ry**2+rz**2)**0.50D0
r_2=r
Vbond=Vbond+0.50D0*kbond*(r-ro_cac)**2

a=r1x*r2x+r1y*r2y+r1z*r2z
costh=a/(r_1*r_2)
th=acos(costh)
Vangle=Vangle+0.50D0*kangle*(th-tho_ncac)**2

E(i)=Vangle+Vbond

101 continue

MatColgrove · December 10, 2009, 6:14pm

Hi Marco,

How is that possible?

The only thing that jumps out is that you're using acos, square root, and exponentiation operations, which can be relatively imprecise on a GPU. Is your code precision sensitive?

One thing to try is to store your intermediary calculations in temporary arrays and compare the CPU and GPU results to determine where the divergence occurs.
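As a rough sketch of that comparison (not anything from the PGI toolchain): suppose each intermediate, for example r_1, costh and th, is saved per iteration into an array, once from the CPU run and once from the accelerator run. A small helper can then report the first index where the two arrays disagree beyond a relative tolerance. The module name divergence_check, the function first_divergence and the tolerance handling are all made up for illustration.

```fortran
module divergence_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical helper: return the index of the first element where the
  ! CPU and GPU values differ by more than tol (relative to the larger
  ! magnitude, with a floor of 1), or 0 if every compared element agrees.
  pure function first_divergence(cpu, gpu, tol) result(idx)
    real(dp), intent(in) :: cpu(:), gpu(:)
    real(dp), intent(in) :: tol
    integer :: idx
    integer :: i
    real(dp) :: denom

    idx = 0
    do i = 1, min(size(cpu), size(gpu))
      denom = max(abs(cpu(i)), abs(gpu(i)), 1.0_dp)
      if (abs(cpu(i) - gpu(i)) > tol * denom) then
        idx = i
        return
      end if
    end do
  end function first_divergence
end module divergence_check
```

Calling it on each pair of saved arrays in the order the loop computes them (r_1 first, then r_2, costh, th, and finally E) should point at the first operation whose result differs. If only th diverges while costh agrees, for instance, acos would be the place to look.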

On a side note, scalar variables are implicitly private in the Accelerator model. While it doesn’t hurt to declare them private, it isn’t necessary. Also, you can use the “copyout” clause for E and save some data movement costs.

Hope this helps,
Mat

RobertsGroup · December 11, 2009, 6:04pm

Thanks for the help. I did that and found where the problem is. I have some if statements inside the loop that perform extra operations for some of the values, something like this:

!$acc region do copyin(r,y), copyout(E)
do i=1,n
V=r(i)**2
if (y(i).eq.1) then
V=V+r(i)**3
endif
E(i)=V
enddo

The problem is that the code inside the if statement is never executed. I don’t know why this is occurring or how to solve it. Could you give me some suggestions?

Thanks,
Marco

MatColgrove · December 11, 2009, 6:57pm

Hi Marco,

The problem is that the code inside the if statement is never executed.

This is a compiler bug. I just found it myself yesterday and reported it to our engineers as TPR#16426. I consider this a critical bug that must be fixed soon.

In the meantime, you might be able to work around the bug by using an undocumented flag “-ta=nvidia,oldcg”. In 10.0 we implemented a new code generator which does give better performance, but obviously still has a few problems. “oldcg” will use our previous code generator.

I apologize that our internal testing missed this error, and I hope we can have it fixed by early next year.

Mat

RobertsGroup · December 11, 2009, 7:45pm

Mat,

Thanks for the hint, but PVF doesn’t recognize that flag. I included it in the command line, and it gave me this message

Compiling Project …
Energy_4bead_GPU.f90
-ta=nvidia:{analysis|nofma|keepbin|keepptx|keepgpu|maxregcount:|cc10|cc11|cc13|fastmath|mul24|time}|host
Choose target accelerator
nvidia Select NVIDIA accelerator target
analysis Analysis only, no code generation
nofma Don’t generate fused mul-add instructions
keepbin Keep kernel .bin files
keepptx Keep kernel .ptx files
keepgpu Keep kernel source files
maxregcount:
Set maximum number of registers to use on the GPU
cc10 Compile for compute capability 1.0
cc11 Compile for compute capability 1.1
cc13 Compile for compute capability 1.3
fastmath Use fast math library
mul24 Use 24-bit multiplication for subscripting
time Collect simple timing information
host Compile for the host, i.e., no accelerator target
pgf95-Error-Switch -ta with unknown keyword oldcg
pgf95-Error-The -ta switch must specify an accelerator target

Energy_GPU build failed.

Probably I will have to wait until that bug is solved.

Marco

MatColgrove · December 11, 2009, 9:23pm

Probably I will have to wait until that bug is solved.

Sorry, I didn’t realize you were on Windows. Windows only uses the new code generator, hence “oldcg” is not available.

I’m pushing to get this fixed soon, but because of the upcoming winter break, it won’t be until early next year.

Mat

MatColgrove · January 4, 2010, 7:41pm

Hi Marco,

FYI, TPR#16426 has been fixed for the 10.1 release due out later this week.

Mat
