OpenMP and Accelerator directives

Hi all,

I have a loop that is parallelizable. When I use OpenMP directives, keeping all the variables private except the result, I obtain the same results as when I compile without the OpenMP flag. However, when I switch to the !$acc directives, keeping the same variables private, the results are completely different. How is that possible?

Here is the parallelizable loop with both sets of directives:

!$omp parallel
!$omp do private(Vbond,Vangle,r1x,r2x), &
!$omp private(r1y,r2y,r1z,r2z), &
!$omp private(r,a,r_1,r_2,th,costh,i)
do 101 i=1,resids
Vbond=0.0D0
Vangle=0.0D0
Vdieh=0.0D0

r1x=(r_n(i,1)-r_ca(i,1))
r1y=(r_n(i,2)-r_ca(i,2))
r1z=(r_n(i,3)-r_ca(i,3))
r=(r1x**2+r1y**2+r1z**2)**0.50D0
r_1=r
Vbond=Vbond+0.50D0*kbond*(r-ro_nca)**2

r2x=(r_c(i,1)-r_ca(i,1))
r2y=(r_c(i,2)-r_ca(i,2))
r2z=(r_c(i,3)-r_ca(i,3))
r=(r2x**2+r2y**2+r2z**2)**0.50D0
r_2=r
Vbond=Vbond+0.50D0*kbond*(r-ro_cac)**2

a=r1x*r2x+r1y*r2y+r1z*r2z
costh=a/(r_1*r_2)
th=acos(costh)
Vangle=Vangle+0.50D0*kangle*(th-tho_ncac)**2

!$omp critical
E(i)=Vangle+Vbond
!$omp end critical

101 continue

!$omp end do
!$omp end parallel


!$acc region do copyin(r_n,r_ca,r_c), copy(E), &
!$acc private(Vbond,Vangle,r1x,r2x,r1y,r2y,r1z,r2z), &
!$acc private(r,a,r_1,r_2,th,costh,i)
do 101 i=1,resids
Vbond=0.0D0
Vangle=0.0D0
Vdieh=0.0D0

rx=(r_n(i,1)-r_ca(i,1))
ry=(r_n(i,2)-r_ca(i,2))
rz=(r_n(i,3)-r_ca(i,3))
r1x=rx
r1y=ry
r1z=rz
r=(rx**2+ry**2+rz**2)**0.50D0
r_1=r
Vbond=Vbond+0.50D0*kbond*(r-ro_nca)**2

rx=(r_c(i,1)-r_ca(i,1))
ry=(r_c(i,2)-r_ca(i,2))
rz=(r_c(i,3)-r_ca(i,3))
r2x=rx
r2y=ry
r2z=rz
r=(rx**2+ry**2+rz**2)**0.50D0
r_2=r
Vbond=Vbond+0.50D0*kbond*(r-ro_cac)**2

a=r1x*r2x+r1y*r2y+r1z*r2z
costh=a/(r_1*r_2)
th=acos(costh)
Vangle=Vangle+0.50D0*kangle*(th-tho_ncac)**2

E(i)=Vangle+Vbond

101 continue

Hi Marco,

How is that possible?

The only thing that jumps out is that you're using acos, square root, and exponentiation operations, which can be relatively imprecise on a GPU. Is your code precision sensitive?

One thing to try is to store your intermediary calculations in temporary arrays and compare the CPU and GPU results to determine where the divergence occurs.
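A minimal sketch of that comparison, under stated assumptions: the program name, the array names (r_ref, th_ref, r_dev, th_dev), the report helper, and the tolerance are all hypothetical, and a single-precision evaluation stands in for the accelerator path so the example runs without a GPU. In the real code, the "_ref" arrays would be filled by the host loop and the "_dev" arrays by the !$acc loop.

```fortran
program compare_intermediates
  implicit none
  integer, parameter :: dp = kind(1.0d0), sp = kind(1.0)
  integer, parameter :: n = 5
  real(dp) :: r_ref(n), th_ref(n), r_dev(n), th_dev(n)
  real(dp) :: x, c
  integer :: i

  do i = 1, n
     x = 1.0_dp + 0.1_dp*i
     c = 1.0_dp/x                 ! stand-in for costh, always inside (0,1)

     ! Reference path: double precision throughout
     r_ref(i)  = sqrt(x)
     th_ref(i) = acos(c)

     ! ASSUMPTION: single precision stands in for the accelerator path
     r_dev(i)  = real(sqrt(real(x, sp)), dp)
     th_dev(i) = real(acos(real(c, sp)), dp)
  end do

  ! Compare each intermediate separately to see where divergence starts
  call report('r ', r_ref, r_dev, 1.0e-12_dp)
  call report('th', th_ref, th_dev, 1.0e-12_dp)

contains

  ! Print the largest absolute difference and the first index over tol
  ! (0 means no element exceeded the tolerance).
  subroutine report(name, ref, dev, tol)
    character(len=*), intent(in) :: name
    real(dp), intent(in) :: ref(:), dev(:), tol
    integer :: k, first

    first = 0
    do k = 1, size(ref)
       if (abs(ref(k) - dev(k)) > tol) then
          first = k
          exit
       end if
    end do
    print '(a,a,es10.3,a,i0)', name, ': max |diff| = ', &
         maxval(abs(ref - dev)), ', first index over tol = ', first
  end subroutine report

end program compare_intermediates
```

Checking the intermediates in the order they are computed (r, then costh, then th) should show which operation first disagrees, rather than only seeing that the final E differs.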

On a side note, scalar variables are implicitly private in the Accelerator model. While it doesn’t hurt to declare them private, it isn’t necessary. Also, you can use the “copyout” clause for E and save some data movement costs.
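As a rough illustration of both points, here is a small sketch, not the poster's actual loop: the program name and the variables x, v, and n are made up for the example. The scalar v carries no private clause, and E is only written inside the region, so it is listed under copyout rather than copy. Compilers that do not implement the PGI Accelerator directives treat the !$acc lines as comments and run the loop on the host.

```fortran
program copyout_sketch
  implicit none
  integer, parameter :: n = 4
  double precision :: x(n), E(n), v
  integer :: i

  x = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)

  ! v is a scalar, so it needs no private clause (implicitly private).
  ! E is write-only in the region, so copyout avoids the host-to-device copy.
  !$acc region copyin(x) copyout(E)
  do i = 1, n
     v = 0.5d0*x(i)**2
     E(i) = v
  end do
  !$acc end region

  print *, E
end program copyout_sketch
```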

Hope this helps,
Mat

Thanks for the help. I did that and found where the problem is. I have some if statements inside the loop that perform extra operations for some of the values, something like this:

!$acc region do copyin(r,y), copyout(E)
do i=1,n
V=r(i)**2
if (y(i).eq.1) then
V=V+r(i)**3
endif
E(i)=V
enddo

The problem is that the code inside the if statement is never executed. I don’t know why this is happening or how to solve it. Could you give me some suggestions?


Thanks,
Marco

Hi Marco,

The problem is that the code inside the if statement is never executed.

This is a compiler bug. I just found it myself yesterday and reported it to our engineers as TPR#16426. I consider this a critical bug that must be fixed soon.

In the meantime, you might be able to work around the bug by using an undocumented flag, “-ta=nvidia,oldcg”. In 10.0 we implemented a new code generator which does give better performance, but obviously still has a few problems. “oldcg” will use our previous code generator.

I apologize that our internal testing missed this error; hopefully we can have it fixed by early next year.

- Mat

Mat,

Thanks for the hint, but PVF doesn’t recognize that flag. I included it in the command line, and it gave me this message:

Compiling Project …
Energy_4bead_GPU.f90
-ta=nvidia:{analysis|nofma|keepbin|keepptx|keepgpu|maxregcount:|cc10|cc11|cc13|fastmath|mul24|time}|host
Choose target accelerator
nvidia Select NVIDIA accelerator target
analysis Analysis only, no code generation
nofma Don’t generate fused mul-add instructions
keepbin Keep kernel .bin files
keepptx Keep kernel .ptx files
keepgpu Keep kernel source files
maxregcount: Set maximum number of registers to use on the GPU
cc10 Compile for compute capability 1.0
cc11 Compile for compute capability 1.1
cc13 Compile for compute capability 1.3
fastmath Use fast math library
mul24 Use 24-bit multiplication for subscripting
time Collect simple timing information
host Compile for the host, i.e., no accelerator target
pgf95-Error-Switch -ta with unknown keyword oldcg
pgf95-Error-The -ta switch must specify an accelerator target

Energy_GPU build failed.

I will probably have to wait until that bug is fixed.

Marco

Probably I will have to wait until that bug is solved.

Sorry, I didn’t realize you were on Windows. Windows only uses the new code generator, hence “oldcg” is not available.

I’m pushing to get this fixed soon, but because of the upcoming winter break, it won’t be until early next year.

- Mat

Hi Marco,

FYI, TPR#16426 has been fixed for the 10.1 release due out later this week.

- Mat