syntax error ?

alechand · May 22, 2013, 6:31pm

Hello,
can you help me with this simple piece of code?

###########################

!$acc region
do i=1,N

   x(i)=x(i)+dx

   c1=0 ; c2=0
   do j=1,N
      if (x(i)<xold(j)) then
         c1=c1+1
      endif

      if (x(i)>xold(j)) then
         c2=c2+1
      endif
   enddo
   Fx=dble(c1-c2)/dble(N)

   v(i)=v(i)+dv

   F(i)=Fx

enddo !!! end i
!$acc end region

###########################

I don't know why, but it does not reproduce the result I get
when the code is compiled and run without the accelerator.

The compilation output seems to be ok:

###########################

     34, Loop unrolled 16 times
     35, Loop unrolled 4 times
     40, Loop unrolled 8 times
     68, Loop unrolled 16 times
     83, Generating present_or_copy(f(1:8192))
         Generating present_or_copy(v(1:8192))
         Generating present_or_copy(x(1:8192))
         Generating present_or_copyin(xold(1:8192))
         Generating compute capability 2.0 binary
     84, Loop is parallelizable
         Accelerator kernel generated
         84, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             CC 2.0 : 27 registers; 0 shared, 104 constant, 0 local memory bytes
     91, Loop is parallelizable
    121, Loop unrolled 16 times
energy:
    147, Loop unrolled 4 times

#############################

I think the problem is with the variables inside the " j " loop. How can I tell the compiler that each " i " has its own c1, c2 and Fx?

Thank you very much

MatColgrove · May 22, 2013, 6:57pm

Hi alechand,

> I think the problem is with the variables inside the " j " loop. How can I tell the compiler that each " i " has its own c1, c2 and Fx?

Scalar variables such as c1, c2, and Fx are privatized by default, so each thread does have its own copy. I doubt this is the problem.
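If you want to spell the privatization out anyway, a minimal sketch might look like the following. It assumes the OpenACC "parallel loop" directive with a private clause instead of the "region" directive used above, and the module and subroutine names (force_mod, compute_force) are made up for illustration. The clause only states explicitly what the compiler already does for these scalars, so it should not change the results.

```fortran
module force_mod
  implicit none
contains
  ! Hypothetical helper: force on each particle from counting the
  ! old positions on either side of it (same rule as the loop above).
  subroutine compute_force(n, x, xold, f)
    integer, intent(in) :: n
    double precision, intent(in) :: x(n), xold(n)
    double precision, intent(out) :: f(n)
    integer :: i, j, c1, c2
    double precision :: fx

    ! private(...) gives every iteration of i its own c1, c2 and fx.
    ! Without an accelerator compiler the directives are plain comments.
    !$acc parallel loop private(c1, c2, fx) copyin(x, xold) copyout(f)
    do i = 1, n
      c1 = 0
      c2 = 0
      !$acc loop seq
      do j = 1, n
        if (x(i) < xold(j)) c1 = c1 + 1
        if (x(i) > xold(j)) c2 = c2 + 1
      end do
      fx = dble(c1 - c2) / dble(n)
      f(i) = fx
    end do
  end subroutine compute_force
end module force_mod
```

It would be called as "call compute_force(N, x, xold, F)" after a "use force_mod" statement.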

> I don't know why, but it does not reproduce the result I get
> when the code is compiled and run without the accelerator.

Can you post or send to PGI Customer Service (trs@pgroup.com) a reproducing example? I can’t tell what’s wrong from this snippet.

Mat

alechand · May 22, 2013, 7:38pm

Hi,
I found the problem. At the beginning of the code I generate some random numbers. I was comparing the results from pgfortran and gfortran, but each compiler produces different random numbers with the default seed…

Now that I always use pgfortran for the comparison, it works.
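For anyone who still wants to compare runs across compilers, one option is to avoid the built-in generator and carry a small one in the code. The sketch below is only an illustration: it uses the classic Park-Miller "minimal standard" linear congruential generator, and the names portable_rng and lcg_uniform are hypothetical. It is not a high-quality generator, but the state update uses only integer arithmetic, so the sequence of states is the same with every compiler.

```fortran
module portable_rng
  use iso_fortran_env, only: int64
  implicit none
  ! Park-Miller constants: multiplier 7**5 and modulus 2**31 - 1.
  integer(int64), parameter :: lcg_a = 16807_int64
  integer(int64), parameter :: lcg_m = 2147483647_int64
contains
  ! Advance the state and return a uniform number in (0,1).
  ! The caller must start with a state between 1 and lcg_m - 1.
  function lcg_uniform(state) result(u)
    integer(int64), intent(inout) :: state
    double precision :: u
    ! The product fits in 64 bits because both factors are below 2**31.
    state = mod(lcg_a * state, lcg_m)
    u = dble(state) / dble(lcg_m)
  end function lcg_uniform
end module portable_rng
```

Starting from state = 1, the first two states are 16807 and 282475249.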

Thanks

alechand · May 24, 2013, 1:45pm

Hi Mat,
Coming back to the previous problem: the results are not the same (compared with a host execution) if I increase the TIME variable in my code:

#######################

!$acc data

do t=1,TIME

!$acc region
   do i=1,N

      x(i)=x(i)+DT1*v(i)+DT2*F(i)

      !!! calculate forces
      !Fx=0.
      c1=0 ; c2=0
      do j=1,N
         if (x(i)<xold(j)) then
            !Fx=Fx+1./dble(N)
            c1=c1+1
         endif

         if (x(i)>xold(j)) then
            !Fx=Fx-1./dble(N)
            c2=c2+1
         endif
      enddo
      Fx=dble(c1-c2)/dble(N)
      !!! calculate forces_end

      v(i)=v(i)+DT1/2.*( F(i) + Fx )

      F(i)=Fx

   enddo !!! end i
!$acc end region

   xold=x

enddo !!! end t

!$acc end data

############################

For example, if I use N=500 and TIME=100000,
the results are different, but if I use TIME=10000, they are the same.

It seems the compiler is not able to compute things correctly when I increase TIME…

Can you help? Thanks

MatColgrove · May 24, 2013, 10:41pm

Hi alechand,

My best guess would be that you’re overflowing the data range of x, but there’s not enough info here to be sure. Please either post or send me a reproducing test case and I’ll see if I can determine the problem.

Mat

alechand · May 24, 2013, 11:58pm

Thanks Mat,
I sent you an email.

MatColgrove · May 28, 2013, 3:54pm

Hi alechand,

It appears the difference is caused by accumulated rounding error when using fused-multiply-add (FMA) instructions. The same issue can be seen on the CPU when using higher optimizations. Try adding the flag “-ta=nvidia,nofma” to see if this helps.

Mat

alechand · May 28, 2013, 6:01pm

Mat,
unfortunately, this did not help.

I was wondering: if the results agree “exactly” using TIME=100000,
how can there be accumulated errors?

Do you have any other ideas?
I really appreciate your attention.

MatColgrove · May 28, 2013, 8:45pm

We were able to track this down. It looks like at least a few x(i) and xold(j) values differ by less than a small margin of error. With slight changes in precision in these cases, which of the two values is larger may flip, leading to divergent values of c1 and c2. This then has a cascading effect which leads to the eventual wrong answer.
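To illustrate how sensitive the counting is, here is a small host-only sketch. The names flip_demo and force_at are made up for this example; the function applies the same c1/c2 rule as the loop above to a single position.

```fortran
module flip_demo
  implicit none
contains
  ! Same counting rule as the inner j loop, for one particle position xi.
  function force_at(xi, xold) result(fx)
    double precision, intent(in) :: xi
    double precision, intent(in) :: xold(:)
    double precision :: fx
    integer :: j, c1, c2

    c1 = 0
    c2 = 0
    do j = 1, size(xold)
      if (xi < xold(j)) c1 = c1 + 1
      if (xi > xold(j)) c2 = c2 + 1
    end do
    fx = dble(c1 - c2) / dble(size(xold))
  end function force_at
end module flip_demo
```

With xold = [0.1d0, 0.2d0, 0.3d0, 0.4d0], force_at(0.3d0, xold) gives -0.25. Moving the position by one unit in the last place changes the answer: force_at(nearest(0.3d0, 1.0d0), xold) gives -0.5 and force_at(nearest(0.3d0, -1.0d0), xold) gives 0. A difference in the last bit of x(i) between host and device is therefore enough to change Fx, and the change then feeds into v and x on the next time step.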

Mat
