# N-body problem nested loop with OpenACC

Hello,

I have a problem with my N-body code, in the calculation of the accelerations.

Here is my code:
```
!$acc kernels
!$acc loop
do i=2,nbod
   axhli = 0
   ayhli = 0
   azhli = 0
!$acc loop
   do j=2,nbod

      ! some calculations

      axhl(j) = axhl(j) + something
      ayhl(j) = ayhl(j) + something
      azhl(j) = azhl(j) + something

      axhli = axhli + something
      ayhli = ayhli + something
      azhli = azhli + something

   enddo

   axhl(i) = axhl(i) + axhli
   ayhl(i) = ayhl(i) + ayhli
   azhl(i) = azhl(i) + azhli
enddo
!$acc end kernels
```

Of course I get the “loop carried dependence of ‘axhl’ prevents parallelization” message. I am trying to understand what I must do to achieve 2-D parallelization on my GPU, with correct results of course. If I use the private clause I get 2-D parallelization, but with wrong results.

Thank you,
Sotiris

Hi Sotiris,

This is a tough one since the algorithm isn’t parallel as written. You might be able to create a 2-D temp array for each of the arrays, but you’d need to initialize them before the loop and then perform another reduction after. I’m not sure if the extra overhead would offset any gains you achieve in parallelizing the code. You’ll need to experiment.
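For illustration, here is the 2-D-temporary idea in a C sketch (the function and array names are made up, and the per-pair “force” is just a stand-in expression). Each (i,j) pair writes into its own slot of a temporary, so the loops carry no dependence; a second pass reduces the temporary into the 1-D result:

```c
#include <stdlib.h>

/* Sketch only: accumulate per-pair contributions into a 2-D temporary,
   then reduce.  "x[i] - x[j]" stands in for the real force term. */
void accumulate(int n, double *axhl, const double *x)
{
    double *tmp = calloc((size_t)n * n, sizeof *tmp);  /* initialized to zero */

    /* every (i,j) pair writes its own slot: no loop-carried dependence */
    #pragma acc parallel loop collapse(2) copyin(x[0:n]) copy(tmp[0:n*n])
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (i != j)
                tmp[(size_t)i * n + j] = x[i] - x[j];

    /* second pass: fold the 2-D temporary into the 1-D array */
    #pragma acc parallel loop copy(axhl[0:n]) copyin(tmp[0:n*n])
    for (int j = 0; j < n; j++) {
        double s = 0.0;
        #pragma acc loop reduction(+:s)
        for (int i = 0; i < n; i++)
            s += tmp[(size_t)i * n + j];
        axhl[j] += s;
    }
    free(tmp);
}
```

Note the temporary costs O(n²) extra memory and a second kernel, which is exactly the overhead that may offset the gains.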

• Mat

Thank you very much, Mat, for your quick reply. I want to ask you one more thing. My program is written in Fortran. Can I implement this double loop in CUDA Fortran and then call this CUDA kernel inside an OpenACC data region?

Thank you,
Sotiris

Hi Sotiris,

Can I implement this double loop in CUDA Fortran and then call this CUDA kernel inside an OpenACC data region?

In the more recent versions of the compiler, the compiler will favor the device copy of a variable contained within an OpenACC data region. So, yes, you can pass an OpenACC device data variable to a CUDA Fortran kernel from within an OpenACC data region.

The caveat being that this behaviour is non-standard. We’re looking at using a “deviceptr” clause as a standard way to make it explicit which copy of the variable to use. Though, this has not been implemented as of yet.

• Mat

Hi Mat,

I want to ask you for the best way to program the following loop (OpenACC):

```
do i=1,N
   do j=1,M
      A(i) = A(i) + B(i,j)
   enddo
enddo
```

I get the correct results if I serialize the inner loop (with the kernels construct):

```
!$acc kernels loop independent
do i=1,N
!$acc loop seq
   do j=1,M
      A(i) = A(i) + B(i,j)
   enddo
enddo
```

but I don’t know how to achieve 2-D parallelization. Is that possible? Or is it better to use 1-D vectorization with the parallel construct?
Also, could you tell me if it is possible to use a scalar variable instead of A(i) with the reduction clause, and then, after the end of the inner loop, use this variable to give the correct value to A(i)?

Thank you,
Sotiris

Hi Sotiris,

To achieve 2-D parallelism here, you either need to make “A” 2-D or add a reduction.

```
!$acc kernels loop independent
do i=1,N
   sum = 0.0
!$acc loop reduction(+:sum)
   do j=1,M
      sum = sum + B(i,j)
   enddo
   A(i) = A(i) + sum
enddo
```

The caveat being that there is overhead in performing a reduction and an inner loop reduction limits the schedules that can be used (the inner loop can only be a “vector” and the outer loop must be a “gang”). So unless your inner loop has enough computation to offset this overhead, you may be better off just accelerating the outer loop. You’ll need to experiment as to which method works best for your particular code.
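The same gang/vector split can be sketched in C (a minimal row-sum example, not your actual code; the pragmas are ignored by a non-OpenACC compiler, where the code runs serially with the same results):

```c
/* Outer loop parallelized across gangs; inner loop is a vector
   reduction into a scalar that is private to each outer iteration. */
void row_sums(int n, int m, double *a, const double *b)
{
    #pragma acc parallel loop gang copy(a[0:n]) copyin(b[0:n*m])
    for (int i = 0; i < n; i++) {
        double sum = 0.0;                      /* private per row */
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < m; j++)
            sum += b[i * m + j];
        a[i] += sum;                           /* single write per row */
    }
}
```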

• Mat

Hello Mat,

We have a problem with our code and we can’t figure out the reason. We work on the N-body problem and we’ve changed some of our Fortran functions using some OpenACC directives. We have only one “heavy” part, the part where we calculate the accelerations (a 2-D loop), done in the way we have discussed in this topic. Also, we are using a Tesla C1060.
Our implementation works fine with a small number of bodies, but when we use a large number for our system (for example 10,000 × 10,000 for the 2-D acceleration part) we get NaN in our results. We believe that this is a memory issue. Is that possible?

Another important issue is that we are using double-precision variables in our code. The Tesla C1060 has only 1/10 of the cores for double-precision calculations. Is it possible that the device can’t execute all the double-precision calculations and gives us NaN?

Thanks,
Sotiris

We believe that this is a memory issue. Is that possible?

If you were running out of memory, the binary would abort execution. The C1060s don’t have ECC memory, so it could still be memory related, but I doubt it.

Try adding “-Mlarge_arrays” to your compilation. It’s possible that the index calculations need to be adjusted. Granted 10k x 10k isn’t that large so this may not be the issue.

What do the -Minfo messages say about how the loop is being scheduled?

Another important issue is that we are using double-precision variables in our code. The Tesla C1060 has only 1/10 of the cores for double-precision calculations. Is it possible that the device can’t execute all the double-precision calculations and gives us NaN?

Doubtful. This just will slow you down, but shouldn’t give you NaNs.

The other thing I’d look for is overflows. You might have an INTEGER*4 variable that needs to be INTEGER*8, or a REAL*4 that needs to be REAL*8. Try adding the flags “-i8 -r8” to change the default kinds to the larger data types to see if that helps.
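To picture the kind of silent wrap-around a too-small integer kind can cause, here is a tiny C sketch (the size 50,000 is made up, chosen so n*n exceeds the 32-bit range; note that 10,000 × 10,000 = 10⁸ still fits in 32 bits, which is why this may not be your issue):

```c
#include <stdint.h>

/* 32-bit product: the mathematically correct value 2,500,000,000
   does not fit in int32_t, so the result wraps to a negative number. */
int32_t bad_index(int32_t n)
{
    return (int32_t)((uint32_t)n * (uint32_t)n);
}

/* 64-bit product: promoting before multiplying keeps the true value. */
int64_t good_index(int32_t n)
{
    return (int64_t)n * (int64_t)n;
}
```

An index like this going negative typically produces garbage loads rather than an abort, which is consistent with NaNs appearing only at large sizes.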

If not, then I’ll need to see a reproducing example to determine the issue.

• Mat

Hi Mat and thanks for the quick reply,

Today we finally figured out what the problem was.
To avoid useless calculations we use two IF statements inside our OpenACC regions. We have one data region, and every N steps we copy our data back to the host. In the first of those N steps we must do the calculations inside the two IF statements I mentioned above, but in the remaining N-1 steps we don’t want these calculations. The first IF statement is inside a PARALLEL region and the second IF statement is inside a KERNELS region. Here is the structure of our code:

```
!$acc data copy(ARRAYS)

do while (for N steps)

!$acc parallel
   if (flag1) then
      CALCULATIONS that we want only for the first step
   endif

   MORE CALCULATIONS
!$acc end parallel

!$acc kernels
   if (flag2) then
      CALCULATIONS that we want only for the first step
   endif
!$acc end kernels

!$acc parallel
   MORE CALCULATIONS
!$acc end parallel

enddo
!$acc end data
```

If we use those IF statements with large arrays we get NaN as results. (If we don’t use them we get the correct results, but 2X slower execution.)

Can you guess why that is happening? Could flag1 and flag2 be the problem? I mean, is it possible that not all the threads of the grid see the correct values of these scalar variables?

Thanks,
Sotiris

Hi Sotiris,

By using the “parallel” construct, you are defining that everything in this region will be moved over to the device. Also, it’s work-shared, so unless you use the “loop” directive to tell the compiler how you want the work divided, all threads in the region will execute the same code. With the “kernels” construct, the compiler figures out the best way to divide up the work. (FYI, this article may be useful in understanding the differences between the two: http://www.pgroup.com/lit/articles/insider/v4n2a1.htm)

Without seeing the code, I can’t be sure if this is the source of the NaNs, but it is possible. Do you get wrong answers if you switch to using just “kernels”?

• Mat

Hi Mat and thank you

I did a couple of experiments last week.

1) First, we solved the NaN problem by putting the IF statement outside our parallel region:

```
if (flag) then
!$acc parallel
...
!$acc end parallel
endif
```

Why was this happening? Because of my “flag”? Does every thread have its own copy of the flag variable?

2) I have two adjacent loops inside this parallel region; in the first loop I use the reduction clause, and I need the results in the second loop:

```
msys = 0
tmpx = 0
tmpy = 0
tmpz = 0

!$acc loop vector reduction(+:msys,tmpx,tmpy,tmpz)
do i=1,N
   msys = msys + m(i)
   tmpx = tempx + vx(i)
   ...
enddo

!$acc loop gang vector
do i=1,N
   vxb(i) = vx(i) + tmpx/msys
   vyb(i) = vy(i) + tmpy/msys
   ...
enddo
```

First of all, why do I need the VECTOR clause in the first loop? (I get wrong results with GANG VECTOR.) Because of the reduction?
Second, is it possible to get different results in two different executions of my program? In your article you say that there is a barrier at the end of the parallel region, not at the end of the first loop. So I believe that a random thread may not have the correct values for the calculations in the second loop (for example, not the correct value of the msys variable).

3) When I change my parallel region into a kernels region I have another problem. From the NVIDIA Visual Profiler I can see that there is communication between host and device when my program reaches that region, and I can’t figure out the reason (is it because of the reduction?). I have a copy from host to device and then, after the loop, back to the host. With the parallel construct I don’t see that communication and I get better time results. Why is that happening?

Sotiris

Why was this happening? Because of my “flag”? Does every thread have its own copy of the flag variable?

I would need a reproducing example to tell why. Each thread would get its own copy of flag1, but they should all be initialized to the same value. Most likely something else is the cause, unrelated to the IF statement itself, but I can’t tell what that is from what you have posted.

First of all, why do I need the VECTOR clause in the first loop? (I get wrong results with GANG VECTOR.) Because of the reduction?

Is this a typo in your post or a typo in your program?

tmpx = tempx + vx(i)

If this is directly from your program, then this could be the source of your issues. “tempx” may not be initialized, and “tmpx” would need a last value, causing the loop to not be parallelizable, which is probably why you need to use a “vector” clause to force parallelization.
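With the typo fixed, the two-loop pattern can be sketched like this in C (names follow your post; putting each loop in its own parallel region also sidesteps the synchronization question, since the reduction is complete before the second loop starts):

```c
/* First loop reduces mass and velocity into scalars;
   second loop uses the finished reduction results. */
void shift_velocities(int n, const double *m, const double *vx, double *vxb)
{
    double msys = 0.0, tmpx = 0.0;

    #pragma acc parallel loop reduction(+:msys,tmpx) copyin(m[0:n], vx[0:n])
    for (int i = 0; i < n; i++) {
        msys += m[i];
        tmpx += vx[i];   /* was "tmpx = tempx + vx(i)" in the post */
    }

    /* reduction results are final here, before the second region launches */
    #pragma acc parallel loop copyin(vx[0:n]) copyout(vxb[0:n])
    for (int i = 0; i < n; i++)
        vxb[i] = vx[i] + tmpx / msys;
}
```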

3) When I change my parallel region into a kernels region I have another problem. From the NVIDIA Visual Profiler I can see that there is communication between host and device when my program reaches that region, and I can’t figure out the reason (is it because of the reduction?). I have a copy from host to device and then, after the loop, back to the host. With the parallel construct I don’t see that communication and I get better time results. Why is that happening?

It could be the result of the reduction since it needs to be passed between the kernels.

• Mat