# Inconsistent results on GPU

Hi,

I have been having trouble with a portion of code that I am trying to port to CUDA. The module is part of a smoothed particle hydrodynamics code for solid mechanics.

The subroutine is used to calculate an “artificial stress” that is used to stabilize particles that are in tension. The algorithm works very well on the CPU, but I just cant seem to get it to work properly on the GPU. On the CPU, the algorithm looks like this:

SUBROUTINE art_stress_simple(itimestep,ntotal,hsml,mass,niac,rho,pair_i,pair_j,w,dwdx,F_int,countiac,sig)

IMPLICIT NONE
INCLUDE 'param90.inc'

INTEGER:: ntotal, niac, pair_i(max_interaction),pair_j(max_interaction)
INTEGER:: i, j, k, d, n, m, countiac(maxn), itimestep
DOUBLE PRECISION:: hsml(maxn), mass(maxn), x(dim,maxn),vx(dim,maxn),              &
rho(maxn), w(max_interaction), dvxdt(dim,maxn),                &
dwdx(dim,max_interaction), w_deltap
DOUBLE PRECISION:: dx, epsilon_stress, r, dr, Ra(dim,dim,maxn), Rb(dim,dim,maxn), &
sig(dim,dim,maxn), rr, f_abi, F_abj
DOUBLE PRECISION:: F_int(dim,maxn), F1_int, F2_int, F3_int, deltap, n_stress

F_int    = 0.0
deltap   = 0.005

epsilon_stress = 0.3
n_stress = 6.0

CALL simple_kernel(deltap,hsml(1),w_deltap)

!     Calculate SPH sum for artificial stress

DO k=1,niac
i = pair_i(k)
j = pair_j(k)

f_abi = w(k)/w_deltap
f_abi = f_abi**n_stress

f_abj = w(k)/w_deltap
f_abj = f_abj**n_stress

!       For the ith particle in the pair

IF (sig(1,1,i) .gt. 0.0) THEN
Ra(1,1,i) = -epsilon_stress*sig(1,1,i)
ELSE
Ra(1,1,i) = 0.0
END IF
IF (sig(2,2,i) .gt. 0.0) THEN
Ra(2,2,i) = -epsilon_stress*sig(2,2,i)
ELSE
Ra(2,2,i) = 0.0
END IF
IF (sig(3,3,i) .gt. 0.0) THEN
Ra(3,3,i) = -epsilon_stress*sig(3,3,i)
ELSE
Ra(3,3,i) = 0.0
END IF

!       For the jth particle in the pair

IF (sig(1,1,j) .gt. 0.0) THEN
Rb(1,1,j) = -epsilon_stress*sig(1,1,j)
ELSE
Rb(1,1,j) = 0.0
END IF
IF (sig(2,2,j) .gt. 0.0) THEN
Rb(2,2,j) = -epsilon_stress*sig(2,2,j)
ELSE
Rb(2,2,j) = 0.0
END IF
IF (sig(3,3,j) .gt. 0.0) THEN
Rb(3,3,j) = -epsilon_stress*sig(3,3,j)
ELSE
Rb(3,3,j) = 0.0
END IF

F1_int = (Ra(1,1,i)*f_abi/rho(i)**2.0 + Rb(1,1,j)*f_abj/rho(j)**2.0)*dwdx(1,k)

F2_int = (Ra(2,2,i)*f_abi/rho(i)**2.0 + Rb(2,2,j)*f_abj/rho(j)**2.0)*dwdx(2,k)

F3_int = (Ra(3,3,i)*f_abi/rho(i)**2.0 + Rb(3,3,j)*f_abj/rho(j)**2.0)*dwdx(3,k)

F_int(1,i) = F_int(1,i) + F1_int*mass(j)
F_int(1,j) = F_int(1,j) - F1_int*mass(i)

F_int(2,i) = F_int(2,i) + F2_int*mass(j)
F_int(2,j) = F_int(2,j) - F2_int*mass(i)

F_int(3,i) = F_int(3,i) + F3_int*mass(j)
F_int(3,j) = F_int(3,j) - F3_int*mass(i)
END DO

END SUBROUTINE art_stress_simple

SUBROUTINE SimpleKernel(r,hsml,w)

IMPLICIT NONE

INCLUDE 'param90.inc'

Real(fp_kind):: r, hsml, w, q, factor

q = r/hsml
w = 0.e0

factor = 3.e0/(2.e0*pi*hsml*hsml*hsml)
If (q.ge.0.and.q.le.1.e0) then
w = factor * (2./3. - q*q + q**3 / 2.)
Else If (q.gt.1.e0.and.q.le.2) then
w = factor * 1.e0/6.e0 * (2.-q)**3
Else
w = 0.0
End If

END SUBROUTINE SimpleKernel

The first thing that I figured needs to be different on the GPU is that Ra and Rb should be evaluated in one subroutine and then they should be used in an other routine. The GPU version looks like:

Module Module_Artificial_Stress

Use cudafor
Use Module_Precision
Use Module_Neighbors

Contains

Attributes(Global) Subroutine artificial_stress(sig,Ra1,Ra2,Ra3,nTotal)

Implicit None

Integer, Value:: nTotal
Integer, Device:: i
Real(fp_kind), Parameter:: AS_factor = 0.3
Real(fp_kind), Device:: sig(nTotal,Dim,Dim)
Real(fp_kind), Device:: Ra1(nTotal), Ra2(nTotal), Ra3(nTotal)

If (i >= 1 .and. i <= nTotal) Then
If (sig(i,1,1) > 0.0) Then
Ra1(i) = -AS_factor*sig(i,1,1)
End If
If (sig(i,2,2) > 0.0) Then
Ra2(i) = -As_factor*sig(i,2,2)
End If
If (sig(i,3,3) > 0.0) Then
Ra3(i) = -AS_factor*sig(i,3,3)
End If
End If

End Subroutine artificial_stress

Attributes(Global) Subroutine artificial_stress_force(Ra1,Ra2,Ra3,w,dwdx,hsml,rho,mass,Neib,F_int,nTotal)

Implicit None

Include "param90.inc"

Integer, Value:: nTotal
Integer:: i, j, k, d
Integer, Device, Intent(IN):: Neib(nTotal,nNeib_max)
Real(fp_kind), Parameter:: n_stress = 6.0
Real(fp_kind), Parameter:: delta_p = 0.005
Real(fp_kind), Device:: fab, w_delta_p, R1, R2, R3
Real(fp_kind), Device:: w(nTotal,nNeib_max), dwdx(nTotal,nNeib_max,Dim), hsml(nTotal), rho(nTotal), mass(nTotal)
Real(fp_kind), Device:: Ra1(nTotal), Ra2(nTotal), Ra3(nTotal)
Real(fp_kind), Device:: F_int(nTotal,Dim)

If (i >= 1 .and. i <= nTotal) Then

Call SimpleKernel(delta_p,hsml(i),w_delta_p)
Do k = 3, Neib(i,2) + 2
j = Neib(i,k)
fab = (w(i,k)/w_delta_p)**n_stress

R1 = Ra1(i)*fab/rho(i)**2.0 + Ra1(j)*fab/rho(j)**2.0
R2 = Ra2(i)*fab/rho(i)**2.0 + Ra2(j)*fab/rho(j)**2.0
R3 = Ra3(i)*fab/rho(i)**2.0 + Ra3(j)*fab/rho(j)**2.0

F_int(i,1) = F_int(i,1) + R1*mass(j)*dwdx(i,k,1)
F_int(i,2) = F_int(i,2) + R2*mass(j)*dwdx(i,k,2)
F_int(i,3) = F_int(i,3) + R3*mass(j)*dwdx(i,k,3)
End Do
End If

End Subroutine artificial_stress_force

Attributes(Device) SUBROUTINE SimpleKernel(r,hsml,w)

IMPLICIT NONE

INCLUDE 'param90.inc'

Real(fp_kind):: r, hsml, w, q, factor

q = r/hsml
w = 0.e0

factor = 3.e0/(2.e0*pi*hsml*hsml*hsml)
If (q.ge.0.and.q.le.1.e0) then
w = factor * (2./3. - q*q + q**3 / 2.)
Else If (q.gt.1.e0.and.q.le.2) then
w = factor * 1.e0/6.e0 * (2.-q)**3
Else
w = 0.0
End If

END SUBROUTINE SimpleKernel

End Module Module_Artificial_Stress

The problem is that on the GPU the results just do not line up with the expected results (the CPU algorithm gives the correct results). The results on the GPU are not consistent, the artificial stress is different each time the program is run (makes me think I am doing something wrong with reading and writing the same value across multiple threads).

Maybe I have done something fundamentally wrong in my GPU implementation? Does anything jump out as being wrong to anyone?

This is driving me a LITTLE crazy, I have been over the code a bunch of times, have re-written the artificial stress module from scratch a couple of times… very frustrated!

The rest of the code (if I do not calculate the artificial stress) works as it should on the GPU and the results fit perfectly with the CPU results.

Thank you for any help,

Kirk

Hi Kirk,

I guess I don’t quite understand the algorithm changes you made.

What happened to Rb in your GPU code? It seems to have disappeared and Ra is used in it’s place. Same with the F_int(j,…) values.

Also, do you really want to compute all values of “i”? In the CPU version you’re computing k=1,niac and then selective “i” values.

What I’d recommend is compile in emulation mode (-Mcuda=emu) and then run the program in the PGI debugger. Or you have Allinea DDT, you can debug on the device (with PGI 14.1 or later).

• Mat

Hi Mat,

Thank you for taking a look, looking back, I should have given more information about what I was doing differently on the GPU… I should have realized that I would need to go into some of the more intricate details of smoothed particle hydrodynamics to explain what I am trying to do.

I will have to try out the debugger in one of the latest compilers (I guess I should install 4.3 since this is the most up to date?)

As an aside, I will be at GTC next week, do you have some suggestions for hands on training sessions for us CUDA Fortran guys (I saw that you are given an intro class)?

Taker easy,

Kirk

As an aside, I will be at GTC next week, do you have some suggestions for hands on training sessions for us CUDA Fortran guys (I saw that you are given an intro class)?

I am teaching session S4800 from 4:00-5:30 on Tuesday. Actually, I’m more presenting the material. Greg Ruetsch who wrote the book “CUDA Fortran for Scientists and Engineers” put the material together. You’re welcome to come, though you might be beyond the intro level!

If you’re interested in OpenACC, I’m assisting Michael Wolfe with his tutorial on Wednesday 5-6:30.

I’ll be at the “ask the experts” hangout, #C, Tuesday 2-3 and Wednesday 12-1, working in the NVIDIA HPC booth Tuesday night 6-8, and the PGI Booth 7-8 on Wednesday night. I’m sure how much code review we could do, but we could try.

Oh, and I’m presenting the new SPEC ACCEL benchmark (http://www.spec.org/accel) at 10:30 on Wednesday (Session S4437) if your interested.

• Mat