Code finishes without thread execution

Hello
I have a question with the following code. When I run it the device kernel does not execute. It should return an array UIPd(3*L) full of number data. But it seems that when UIPd(i1),UIPd(i2),UIPd(i3) is calculated with less terms in the equation at the end of the device code, it executes fine… though i do need the whole equation. Any ideas on what I am doing wrong? Thank you for your time. I really appreciate it.

module Subroutines 
 use cudafor
 contains
!C=======================================================================
      attributes(global) subroutine INTDONE(CX,CY,X,Y,NOE,NEPT,EX,EY,FI,DFI,GI,OME,VEXAI,NOIC,QCELL,UIP,IPFLAGS,N,NEX,NE,L,T,XNU,E,Q,NGAUSS,NCELLS)

      IMPLICIT NONE
!      IMPLICIT DOUBLE PRECISION (A-H)
!      IMPLICIT DOUBLE PRECISION (O-Z)

!C----COMPUTE THE INTERNAL POINTS DISPLACEMENT FOR ONE POINT

      Integer, value ::  N,NEX,NE,L,NCELLS,NGAUSS
      Integer  IPFLAGS(400)

      REAL*8, value ::  T,XNU,E,Q

      REAL*8 GI(10),OME(10),X(N),Y(N),EX(NEX),EY(NEX)
      REAL*8 DFI(3*N+3*NCELLS),FI(9*NE)
      REAL*8 QCELL(NCELLS,3),UIP(3*L)
      Integer NEPT(NE,2),VEXAI(NE,2)
      REAL*8 CX(L),CY(L) ! ,HW(3*L,9)
      REAL*8 NOE(NE,3),NOIC(NCELLS,4)

!C-------LOCAL DIMENSIONS NEED TO MAKE THESE UNIQUE
!      DIMENSION HW(3,9)
      REAL*8 HW11,HW12,HW13,HW14,HW15,HW16,HW17,HW18,HW19,HW21,HW22,HW23
      REAL*8 HW24,HW25,HW26,HW27,HW28,HW29,HW31,HW32,HW33,HW34
      REAL*8 HW35,HW36,HW37,HW38,HW39
!      DIMENSION GW(3,9)
      REAL*8 GW11,GW12,GW13,GW14,GW15,GW16,GW17,GW18,GW19,GW21,GW22,GW23
      REAL*8 GW24,GW25,GW26,GW27,GW28,GW29,GW31,GW32,GW33,GW34
      REAL*8 GW35,GW36,GW37,GW38,GW39
!      DIMENSION QL(3)
      REAL*8 QL1,QL2,QL3
!      DIMENSION U(9),P(9)
      REAL*8 U1,U2,U3,U4,U5,U6,U7,U8,U9,P1,P2,P3,P4,P5,P6,P7,P8,P9
      REAL*8 V11,V12,V13,V21,V22,V31,V32,P11,P12,P13,P21,P22,P23,P31,P32,P33
      REAL*8 U11,U12,U13,U21,U22,U23,U31,U32,U33,UIP1,UIP2,UIP3
      REAL*8 QLD1,QLD2,QLD3,XP,YP,X1,X2,X3,Y1,Y2,Y3,F1,F2,F3,AA,BB,CC,DD
      REAL*8 GIN, G, D, XL, DE, R, R1, R2, XCO, YCO, SUBL, ETAA, ETAB, ASUB, BSUB, EJAC
      REAL*8 DRN, DX, DY, EN1, EN2, Z, A, B, ZK1, PI, Z1, TT, TTT, W1, BESI, BESI1, ZZ
      REAL*8 BESK0, BESK, BESK1, EXBESK, EXBESK1, XSQ
      Integer IX,tx,NSUB,N1,N2,N3,ISUB,TELLESJ,i1,i2,i3,I,J,JJJ


!C--IDENTIFYING WHICH THREAD TO WORK ON

      IX = threadIdx%x + (blockIdx%x-1) * blockdim%x
      i1=3*IX-2
      i2=3*IX-1
      i3=3*IX

      if(IX <= L) then

        PI=3.14159265359 !(CHECK)

!C---- CX , CY
        XP=CX(IX)
        YP=CY(IX)

!C----LOOP ON THE BOUNDARY ELEMENTS (START HERE, LOOP ON NE)
        DO J=1,NE
      HW11=0.0
      HW12=0.0
      HW13=0.0
      HW14=0.0
      HW15=0.0
      HW16=0.0
      HW17=0.0
      HW18=0.0
      HW19=0.0
      HW21=0.0
      HW22=0.0
      HW23=0.0
      HW24=0.0
      HW25=0.0
      HW26=0.0
      HW27=0.0
      HW28=0.0
      HW29=0.0
      HW31=0.0
      HW32=0.0
      HW33=0.0
      HW34=0.0
      HW35=0.0
      HW36=0.0
      HW37=0.0
      HW38=0.0
      HW39=0.0
     
!C----ELEMENT NODES
        N1=NOE(J,1)
        N2=NOE(J,2)
        N3=NOE(J,3)

!C----ELEMENT EXTREAM POINTS COORDINATS
        X1=EX(NEPT(J,1))
        Y1=EY(NEPT(J,1))
        X3=EX(NEPT(J,2))
        Y3=EY(NEPT(J,2))
!C----ELEMENT MID POINT = MID NODE COORDINATES
        X2=X(N2)
        Y2=Y(N2)

!C----COMPUTE HW MATRIX

!C----THIS SUBROUTINE COMPUTES THE COEFFICIENTS A,B,C,D 
!C----FOR THE X AND Y DRIVATIVE W.R.T. EXAI 
      
      AA=2*X1/(-1.D0*(-1.D0-1.D0))+2*X2/(-1.D0*1.D0)+2*X3/(1.D0*(1.D0--1.D0))

      BB=-1.D0*X1/(-1.D0*(-1.D0-1.D0))-(-1.D0+1.D0)*X2/(-1.D0*1.D0)--1.D0*X3/(1.D0*(1.D0--1.D0))

      CC=2*Y1/(-1.D0*(-1.D0-1.D0))+2*Y2/(-1.D0*1.D0)+2*Y3/(1.D0*(1.D0--1.D0))

      DD=-1.D0*Y1/(-1.D0*(-1.D0-1.D0))-(-1.D0+1.D0)*Y2/(-1.D0*1.D0)--1.D0*Y3/(1.D0*(1.D0--1.D0))

      D=E*T**3/(12*(1-XNU*XNU))
      XL=SQRT(10.)/T
      DE=8*PI*D

!C----FORM THE ELEMENT SUB ELEMENT NUMBER
!      CALL CALCNSUB(XP,YP,X1,Y1,X2,Y2,X3,Y3,NSUB)
      NSUB=10

!C----SUB ELEMENT LENGTH
      SUBL=2./NSUB

!C----LOOP ON THE SUB ELEMENTS
      DO ISUB=1,NSUB

      ETAA=-1.+SUBL*(ISUB-1)
      ETAB=ETAA+SUBL

      ASUB=0.5*(ETAB-ETAA)
      BSUB=0.5*(ETAB+ETAA)
      EJAC=ASUB

      DO I=1,NGAUSS

!C----THE NEW LOACTION FOR GAUSS POINT
      GIN=GI(I)*ASUB+BSUB

!C----SHAPE FUNCTIONS FOR COMPUTING GEOMETRIC COORDINATES
!C----NOTE HERE WE SHOULD USE THE CONTINOUS SHAPE FUNCTIONS

      F1=GIN*(GIN-1.D0)/(-1.D0*(-1.D0-1.D0))
      F2=(GIN--1.D0)*(GIN-1.D0)/(-1.D0*1.D0)
      F3=GIN*(GIN--1.D0)/(1.D0*(1.D0--1.D0))


!C----THE GEOMETRIC COORDINATES FOR GAUSS POINTS
      XCO=X1*F1+X2*F2+X3*F3
      YCO=Y1*F1+Y2*F2+Y3*F3

!C----X AND Y DRIVATIVES W.R.T. EXAI
      DX=AA*GIN+BB
      DY=CC*GIN+DD

!C----THE GAUSS JACOBIAN
      G=SQRT(DX*DX+DY*DY)

      R=SQRT((XCO-XP)**2+(YCO-YP)**2)
      R1=(XCO-XP)/R
      R2=(YCO-YP)/R
      
      EN1=DY/G
      EN2=-DX/G
      DRN=EN1*R1+EN2*R2

!C----THE SHAPE FUNCTIONS WHICH WILL BE USED IN THE KERNEL INTEGRATION
!C----NOTE HERE WE SHOULD USE THE ELEMENT CASE SHAPE FUNCTION

      F1=GIN*(GIN-VEXAI(J,1))/(VEXAI(J,2)*(VEXAI(J,2)-VEXAI(J,1)))
      F2=(GIN-VEXAI(J,2))*(GIN-VEXAI(J,1))/(VEXAI(J,2)*VEXAI(J,1))
      F3=GIN*(GIN-VEXAI(J,2))/(VEXAI(J,1)*(VEXAI(J,1)-VEXAI(J,2)))

!C----HW ELEMENTS

      D=E*T**3/(12*(1-XNU*XNU))
      XL=SQRT(10.)/T
      DE=8*PI*D
      Z=XL*R

!C=======================================================================
!C --BESSEL FUNCTIONS
!C=======================================================================
!      FUNCTION BESK0(X)
!C----CALCULATES BESSEL FUNCTION K0

       IF(Z.LE.2.) THEN
          TT = Z/3.75
          TTT = TT*TT
          BESI = (((((0.0045813*TTT + 0.0360768)*TTT + 0.2659732)*TTT +1.2067492)*TTT + 3.0899424)*TTT + 3.5156229)*TTT + 1.
          Z1 = Z/2.
          ZZ = Z1*Z1
          BESK0= (((((0.00000740*ZZ + 0.00010750)*ZZ + 0.00262698)*ZZ +0.03488590)*ZZ + 0.23069756)*ZZ + 0.42278420)*ZZ -0.57721566 - LOG(Z1)*BESI
         ELSE
          W1 = 2./Z
          XSQ = SQRT(Z)
          EXBESK =((((((0.00053208*W1  - 0.00251540)*W1 +0.00587872)*W1 - 0.01062446)*W1 + 0.02189568)*W1 - 0.07832358)*W1 + 1.25331414)/XSQ
          IF(Z .GT. 86.)THEN
                  BESK = 0.
          ENDIF
          BESK0= EXP(-Z)*EXBESK
        ENDIF
!        END
!C=======================================================================
!        FUNCTION BESK1(X)
!C----CALCULATES BESSEL FUNCTION K1

       IF(Z.LE.2.) THEN
          TT = Z/3.75
          TTT = TT*TT
          BESI1 = ((((((.00032411*TTT + .00301532)*TTT + .02658733)*TTT +.15084934)*TTT + .51498869)*TTT + .87890594)*TTT + .5 )*Z
          Z1 = Z/2.
          ZZ = Z1*Z1
          BESK1 = ((((((-.00004686*ZZ - .00110404)*ZZ - .01919402)*ZZ -.18156897)*ZZ - .67278579)*ZZ + .15443144)*ZZ + 1.)/Z + LOG(Z1)*BESI1
       ELSE
          W1 = 2./Z
          XSQ =SQRT(Z)
          EXBESK1 =((((((-.00068245*W1  + .00325614)*W1 -.00780353)*W1 +.01504268)*W1 - .03655620)*W1 + .23498619)*W1 + 1.25331414)/XSQ
          IF(Z .GT. 86.)THEN
                   BESK1 = 0.
          ENDIF
          BESK1 = EXP(-Z)*EXBESK1
        ENDIF
!        END
!C=======================================================================
      A=BESK0+(BESK1-1/Z)*2/Z
      B=BESK0+(BESK1-1/Z)/Z
      ZK1=BESK1

!C----THE P ELEMENTS
      P11=-((4*A+2*Z*ZK1+1-XNU)*(DRN+R1*EN1)+(4*A+1+XNU)*R1*EN1-2*(8*A+2*Z*ZK1+1-XNU)*R1*R1*DRN)/(4*PI*R)
      P12=-((4*A+2*Z*ZK1+1-XNU)*(R2*EN1)+(4*A+1+XNU)*R1*EN2-2*(8*A+2*Z*ZK1+1-XNU)*R1*R2*DRN)/(4*PI*R)
      P13=(B*EN1-A*R1*DRN)*XL*XL/(2*PI)
      P21=-((4*A+2*Z*ZK1+1-XNU)*(R1*EN2)+(4*A+1+XNU)*R2*EN1-2*(8*A+2*Z*ZK1+1-XNU)*R2*R1*DRN)/(4*PI*R)
      P22=-((4*A+2*Z*ZK1+1-XNU)*(DRN+R2*EN2)+(4*A+1+XNU)*R2*EN2-2*(8*A+2*Z*ZK1+1-XNU)*R2*R2*DRN)/(4*PI*R)
      P23=(B*EN2-A*R2*DRN)*XL*XL/(2*PI)
      P31=-((2*(1+XNU)/(1-XNU)*LOG(Z)-1)*EN1+2*R1*DRN)*(1-XNU)/(8*PI)
      P32=-((2*(1+XNU)/(1-XNU)*LOG(Z)-1)*EN2+2*R2*DRN)*(1-XNU)/(8*PI)
      P33=-DRN/(2*PI*R)

!      END
!C=======================================================================

!C----FRIST NODE
      HW11=HW11+P11*F1*OME(I)*G*EJAC
      HW12=HW12+P12*F1*OME(I)*G*EJAC
      HW13=HW13+P13*F1*OME(I)*G*EJAC
      HW21=HW21+P21*F1*OME(I)*G*EJAC
      HW22=HW22+P22*F1*OME(I)*G*EJAC
      HW23=HW23+P23*F1*OME(I)*G*EJAC
      HW31=HW31+P31*F1*OME(I)*G*EJAC
      HW32=HW32+P32*F1*OME(I)*G*EJAC
      HW33=HW33+P33*F1*OME(I)*G*EJAC

!C----SECOND NODE
      HW14=HW14+P11*F2*OME(I)*G*EJAC
      HW15=HW15+P12*F2*OME(I)*G*EJAC
      HW16=HW16+P13*F2*OME(I)*G*EJAC
      HW24=HW24+P21*F2*OME(I)*G*EJAC
      HW25=HW25+P22*F2*OME(I)*G*EJAC
      HW26=HW26+P23*F2*OME(I)*G*EJAC
      HW34=HW34+P31*F2*OME(I)*G*EJAC
      HW35=HW35+P32*F2*OME(I)*G*EJAC
      HW36=HW36+P33*F2*OME(I)*G*EJAC

!C----THIRD NODE
      HW17=HW17+P11*F3*OME(I)*G*EJAC
      HW18=HW18+P12*F3*OME(I)*G*EJAC
      HW19=HW19+P13*F3*OME(I)*G*EJAC
      HW27=HW27+P21*F3*OME(I)*G*EJAC
      HW28=HW28+P22*F3*OME(I)*G*EJAC
      HW29=HW29+P23*F3*OME(I)*G*EJAC
      HW37=HW37+P31*F3*OME(I)*G*EJAC
      HW38=HW38+P32*F3*OME(I)*G*EJAC
      HW39=HW39+P33*F3*OME(I)*G*EJAC

      END DO
      
      END DO

!C----FRIST NODE
!      HW(3*IX-2,1)=HW11
!      HW(3*IX-2,2)=HW12
!      HW(3*IX-2,3)=HW13
!      HW(3*IX-1,1)=HW21
!      HW(3*IX-1,2)=HW22
!      HW(3*IX-1,3)=HW23
!      HW(3*IX,1)=HW31
!      HW(3*IX,2)=HW32
!      HW(3*IX,3)=HW33

!C----SECOND NODE
!      HW(3*IX-2,4)=HW14
!      HW(3*IX-2,5)=HW15
!      HW(3*IX-2,6)=HW16
!      HW(3*IX-1,4)=HW24
!      HW(3*IX-1,5)=HW25
!      HW(3*IX-1,6)=HW26
!      HW(3*IX,4)=HW34
!      HW(3*IX,5)=HW35
!      HW(3*IX,6)=HW36

!C----THIRD NODE
!      HW(3*IX-2,7)=HW17
!      HW(3*IX-2,8)=HW18
!      HW(3*IX-2,9)=HW19
!      HW(3*IX-1,7)=HW27
!      HW(3*IX-1,8)=HW28
!      HW(3*IX-1,9)=HW29
!      HW(3*IX,7)=HW37
!      HW(3*IX,8)=HW38
!      HW(3*IX,9)=HW39

!C----ASSIGN THE ELEMENT TRACTIONS AND DISPLACEMENTS TO THEIR VECTORS
        U1=DFI(3*N1-2)
        U2=DFI(3*N1-1)
        U3=DFI(3*N1)

        U4=DFI(3*N2-2)
        U5=DFI(3*N2-1)
        U6=DFI(3*N2)

        U7=DFI(3*N3-2)
        U8=DFI(3*N3-1)
        U9=DFI(3*N3)

!C----COMPUTE THE INTERNAL POINT DISPLACEMENT

     UIP1=UIP1-HW11*U1-HW12*U2-HW13*U3-HW14*U4-HW15*U5-HW16*U6-HW17*U7-HW18*U8-HW19*U9
     UIP2=UIP2-HW21*U1-HW22*U2-HW23*U3-HW24*U4-HW25*U5-HW26*U6-HW27*U7-HW28*U8-HW29*U9
     UIP3=UIP3-HW31*U1-HW32*U2-HW33*U3-HW34*U4-HW35*U5-HW36*U6-HW37*U7-HW38*U8-HW39*U9

   END DO  !BOUNDARY ELEMENTS


   UIP(i1)=UIP1
   UIP(i2)=UIP2
   UIP(i3)=UIP3

      ENDIF

      END subroutine INTDONE
!C============================================================================
!C============================================================================
end module Subroutines
!C============================================================================
!C --INTERNAL POINT DISPLACEMENT/ Compile with pgfortran IntdoneGPUReady_2.CUF
!C============================================================================


      Program INTDISPL
      use Subroutines
      use cudafor

! C----COMPUTE THE INTERNAL POINTS DISPLACEMENTS

      IMPLICIT NONE

      CHARACTER*15 INFILE1,OUTFILE1,OUTFILE2,UFILE,TFILE,AIPFILE,FLAGSFILE,TITLE1,TITLE2

      Integer  I,N,NEX,NE,L,NCELLS,NCOLS,NBEAMS,NGAUSS,ISOLVER,JJ,NI,NFLU,NFLT,NO,NFLAIP,NFLIPU,NFLAGS
      Integer  IPFLAGS(400)
      REAL*8  T,XNU,E,Q,istat,time1,time2
      REAL*8, ALLOCATABLE :: X(:),Y(:)
      REAL*8, ALLOCATABLE :: EX(:),EY(:)
      REAL*8, ALLOCATABLE :: FI(:),DFI(:)
      REAL*8, ALLOCATABLE :: NOE(:,:)
      Integer, ALLOCATABLE :: NEPT(:,:),VEXAI(:,:)
      REAL*8, ALLOCATABLE :: CX(:),CY(:),VQA(:)
      REAL*8, ALLOCATABLE :: NOIC(:,:)      
      REAL*8, ALLOCATABLE :: QCELL(:,:)
      REAL*8, ALLOCATABLE :: UIP(:),CHECK(:)

      REAL*8 , device , allocatable :: UIPd(:),CHECKD(:)
      REAL*8 , device , allocatable :: Xd(:),Yd(:),EXd(:),EYd(:),CXd(:),CYd(:),VQAd(:)
      REAL*8 , device , allocatable :: FId(:),DFId(:),NOEd(:,:),NOICd(:,:),QCELLd(:,:)
      Integer , device , allocatable :: NEPTd(:,:),VEXAId(:,:)
      Integer , device :: IPFLAGSd(400)  


!C--------READING THE PREVIOUS RESULTS FROM THE U AND T FILES

      REAL*8 , device :: GId(10),OMEd(10)
      REAL*8 :: GI(10),OME(10)
      DATA GI/0.148874338981631,-0.148874338981631, &
              0.433395394129247,-0.433395394129247, &
              0.679409568299024,-0.679409568299024, &
              0.865063366688985,-0.865063366688985, &
              0.973906528517172,-0.973906528517172/
      
      DATA OME/0.295524224714753,0.295524224714753, &
               0.269266719309996,0.269266719309996, &
               0.219086362515982,0.219086362515982, &
               0.149451349150581,0.149451349150581, &
               0.066671344308688,0.066671344308688/
      NI=1
      NFLU=2
      NFLT=3
      NO=4
      NFLAIP=5
      NFLIPU=6
      NFLAGS=7

      WRITE(*,'(A,$)')' NAME OF .IN FILE: '
      READ(*,*) INFILE1
      WRITE(*,'(A,$)')' NAME OF UFILE: '
      READ(*,*) UFILE
      WRITE(*,'(A,$)')' NAME OF TFILE: '
      READ(*,*) TFILE
      WRITE(*,'(A,$)')' NAME OF OUTPUT FILE: '
      READ(*,*) OUTFILE1
      WRITE(*,'(A,$)')' NAME OF AIPFILE: '
      READ(*,*) AIPFILE
      WRITE(*,'(A,$)')' NAME OF OUTPUT FILE 2: '
      READ(*,*) OUTFILE2
      WRITE(*,'(A,$)')' NAME OF FLAGSFILE: '
      READ(*,*) FLAGSFILE

      OPEN(UNIT=NI,FILE=INFILE1,STATUS='OLD')
      OPEN(UNIT=NFLU,FILE=UFILE,STATUS='OLD')
      OPEN(UNIT=NFLT,FILE=TFILE,STATUS='OLD')
      OPEN(UNIT=NO,FILE=OUTFILE1,STATUS='NEW')
      OPEN(UNIT=NFLAIP,FILE=AIPFILE,STATUS='OLD')
      OPEN(UNIT=NFLIPU,FILE=OUTFILE2,STATUS='NEW')
      OPEN(UNIT=NFLAGS,FILE=FLAGSFILE,STATUS='OLD')

      WRITE(NO,*)'READING BOUNDARY RESULTS FROM U AND T FILES'

!c-----READ FLAGS FILE
      READ(NFLAGS,*) (IPFLAGS(I),I=1,400) 

!c-----READ .IN FILE
      READ(NI,' (A) ') TITLE1           
      READ(NI,' (A) ') TITLE2           
      READ(NI,*) NGAUSS           
      READ(NI,*) T,XNU,E,Q
      READ(NI,*) ISOLVER
      READ(NI,*) NEX,N,NE,L,NCELLS,NCOLS

      WRITE(NO,*)'.IN FILE:', NGAUSS,T,XNU,E,Q,NEX,N,NE,L,NCELLS,NCOLS

!c-----ALLOCATE ON HOST
      Allocate (X(N),Y(N),EX(NEX),EY(NEX),NOE(NE,3))
      Allocate (NEPT(NE,2),VEXAI(NE,2),NOIC(NCELLS+NCOLS,4),QCELL(NCELLS+NCOLS,3))
      Allocate (DFI(3*N+3*NCOLS),FI(9*NE))

!c-----ALLOCATE ON DEVICE
      Allocate (Xd(N),Yd(N),EXd(NEX),EYd(NEX),NOEd(NE,3))
      Allocate (NEPTd(NE,2),VEXAId(NE,2),NOICd(NCELLS+NCOLS,4),QCELLd(NCELLS+NCOLS,3))
      Allocate (DFId(3*N+3*NCOLS),FId(9*NE))

!c-----CONTINUE READ .IN FILE
      READ(NI,*) (EX(I),EY(I),I=1,NEX)
      READ(NI,*) (X(I),Y(I),I=1,N)
      READ(NI,*) (NOE(I,1),NOE(I,2),NOE(I,3),VEXAI(I,1),VEXAI(I,2),I=1,NE)
      READ(NI,*) (NEPT(I,1),NEPT(I,2),I=1,NE)

      WRITE(NO,*)'!c-----CONTINUE', EX,EY,X,Y,NOE,VEXAI,NEPT

      DO JJ=1,3*N
         READ(NFLU,*) DFI(JJ)
      ENDDO  

      WRITE(NO,*)'DFI:', DFI

      DO JJ=1,9*NE
         READ(NFLT,*) FI(JJ)
      ENDDO

      WRITE(NO,*)'FI:', FI

      NBEAMS = 0
      NCELLS = 0
      READ(NFLT,*)	NCELLS
	
      DO JJ=1,NCELLS
	   READ(NFLT,*) NOIC(JJ,1)
	   READ(NFLT,*) NOIC(JJ,2)
	   READ(NFLT,*) NOIC(JJ,3)
	   READ(NFLT,*) NOIC(JJ,4)
      ENDDO

      WRITE(NO,*)'NOIC:', NOIC
      
      DO JJ=1,NCELLS
           READ(NFLT,*) QCELL(JJ,1)
           READ(NFLT,*) QCELL(JJ,2)
           READ(NFLT,*) QCELL(JJ,3)
      ENDDO

      WRITE(NO,*)'QCELL:',QCELL

      CLOSE(NFLU)
      CLOSE(NFLT)

      WRITE(NO,*)'----------------------------------------------------'
      WRITE(NO,*)'ADDITIONAL INTERNAL POINTS COORDINATES:'

      READ(NFLAIP,*) L

      ALLOCATE(CX(L),CY(L),VQA(L),UIP(3*L))
      ALLOCATE(CXd(L),CYd(L),VQAd(L),UIPd(3*L))
      WRITE(NO,114) L
  114 FORMAT('THE NUMBER OF ADDITIONAL INTERNAL POINTS=',I4)

      WRITE(NO,*)'ADDITIONAL INTERNAL POINTS COORDINATES:'
      WRITE(NO,*)' NO     X         Y       Q-ABOVE'    
      READ(NFLAIP,*) (CX(I),CY(I),VQA(I),I=1,L)      
      WRITE(NO,115) (I,CX(I),CY(I),VQA(I),I=1,L)      
  115 FORMAT(I3,3F10.4)
      WRITE(NO,*)'----------------------------------------------------'
      CLOSE(NFLAIP)

!C=======================================================================

!C----EQUATE & INITIALIZE THE VALUES

      GId=GI 
      OMEd=OME
      Xd=X
      Yd=Y
      EXd=EX
      EYd=EY
      CXd=CX
      CYd=CY
      FId=FI
      DFId=DFI
      NOEd=NOE
      NOICd=NOIC
      QCELLd=QCELL
      NEPTd=NEPT
      VEXAId=VEXAI
      UIPd=0.0

!C-------COMPUTE ON GPU
      call CPU_Time(time1)
      istat=cudaDeviceSynchronize() 
CALL INTDONE<<<((REAL(L/512))+1),(512)>>>(CXd,CYd,Xd,Yd,NOEd,NEPTd,EXd,EYd,FId,DFId,GId,OMEd,VEXAId,NOICd,QCELLd,UIPd,IPFLAGSd,N,NEX,NE,L,T,XNU,E,Q,NGAUSS,NCELLS)
!     dim3(ceiling(REAL(L/512))+1,1,1) ,dim3(512,1,1)
      istat=cudaDeviceSynchronize()
      call CPU_Time(time2)
      open (10000, file='TimeSpentGPURegular.txt', status='unknown')
      write(10000,*) 'Time spent total Regular : ', time2-time1
      UIP=UIPd


!C-------WRITE VALUES OF INT PTS DISPL TO THE .IPU FILE
      WRITE(NFLIPU,2) (I,UIP(3*I-2),UIP(3*I-1),UIP(3*I),I=1,L)
      WRITE(NO,*) UIP

  1   FORMAT('PT',I3,'  Rx=',E17.8,'  Ry=',E17.8,'  U3=',E17.8)
  2   FORMAT(I10,'  ',E20.5,'  ',E20.5,'  ',E20.5)

      END Program

required extra files are here, although i think you might spot my problem without them: https://www.dropbox.com/sh/w40ot94aa6vjkis/AACOmyyARpoP5Y1w6UNW7-5ja?dl=0

Hi Torkin,

I tired your code on a few different machines and see a couple different issues.

On a few systems, mainly Linux, the code ran fine and got correct answers. On others, the code ran, but got NaNs. I traced this to some uninitialized variables, UIP1, UIP2, and UIP3. On some systems, the memory happened to be zero and why it worked. The fix is to explicitly initialize these variables to zero.

I also saw one system with a launch error caused by the use of too many registers. The fix here would be to use the “-Mcuda:maxregcount:63” or reduce the number of threads per block from 512 to 256. I’d recommend reducing the number of threads.

Note, it’s a good idea to check if your kernel errored by calling “cudaGetLastError”. It’s how I found the resister issue.

For example:

       call mykernel<<<...
       istat = cudaGetLastError()
       if (istat .ne. 0) then
          print *, "ERROR: ", cudaGetErrorString(istat)
       endif

Hope this helps,
Mat

Thank you Mat, I lowered threads per block and it works great now. I also found that 64 performed better than 256 as threads per block go, I think this is always case specific.

Tork