copyin 2D failed

0: copyin 2D (dst=0x2e41c00, dpitch=3216, src=0xdf90b80, spitch=3216, width=8, h
eight=1) FAILED: 77(an illegal memory access was encountered)

anyone have seen this error before when calling device subroutine?

here is the code that produces this error:
attributes(global) subroutine multipleindices_kernel(valu, ncell)
use cudafor
implicit none
integer, value:: ncell ! ncell = 10
integer :: valu(ncell)
integer :: tempind(10)
integer:: i

tempind(1)=10
tempind(2)=3
tempind(3)=5
tempind(4)=7
tempind(5)=9
tempind(6)=2
tempind(7)=4
tempind(8)=6
tempind(9)=8
tempind(10)=1

i = (blockidx%x - 1 ) * blockdim%x + threadidx%x

if (i .LE. ncell) then
valu(tempind(i))= 1
endif


end subroutine

program test
use cudafor
!use test_2

integer:: ncell, i
integer, allocatable:: values(:)
integer, allocatable, device :: values_d(:)


ncell = 10

allocate(values(ncell), values_d(ncell))

do i=1,ncell
values(i) = i
enddo


values_d = values


call multipleindices_kernel<<< ncell/1024+1,1024 >>> (values_d, ncell)

values = values_d

write (,) values


end program test

thanks!
Dolf

Hi Dolf,

CUDA Fortran kernels need an interface, either implicit or explicit, especially when you’re passing in allocatable arrays. The simplest thing to do here is put “multipleindices_kernel” in a module then an implicit interface will get created.

Hope this helps,
Mat

Hi Matt,

I did research on the subject and I discovered that I need to put all kernels in a module. that solved the issue for this previous example. The real code I have, it is still giving me the same error even when I already have all kernels in the module. here is a portion of the code (original one is too long) I am providing here:

main program:

program test3

use kernels
use reyneq_Dev
use Q4_globals_Dev
use flows_Dev
use average_Dev
use filmtk_Dev
use sl_fthk_Dev
use bearing_Dev

nxmax = 402
nymax = 402
nx = nxmax
ny = nymax
allocate(hxlnewDev(nxmax,nymax),hxlDev(nxmax,nymax),hxrnewDev(nxmax,nymax),hxrDev(nxmax,nymax), &
		hydnewDev(nxmax,nymax),hyunewDev(nxmax,nymax),hydDev(nxmax,nymax),hyuDev(nxmax,nymax))
allocate(xrefDev(nxmax),yrefDev(nxmax),hDev(nxmax,nymax),hnewDev(nxmax,nymax),hxyDev(nxmax,nymax))
allocate(ztaDev(nxmax,nymax),etaDev(nxmax,nymax),bearxDev(nxmax,nymax), &
		 bearyDev(nxmax,nymax),pDev(nxmax,nymax),poldDev(nxmax,nymax))
allocate(awDev(nxmax,nymax),apDev(nxmax,nymax),aeDev(nxmax,nymax), &
         arDev(nxmax,nymax),asDev(nxmax,nymax),anDev(nxmax,nymax))
allocate(ajDev(nxmax),bjDev(nxmax),cjDev(nxmax),djDev(nxmax))

grid = dim3(ceiling(real(nx-1)/threads%x), &
			ceiling(real(ny-1)/threads%y),1)  !ny-1
call reyneq_p1_kernel<<<grid,threads>>>(nx,ny,nxmax,nymax,qnDev,pnDev,nter,iqpo,icoe,idisc, &
		pitDev,pirDev,gamaDev,d0Dev,siDev,dtDev,omega0Dev,t1Dev,t2Dev,t3Dev,t4Dev,&
		zdatDev,qndatDev,xrefDev,yrefDev,pDev,poldDev,hDev,hnewDev,awDev,aeDev,asDev,&
		anDev,arDev,apDev,bearxDev,bearyDev,hydnewDev,hyunewDev,hxlnewDev,hxrnewDev,ztaDev,etaDev)
		istat = cudaThreadSynchronize()
		if (istat .ne. 0) write(*,*) ' reyneq_p1 kernel error ',cudaGetErrorString(istat),istat

grid = dim3(ceiling(real(nx-1)/threads%x), &
			ceiling(real(ny-1)/threads%y),1)  !ny-1
call reyneq_p2_kernel<<<grid,threads>>>(nx,ny,nxmax,nymax,pDev,ajDev,bjDev,cjDev,djDev, &
		awDev,aeDev,asDev,anDev,apDev,arDev,akdDev,aknDev)
istat = cudaThreadSynchronize()
if (istat .ne. 0) write(*,*) ' reyneq_p2 kernel error ',cudaGetErrorString(istat)

grid = dim3(ceiling(real(nx-1)/threads%x), &
ceiling(real(ny-1)/threads%y),1)  !ny-1
call reyneq_p3_kernel<<<grid,threads>>>(nx,ny,nxmax,nymax,pDev,xrefDev,yrefDev, &
		awDev,apDev,aeDev,arDev,asDev,anDev,ajDev,bjDev,cjDev,djDev)

istat = cudaThreadSynchronize()
if (istat .ne. 0) write(*,*) ' reyneq_p3 kernel error ',cudaGetErrorString(istat)

end program test3

memory modules:

! ***************************************
!module Q4_globals_Dev
! ****************************************
use cudafor
!use Q4_sizes
!use Q4_globals
implicit none
!cuda device variables
integer :: istat
type(dim3) :: grid, threads
type(cudaDeviceProp) :: prop

!slider info
!xl = x length, yl = y length, zl=**, xg = **, xt=taper Length, ht = taper angle, rebase = base recess
real(8), device :: xlDev, ylDev, zlDev, xgDev, xtDev, htDev, rebaseDev, crownDev, camberDev, twistDev
!Grid
!==========
!integer, device :: nxDev,nx1Dev,nx2Dev,nx3Dev,nx4Dev
real(8), device, allocatable, dimension(:) :: xrefDev,xref1Dev,xref2Dev,xref3Dev,xref4Dev, &
yrefDev,yref1Dev,yref2Dev,yref3Dev,yref4Dev
!real(8), device, allocatable, dimension(:,:) :: bearxDev, bearx1Dev, bearx2Dev,bearx3Dev,&
! bearx4Dev, bearyDev, beary1Dev, beary2Dev, beary3Dev, beary4Dev

!Run Setup and Initial attitude
real(8), device :: skeDev, raDev, rpmDev, u0Dev, hmDev, hx0Dev, h0Dev, hsDev, hyDev, p0xlDev

!height data (say what hnew and h are)
real(8), device, allocatable, dimension(:,:) :: hDev, h1Dev, h2Dev, h3Dev, h4Dev, hnewDev, hnew1Dev,&
hnew2Dev, hnew3Dev, hnew4Dev

!pressure
!========
real(8), device, allocatable :: pDev(:,:), p1Dev(:,:), p2Dev(:,:), p3Dev(:,:), p4Dev(:,:)

!constants
real(8), device :: corCoefDev !d0, gama, pir, pit, t1, t2, t3, t4, slip_beta, slip_gamma, accom,

!reynolds equation
!=================
real(8), device :: akmaxDev, akDev, ak0Dev, ak1Dev, ak2Dev, ak3Dev, ak4Dev

!reynolds eqn residuals
!=====================
real(8), device, allocatable, dimension(:,:) :: resDev, res1Dev, res2Dev, res3Dev, res4Dev !, &
! su01Dev, su02Dev, su03Dev, su04Dev
!various parameters needed by the reynolds equation
!many of these are set in ave_height()
real(8), device, allocatable, dimension(:,:) :: cohimxDev,cohimx1Dev,cohimx2Dev, &
cohimx3Dev,cohimx4Dev,cohjmxDev,cohjmx1Dev,cohjmx2Dev,cohjmx3Dev,cohjmx4Dev, &
himaxDev,himax1Dev,himax2Dev,himax3Dev,himax4Dev, &
himinDev,himin1Dev,himin2Dev,himin3Dev,himin4Dev, &
hjmaxDev,hjmax1Dev,hjmax2Dev,hjmax3Dev,hjmax4Dev, &
hjminDev,hjmin1Dev,hjmin2Dev,hjmin3Dev,hjmin4Dev, &
recssiDev,recssi1Dev,recssi2Dev,recssi3Dev,recssi4Dev, &
recssjDev,recssj1Dev,recssj2Dev,recssj3Dev,recssj4Dev

!FORCES
!================
real(8), device :: emaxDev, errDev, fDev, fnegDev, fposDev, fspDev, fsrDev, xfDev, yfDev, &
!xint(4), yint(4), hint(4), &
hminDev, MinFHDev, MinFHLocXDev, MinFHLocYDev, &
jacDev(3,3), rintDev(4), hintgeDev(4),hgapDev(4), &
XmomDev, YmomDev, xPosLocDev, yPosLocDev, xNegLocDev, yNegLocDev, &
fvdw_outputDev !,ZmomDev
real(8), device, allocatable, dimension (:,:) :: vdwMolecularForceMapDev
real(8), device, allocatable, dimension (:) :: xintNewDev, yintNewDev, hintNewDev

!contact and surface roughness
!=============================
real(8), device :: cpDev !(nxx,nyx)
real(8), device :: rsikDev,ctaDev,raspDev,fcrDev,txrDev,tyrDev, &
aratioDev,eyoungDev,ydstDev,ydcoeDev,pratioDev,frcoeDev,ahcDev,bhcDev,elecpotDev

!forces on the slider
real(8), device :: f0Dev, xf0Dev, yf0Dev, xfsDev, yfsDev, Pitch_StiffnessDev, Roll_StiffnessDev,&
PSADev, RSADev

logical, device :: crashDev

end module Q4_globals_Dev
!***************************************************************
module filmtk_Dev
!***************************************************************
!use Dyn_sizes_module
implicit none

real(8), device, allocatable :: hxlnewDev(:,:), hxrnewDev(:,:), hydnewDev(:,:), hyunewDev(:,:)
real(8), device :: zx1Dev,zx2Dev,zy1Dev,zy2Dev,zxDev

end module filmtk_Dev

!***************************************************************
module sl_fthk_Dev
!***************************************************************

implicit none

real(8), device, allocatable :: hxlDev(:,:), hxrDev(:,:), hydDev(:,:), hyuDev(:,:), hxyDev(:,:), &
hsadDev(:,:)

end module sl_fthk_Dev

!***************************************************************
module flows_Dev
!***************************************************************

implicit none

real(8), device :: t1Dev, t2Dev, t3Dev, t4Dev, gamaDev, pirDev, pitDev, d0Dev, zdatDev(1001), qndatDev(1001)
!integer, device :: nterDev, icoeDev, iqpoDev

end module flows_Dev

!***************************************************************
module average_Dev
!***************************************************************
!use Dyn_sizes_module
implicit none

real(8), device, allocatable :: ztaDev(:,:), etaDev(:,:)

end module average_Dev


!***************************************************************
module bearing_Dev
!***************************************************************
!use Dyn_sizes_module
implicit none

real(8), device, allocatable :: bearxDev(:,:), bearyDev(:, :)

end module bearing_Dev

! added by Dolf 06/04/15
!****************************************
module reyneq_Dev
!****************************************
!use Dyn_sizes_module
implicit none

real(8), device, allocatable :: awDev(:,:),apDev(:,:),aeDev(:,:),arDev(:,:),asDev(:,:),anDev(:,:),&
aw1Dev(:,:),ap1Dev(:,:),ae1Dev(:,:),as1Dev(:,:),an1Dev(:,:),ar1Dev(:,:),dp1Dev(:,:), &
aw2Dev(:,:),ap2Dev(:,:),ae2Dev(:,:),as2Dev(:,:),an2Dev(:,:),ar2Dev(:,:),dp2Dev(:,:), &
aw3Dev(:,:),ap3Dev(:,:),ae3Dev(:,:),as3Dev(:,:),an3Dev(:,:),ar3Dev(:,:),dp3Dev(:,:), &
aw4Dev(:,:),ap4Dev(:,:),ae4Dev(:,:),as4Dev(:,:),an4Dev(:,:),ar4Dev(:,:),dp4Dev(:,:), &
poldDev(:,:),ajDev(:),bjDev(:),cjDev(:),djDev(:)
real(8), device :: qnDev(1001),pnDev(1001)
real(8), device :: siDev,dtDev,omega0Dev,akdDev,aknDev

end module reyneq_Dev

GPU kernels:

module kernels
contains

!========================================================
attributes (global) subroutine reyneq_p1_kernel(nx,ny,nxmax,nymax,qn,pn,nter,iqpo,icoe,idisc, &
pit,pir,gama,d0,si,dt,omega0,t1,t2,t3,t4,zdat,qndat,xref,yref,p,pold,h,hnew,aw,ae,as,an,ar,ap, &
bearx,beary,hydnew,hyunew,hxlnew,hxrnew,zta,eta)
!========================================================

implicit none
integer, value :: nter,iqpo,icoe,idisc,nx,ny,nxmax,nymax
integer :: i,j,nxm1,nym1
integer :: NY1M1, IP1, NY4, JP1, NX5, NX4, NX1, NX3, NX2, I1, I2, IM1, KINT, JM1, K, L, &
IDIRECT, NX1M1, NY2, NY3, NY1, J1, J2, NY5, LOC
real(8) :: H2IP_B, AP2, AP3, AP1, AP4, APJ_B, RATIO, DEXP, AMI_B, RAK, H2JP, AW4, AW3, AW2, AW1, &
BJ, QNJP_B, PEMI_B, PEPJ, H2IM, APJ, H2IP, API, H2JP_B, DABS, PEPJ_B, H2JM, PEMJ_B, DMI, DMJ, &
AN4, AN1, AN2, AN3, PEPI_B, DPI_B, AMI, AMJ, PN_B, API_B, AMJ_B,H2JM_B, AJ, RAKPRE, &
FPJ, AE1, AE3, AE2, AE4, QNIM_B, P2JM, FMI_B, QNJP, DELYJM2, DELYJM1, QNH2JM, AMULT, DPJ_B, &
FMJ_B, AKN, DJ, AKD, P2IM, DELXIM2, DELXIM1, FPJ_B, P2IP, QNH2IM_B, AS3, CJ, AS1, DMI_B, AS4, &
QNH2JP, FMJ, FMI, ZTAM, QNH2JP_B, ETAP, FPI_B, DELYJP1, DELYJP2, ETAM, AR1, AR2, AR3, AR4, &
AS2, QNJM_B, XMXP, DPJ, DPI, P2JP, QNH2IM, DELXIP1, DELXIP2, QNIP, QNH2JM_B, H2IM_B, QNIM, QNH2IP, &
PEPI, QNH2IP_B, DMJ_B, FPI,DP4, DP3, DP2, DP1, QNIP_B, YMYP, PEMJ,ZTAP, PEMI, QNJM
real(8) :: t1,t2,t3,t4,pit,pir,gama,d0,si,dt,omega0,zdat(1001),qndat(1001),xref(nxmax),yref(nxmax),p(nxmax,nymax), &
pold(nxmax,nymax),bearx(nxmax,nymax),beary(nxmax,nymax),zta(nxmax,nymax),eta(nxmax,nymax), &
hydnew(nxmax,nymax),hyunew(nxmax,nymax),hxlnew(nxmax,nymax),hxrnew(nxmax,nymax),h(nxmax,nymax),&
hnew(nxmax,nymax),aw(nxmax,nymax),ap(nxmax,nymax),ae(nxmax,nymax),ar(nxmax,nymax),as(nxmax,nymax),&
an(nxmax,nymax),qn(1001),pn(1001)
real(8) :: C22, Q2, QN1, QN2, S1, U1, FF0, XI, DEL, AIJ, C12, C11, DLOG10, DFLOAT,&
V1, V2, V3, V4, V5, Z1, Z2, A, B, R1, R2, V, EV, F1, F2, Z0, Z, DLOG, ALZ

dimension a(20),b(20)

i = (blockidx%x - 1) * blockDim%x + threadidx%x
j = (blockidx%y - 1) * blockDim%y + threadidx%y

nxm1 = nx-1
nym1 = ny-1

if(j >= 2 .and. j <= nym1) then
!do 300 j=2,nym1
jm1=j-1
jp1=j+1
delyjp1=yref(jp1)-yref(j )
delyjm1=yref(j )-yref(jm1)
if(j.ne.nym1) delyjp2=yref(j+2)-yref(jp1)
if(j.ne.2) delyjm2=yref(jm1)-yref(j-2)
ymyp =(yref(jp1)-yref(jm1))/2.d0
!
if(i >= 2 .and. i >= nxm1) then
! do 250 i=2,nxm1
im1=i-1
ip1=i+1
delxip1=xref(ip1)-xref(i )
delxim1=xref(i )-xref(im1)
if(i.ne.nxm1) delxip2=xref(i+2)-xref(ip1)
if(i.ne.2) delxim2=xref(im1)-xref(i-2)
xmxp =(xref(ip1)-xref(im1))/2.d0
!
h2im = hydnew(i ,j )
h2ip = hydnew(i+1,j )
h2jm = hxlnew(i ,j )
h2jp = hxlnew(i ,j+1)
!
h2im_b = hyunew(i ,j )
h2ip_b = hyunew(i+1,j )
h2jm_b = hxrnew(i ,j )
h2jp_b = hxrnew(i ,j+1)
!
ztam=zta(i ,j )
ztap=zta(i ,j+1)
etam=eta(i ,j )
etap=eta(i+1,j )
!
p2im = (p(im1,j)+p(i,j))/2.d0
p2ip = (p(ip1,j)+p(i,j))/2.d0
p2jm = (p(i,jm1)+p(i,j))/2.d0
p2jp = (p(i,jp1)+p(i,j))/2.d0
!
pn(i)=p2im*h2im
! pn = input, qn = output
! call flow(1) here!!!
! call flow(pn,qnim)
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 100
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 100
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 100

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)pn(i))
u1 = t4/(pn(i)pn(i)pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 100
!
!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0
pn(i)
if (icoe.eq.1) go to 30
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 25 i = 3,nter
xi = dfloat(i)
aij = xi
(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
25 continue

icoe = 1
30 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 35 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 27
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
27 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

35 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 100

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 55
if(loc.lt.1)goto 56
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 100
!
! if inverse knudsen no. is out of range of database
! use first order slip
55 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 100
56 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 100

end if
! end flow(1) sub here!!!
100 continue
qnim = qn(i)
qnh2im = qnimh2imh2im
fmi = h2im*(bearx(im1,j)+bearx(i,j))/2.d0
dmi=qnh2im/delxim1
pemi=fmi/dmi
!
if(etam.eq.1.d0) then
ami_b=0.d0
fmi_b=0.d0
else

pn_b=p2im*h2im_b
! call flow(2) here!!!
! call flow(pn_b,qnim_b)
pn(i) = pn_b
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 200
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 200
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 200

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)pn(i))
u1 = t4/(pn(i)pn(i)pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 200
!
!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0
pn(i)
if (icoe.eq.1) go to 45
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 40 i = 3,nter
xi = dfloat(i)
aij = xi
(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
40 continue

icoe = 1
45 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 50 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 42
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
42 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

50 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 200

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 65
if(loc.lt.1)goto 66
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 200
!
! if inverse knudsen no. is out of range of database
! use first order slip
65 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 200
66 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 200

end if
! end flow(2) sub here!!!
200 continue
! end flow sub
! setting up the output
qnim_b = qn(i)

qnh2im_b = qnim_bh2im_bh2im_b
fmi_b = h2im_b*(bearx(im1,j)+bearx(i,j))/2.d0
dmi_b=qnh2im_b/delxim1
pemi_b=fmi_b/dmi_b
endif
!
pn(i)=p2ip*h2ip
! call flow(3) here!!!
! call flow(pn,qnip)
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 300
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 300
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 300

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)pn(i))
u1 = t4/(pn(i)pn(i)pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 300
!
!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0
pn(i)
if (icoe.eq.1) go to 60
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 54 i = 3,nter
xi = dfloat(i)
aij = xi
(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
54 continue

icoe = 1
60 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 64 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 47
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
47 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

64 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 300

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 75
if(loc.lt.1)goto 76
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 300
!
! if inverse knudsen no. is out of range of database
! use first order slip
75 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 300
76 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 300

end if
! end flow(3) sub here!!!
300 continue
! end flow sub
qnip = qn(i)
qnh2ip = qniph2iph2ip
fpi = h2ip*(bearx(ip1,j)+bearx(i,j))/2.d0
dpi=qnh2ip/delxip1
pepi=fpi/dpi
!
if(etap.eq.1.d0) then
api_b=0.d0
fpi_b=0.d0
else
pn_b=p2ip*h2ip_b
! call flow(4) sub here!!!
!call flow(pn_b,qnip_b)
pn(i) = pn_b
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 400
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 400
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 400

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)pn(i))
u1 = t4/(pn(i)pn(i)pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 400
!
!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0
pn(i)
if (icoe.eq.1) go to 74
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 69 i = 3,nter
xi = dfloat(i)
aij = xi
(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
69 continue

icoe = 1
74 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 79 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 62
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
62 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

79 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 400

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 85
if(loc.lt.1)goto 86
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 400
!
! if inverse knudsen no. is out of range of database
! use first order slip
85 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 400
86 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 400

end if
! end flow(4) sub here!!!
400 continue
! end flow sub
! added by Dolf 7/8/15
qnip_b = qn(i)
qnh2ip_b = qnip_bh2ip_bh2ip_b
fpi_b = h2ip_b*(bearx(ip1,j)+bearx(i,j))/2.d0
dpi_b=qnh2ip_b/delxip1
pepi_b=fpi_b/dpi_b
endif
!
pn(i)=p2jm*h2jm
! call flow(5) here!!!
!call flow(pn,qnjm)
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 500
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 500
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 500

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)pn(i))
u1 = t4/(pn(i)pn(i)pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 500
!
!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0
pn(i)
if (icoe.eq.1) go to 89
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 84 i = 3,nter
xi = dfloat(i)
aij = xi
(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
84 continue

icoe = 1
89 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 94 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 67
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
67 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

94 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 500

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 95
if(loc.lt.1)goto 96
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 500
!
! if inverse knudsen no. is out of range of database
! use first order slip
95 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 500
96 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 500

end if
! end flow(5) sub here!!!
500 continue
! end flow sub
qnjm = qn(i)
qnh2jm = qnjmh2jmh2jm
fmj = h2jm*(beary(i,jm1)+beary(i,j))/2.d0
dmj=qnh2jm/delyjm1
pemj=fmj/dmj
!
if(ztam.eq.1.d0) then
amj_b=0.d0
fmj_b=0.d0
else
pn_b=p2jm*h2jm_b
! call flow(6) here!!!
!call flow(pn_b,qnjm_b)
pn(i) = pn_b
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 600
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 600
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 600

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 600

!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0pn(i)
if (icoe.eq.1) go to 104
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 99 i = 3,nter
xi = dfloat(i)
aij = xi*(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
99 continue

icoe = 1
104 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 109 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 72
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
72 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

109 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 600

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 105
if(loc.lt.1)goto 106
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 600
!
! if inverse knudsen no. is out of range of database
! use first order slip
105 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 600
106 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 600

end if
! end flow(6) sub here!!!
600 continue
qnjm_b = qn(i)
qnh2jm_b = qnjm_bh2jm_bh2jm_b
fmj_b = h2jm_b*(beary(i,jm1)+beary(i,j))/2.d0
dmj_b=qnh2jm_b/delyjm1
pemj_b=fmj_b/dmj_b
endif
!
pn(i)=p2jp*h2jp
! call flow(7) here!!!
!call flow(pn,qnjp)
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 700
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 700
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 700

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 700

!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0pn(i)
if (icoe.eq.1) go to 119
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 114 i = 3,nter
xi = dfloat(i)
aij = xi*(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
114 continue

icoe = 1
119 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 124 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 77
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
77 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

124 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 700

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 115
if(loc.lt.1)goto 116
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 700
!
! if inverse knudsen no. is out of range of database
! use first order slip
115 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 700
116 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 700

end if
! end flow(7) sub here!!!
700 continue
qnjp = qn(i)
qnh2jp = qnjph2jph2jp
fpj = h2jp*(beary(i,jp1)+beary(i,j))/2.d0
dpj=qnh2jp/delyjp1
pepj=fpj/dpj
!
if(ztap.eq.1.d0) then
apj_b=0.d0
fpj_b=0.d0
else
pn_b=p2jp*h2jp_b
! call flow(8) here!!!
!call flow(pn_b,qnjp_b)
!ha use the continuum model…
if (iqpo.eq.0) then
qn(i) = pn(i)
!return
goto 800
!
!ha use the first order slip model…
else if (iqpo.eq.1) then
qn(i) = pn(i) + t1
!return
goto 800
!
!ha use the second order slip model…
else if (iqpo.eq.2) then
r1 = t2/pn(i)
r2 = r1/pn(i)
qn(i) = pn(i) + t1 + r1
!return
goto 800

!ha use the asymtotic fk model…
else if (iqpo.eq.3) then
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 800

!ha use the full fk model…
else if (iqpo.eq.4) then
z = d0pn(i)
if (icoe.eq.1) go to 134
a(1) = 0.d0
a(2) = -1.d0
b(1) = -pir
b(2) = 1.5d0
(1.d0-gama)
do 129 i = 3,nter
xi = dfloat(i)
aij = xi*(xi-1.d0)(xi-2.d0)
a(i) = (-2.d0
a(i-2))/aij
b(i) = (-2.d0b(i-2)-(3.d0xixi-6.d0xi+2.d0)*a(i))/aij
129 continue

icoe = 1
134 if (z.ge.1.1d0) then

v = 1.889881575d0z**0.66666666666666666d0
v1 = 12.d0
v
v2 = 24.d0vv1
v3 = 36.d0vv2
v4 = 48.d0vv3
v5 = 60.d0vv4
ev = pitdexp(-v)
ff0 = ev
(1.d0-1.d0/v1+25.d0/v2-1225.d0/v3 + 89425.d0/v4-7263025.d0/v5)
f1 = evdsqrt(v/3.d0)(1.d0+5.d0/v1-35.d0/v2 + 665.d0/v3+9625.d0/v4-9284275.d0/v5)
f2 = ev*(v/3.d0)*(1.d0+17.d0/v1-35.d0/v2 + 1925.d0/v3-175175.d0/v4+22247225.d0/v5)
else

ff0 = 0.d0
f1 = 0.d0
f2 = 0.d0
alz = dlog(z)
do 139 i = nter,1,-1
xi = dfloat(i)
aij = 1.d0/(1.d0+xi)
if (i.eq.1) then
ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)
go to 82
end if

ff0 = ((alz*xi+1.d0)*a(i) + b(i)*xi + ff0)*z
82 f1 = (a(i)*alz + b(i) + f1)z
f2 = (aij
((alz-aij)*a(i)+b(i)) + f2)*z

139 continue
ff0 = -0.5d0ff0
f1 = 0.5d0
(f1 + 1.d0)
f2 = pir/4.d0 - 0.5d0*(f2 + 1.d0)z
end if
c11 = 8.d0-z
zz(pir/12.d0-z/16.d0)-2.d0z(4.d0+zz)ff0 - (16.d0+zz(8.d0+zz/8.d0))f1-z(16.d0+zz)f2
c22 = 1.d0- 2.d0
f1
c12 = 2.d0 - (pir/2.d0)z + zz/4.d0-2.d0zff0 - (4.d0 +zz/2.d0)f1 - 2.d0zf2
del = c11c22 - c12c12
q2 = (pir/del)(c11 - zz*(c12/6.d0-zzc22/144.d0))
qn(i) = (-1.d0/z+q2)*(6.d0/z)*pn(i)
!return
goto 800

! use the data base of precalculated qp based on fk model…
else if (iqpo.eq.5) then
z0=d0pn(i)
loc=dint(200.d0
dlog10(z0)+401)
if (loc.ge.801) goto 125
if(loc.lt.1)goto 126
z1=zdat(loc)
!
qn1=qndat(loc)
z2=zdat(loc+1)
qn2=qndat(loc+1)
!
! interpolate…
qn(i)=(qn2-qn1)*(z0-z1)/(z2-z1)+qn1
!return
goto 800
!
! if inverse knudsen no. is out of range of database
! use first order slip
125 continue
r1 = t2/pn(i)
s1 = t3/(pn(i)*pn(i))
u1 = t4/(pn(i)*pn(i)*pn(i))
qn(i) = pn(i) + t1 + r1 - s1 - u1
!return
goto 800
126 continue
qn(i)=-pn(i)6.d0/pirdlog(z0)/z0
!return
goto 800

end if
! end flow(8) sub here!!!
800 continue
qnjp_b = qn(i)
qnh2jp_b = qnjp_bh2jp_bh2jp_b
fpj_b = h2jp_b*(beary(i,jp1)+beary(i,j))/2.d0
dpj_b=qnh2jp_b/delyjp1
pepj_b=fpj_b/dpj_b
endif
!
go to (11,12,13,14,15),idisc
!
! power law
!
11 ami=dmax1(0.d0,dmi*(1.d0-0.1d0dabs(pemi))*5)
api=dmax1(0.d0,dpi
(1.d0-0.1d0
dabs(pepi))*5)
amj=dmax1(0.d0,dmj
(1.d0-0.1d0dabs(pemj))*5)
apj=dmax1(0.d0,dpj
(1.d0-0.1d0
dabs(pepj))*5)
!
if(etam.ne.1.d0) then
ami_b=dmax1(0.d0,dmi_b
(1.d0-0.1d0dabs(pemi_b))*5)
endif
if(etap.ne.1.d0) then
api_b=dmax1(0.d0,dpi_b
(1.d0-0.1d0
dabs(pepi_b))*5)
endif
if(ztam.ne.1.d0) then
amj_b=dmax1(0.d0,dmj_b
(1.d0-0.1d0dabs(pemj_b))*5)
endif
if(ztap.ne.1.d0) then
apj_b=dmax1(0.d0,dpj_b
(1.d0-0.1d0
dabs(pepj_b))*5)
endif
goto 20
!
! central difference
!
12 ami=dmi
(1.d0-0.5d0dabs(pemi))
api=dpi
(1.d0-0.5d0dabs(pepi))
amj=dmj
(1.d0-0.5d0dabs(pemj))
apj=dpj
(1.d0-0.5d0dabs(pepj))
!
if(etam.ne.1.d0) then
ami_b=dmi_b
(1.d0-0.5d0dabs(pemi_b))
endif
if(etap.ne.1.d0) then
api_b=dpi_b
(1.d0-0.5d0dabs(pepi_b))
endif
if(ztam.ne.1.d0) then
amj_b=dmj_b
(1.d0-0.5d0dabs(pemj_b))
endif
if(ztap.ne.1.d0) then
apj_b=dpj_b
(1.d0-0.5d0*dabs(pepj_b))
endif

goto 20
!
! 1st-order upwind: idisc = 2
!
13 ami=dmi
api=dpi
amj=dmj
apj=dpj
!
if(etam.ne.1.d0) then
ami_b=dmi_b
endif
if(etap.ne.1.d0) then
api_b=dpi_b
endif
if(ztam.ne.1.d0) then
amj_b=dmj_b
endif
if(ztap.ne.1.d0) then
apj_b=dpj_b
endif
goto 20
!
! hybrid
!
14 ami=dmax1(0.d0,dmi*(1.0d0-0.5d0dabs(pemi)))
api=dmax1(0.d0,dpi
(1.0d0-0.5d0dabs(pepi)))
amj=dmax1(0.d0,dmj
(1.0d0-0.5d0dabs(pemj)))
apj=dmax1(0.d0,dpj
(1.0d0-0.5d0dabs(pepj)))
!
if(etam.ne.1.d0) then
ami_b=dmax1(0.d0,dmi_b
(1.0d0-0.5d0dabs(pemi_b)))
endif
if(etap.ne.1.d0) then
api_b=dmax1(0.d0,dpi_b
(1.0d0-0.5d0dabs(pepi_b)))
endif
if(ztam.ne.1.d0) then
amj_b=dmax1(0.d0,dmj_b
(1.0d0-0.5d0dabs(pemj_b)))
endif
if(ztap.ne.1.d0) then
apj_b=dmax1(0.d0,dpj_b
(1.0d0-0.5d0dabs(pepj_b)))
endif
goto 20
!
! exponential
!
15 ami=dmi
dabs(pemi)/(dexp(dabs(pemi))-1.d0)
api=dpidabs(pepi)/(dexp(dabs(pepi))-1.d0)
amj=dmj
dabs(pemj)/(dexp(dabs(pemj))-1.d0)
apj=dpjdabs(pepj)/(dexp(dabs(pepj))-1.d0)
!
if(etam.ne.1.d0) then
ami_b=dmi_b
dabs(pemi_b)/(dexp(dabs(pemi_b))-1.d0)
endif
if(etap.ne.1.d0) then
api_b=dpi_bdabs(pepi_b)/(dexp(dabs(pepi_b))-1.d0)
endif
if(ztam.ne.1.d0) then
amj_b=dmj_b
dabs(pemj_b)/(dexp(dabs(pemj_b))-1.d0)
endif
if(ztap.ne.1.d0) then
apj_b=dpj_bdabs(pepj_b)/(dexp(dabs(pepj_b))-1.d0)
endif
!
20 continue
!
aw(im1,jm1)=-ymyp
(etam*(ami+dmax1( fmi,0.d0)) + (1.d0-etam)(ami_b+dmax1( fmi_b,0.d0)))
ae(im1,jm1)=-ymyp
(etap*(api+dmax1(-fpi,0.d0)) + (1.d0-etap)(api_b+dmax1(-fpi_b,0.d0)))
as(im1,jm1)=-xmxp
(ztam*(amj+dmax1( fmj,0.d0)) + (1.d0-ztam)(amj_b+dmax1( fmj_b,0.d0)))
an(im1,jm1)=-xmxp
(ztap*(apj+dmax1(-fpj,0.d0)) + (1.d0-ztap)(apj_b+dmax1(-fpj_b,0.d0)))
ap(im1,jm1)=-aw(im1,jm1)-ae(im1,jm1) &
-as(im1,jm1)-an(im1,jm1) &
-ymyp
(etamfmi+(1.d0-etam)fmi_b) &
+ymyp
(etap
fpi+(1.d0-etap)fpi_b) &
-xmxp
(ztamfmj+(1.d0-ztam)fmj_b) &
+xmxp
(ztap
fpj+(1.d0-ztap)fpj_b)
!
if(idirect.ne.1) then
amult = xmxp * ymyp * si / dt / omega0
ar(im1,jm1)=amult
h(i,j)pold(i,j)
ap(im1,jm1)=ap(im1,jm1)+amult
hnew(i,j)
else
ar(im1,jm1)=0.d0
endif ! idirect
endif ! i
endif ! j
!250 continue
!300 continue

end subroutine reyneq_p1_kernel
!=============================================================
attributes (global) subroutine reyneq_p2_kernel(nx,ny,nxmax,nymax,p, &
aj,bj,cj,dj,aw,ae,as,an,ap,ar,akd,akn)
!=============================================================
implicit none

integer, value :: nx,ny,nxmax,nymax
integer :: i,j,im1,ip1,jp1,jm1,nxm1,nym1
real(8) :: aj(nxmax),bj(nxmax),cj(nxmax),dj(nxmax),p(nxmax,nymax), &
aw(nxmax,nymax),ae(nxmax,nymax),an(nxmax,nymax),as(nxmax,nymax), &
ar(nxmax,nymax),ap(nxmax,nymax)
real(8) :: akd,akn

i = (blockidx%x - 1) * blockDim%x + threadidx%x
j = (blockidx%y - 1) * blockDim%y + threadidx%y

nym1 = ny-1
nxm1 = nx-1
!do j=2,nym1
if (j >=2 .and. j <= nym1) then
jm1=j-1
jp1=j+1
!do i=2,nxm1
if (i >=2 .and. i <= nxm1) then
im1=i-1
ip1=i+1
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
akd=akd+dabs(aj(im1)*p(im1,j)+bj(im1)*p(i,j) + cj(im1)*p(ip1,j)-dj(im1))
akn=akn+dabs(bj(im1)*p(i,j))
!enddo
!enddo
endif
endif
end subroutine reyneq_p2_kernel

!=============================================================
attributes (global) subroutine reyneq_p3_kernel(nx,ny,nxmax,nymax,p,xref,yref, &
aw,ap,ae,ar,as,an,aj,bj,cj,dj)
!=============================================================
use cudafor
implicit none
integer, value :: nx,ny,nxmax,nymax
integer :: i,j,k,km1,nxm1,nym1,jm1,jp1,im1,ip1,nxm2,nym2,n
real(8) :: p(nxmax,nymax),xref(nxmax),yref(nxmax),aw(nxmax,nymax),ap(nxmax,nymax), &
ae(nxmax,nymax),ar(nxmax,nymax),as(nxmax,nymax),an(nxmax,nymax),aj(nxmax),bj(nxmax), &
cj(nxmax),dj(nxmax),beta(nxmax),gamma(nxmax)
real(8) :: ymyp,xmxp

i = (blockidx%x - 1) * blockDim%x + threadidx%x
j = (blockidx%y - 1) * blockDim%y + threadidx%y

nxm1 = nx-1
nym1 = ny-1
nxm2 = nx-2
nym2 = ny-2

!do 500 j=2,nym1
if ( j >= 2 .and. j <= nym1) then
jm1=j-1
jp1=j+1
ymyp=(yref(jp1)-yref(jm1))/2.d0
!
!do 450 i=2,nxm1
if(i >= 2 .and. i <= nxm1) then
im1=i-1
ip1=i+1
xmxp=(xref(ip1)-xref(im1))/2.d0
!
aj(im1)=aw(im1,jm1)
bj(im1)=ap(im1,jm1)
cj(im1)=ae(im1,jm1)
dj(im1)=ar(im1,jm1)-as(im1,jm1)*p(i,jm1) - an(im1,jm1)*p(i,jp1)
!
if(i.eq.2) then
dj(im1)=dj(im1)-aw(im1,jm1)*p(im1,j)
aj(im1)=0.d0
else if (i.eq.nxm1) then
dj(im1)=dj(im1)-ae(im1,jm1)*p(ip1,j)
cj(im1)=0.d0
endif
!
!450 continue
endif ! i
!call tridag(nxm2,aj,bj,cj,dj)
! tridag starts
n = nxm2
beta(1)=bj(1)
gamma(1)=dj(1)/beta(1)
do 10 k=2,n
km1=k-1
beta(k)=bj(k)-aj(k)*cj(km1)/beta(km1)
gamma(k)=(dj(k)-aj(k)*gamma(km1))/beta(k)
10 continue
dj(n)=gamma(n)
do 20 k=n-1,1,-1
dj(k)=gamma(k)-cj(k)*dj(k+1)/beta(k)
20 continue
! tridag ends
do 475 k=1,n !nxm2
p(k+1,j)=dj(k)
475 continue

!500 continue
endif ! j

end subroutine reyneq_p3_kernel
end module

end of code:

it is very strange case to me to encounter such error.

please advice what might be the potential cause for this.

Dolf

Hi Dolf,

I was getting a different launch error but determined that was because thread%x and thread%y weren’t initialized. Once fixed I ran your code through cuda-memcheck which showed many out-of-bounds references in “reyneq_p1_kernel”. I presume this is the causing the error you’re seeing.

Next I looked at “reyneq_p1_kernel” and noticed that the logic of one of your bounds checking if statement looked wrong. Changing “i>=nxm1” to “i<=nxm1”, allowed cuda-memcheck to run without error.

!if(i >= 2 .and. i >= nxm1) then
if(i >= 2 .and. i <= nxm1) then

Can you try seeing if this fixes your issue as well?

  • Mat

Note, can you please use the “code” block tag around your code the next time you post code? Otherwise longer lines get wrapped around and it’s a bit of a pain to fix the formatting. - Thanks!

I was getting a different launch error but determined that was because thread%x and thread%y weren’t initialized.

how can I initialize the thread%x and thread%y??

Once fixed I ran your code through cuda-memcheck which showed many out-of-bounds references in “reyneq_p1_kernel”

how did you fix this error?
where can I find cuda-memcheck? I looked in PGI folder, I could not see the executable. Please advice.
Sorry about the code block thing, I will use it next time.

thanks,
Dolf

how can I initialize the thread%x and thread%y??

“threads” is the type(dim3) variable you’re using to set the kernel’s launch bounds near the top of your main program. It’s declared in one of your modules. To initialize do something like the following:

  threads%x = 16
  threads%y = 16
  grid = dim3(ceiling(real(nx-1)/threads%x), & 
ceiling(real(ny-1)/threads%y),1) !ny-1

Otherwise, you’re using garbage values here to compute “grid”.

how did you fix this error?

As shown in the previous post. I simply changed the logic of your if statement to say that “i” needed to be less than or equal to your “nxm1” bounds variable. You had it so “i” needed to be greater than or equal to “nxm1”. In other words, “i” was only used when it’s value was out of bounds.

!if(i >= 2 .and. i >= nxm1) then 
if(i >= 2 .and. i <= nxm1) then



where can I find cuda-memcheck?

cuda-memcheck is CUDA utility and not something PGI ships. If you don’t have CUDA installed, you can find the toolkit at: https://developer.nvidia.com/cuda-downloads

Hope this helps,
Mat

Hi Matt,
thanks for the clarification. I did initialize threads%x and threads%y, fixed the i <= nxm1 statement but I still getting errors. this is what I see when running “cuda-memcheck --tool racecheck” on my final compiled test3.exe code:

C:\Users\Dolf\Desktop\test3>cuda-memcheck --tool racecheck test3.exe save errors
.dat
========= CUDA-MEMCHECK
  reyneq_p3 kernel error
 an illegal memory access was encountered

========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)

C:\Users\Dolf\Desktop\test3>

here is the errors I get when I run “cuda-memcheck test3.exe”:


C:\Users\Dolf\Desktop\test3>cuda-memcheck test3.exe
========= CUDA-MEMCHECK
  reyneq_p1 kernel error
 unspecified launch failure

             4
  reyneq_p2 kernel error
 unspecified launch failure

  reyneq_p3 kernel error
 unspecified launch failure

========= Invalid __local__ write of size 4
=========     at 0x00000630 in kernels_reyneq_p1_kernel_
=========     by thread (7,0,0) in block (0,20,0)
=========     Address 0x00fffc20 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
========= Invalid __local__ write of size 4
=========     at 0x00000630 in kernels_reyneq_p1_kernel_
=========     by thread (2,0,0) in block (0,20,0)
=========     Address 0x00fffc20 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time



========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (15,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (14,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (13,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (12,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (11,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (10,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (9,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (8,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (7,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (6,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (5,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (4,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (3,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (2,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (1,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (0,11,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (15,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (14,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (13,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (12,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (11,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (10,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (9,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (8,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (7,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (6,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (5,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (4,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (3,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (2,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (1,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000448 in kernels_reyneq_p3_kernel_
=========     by thread (0,10,0) in block (18,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x19f) [0x1b0f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2885) [0x44f5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaThreadSynchronize + 0xf5) [0x19585]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (cudathreadsynchronize_ + 0x12) [0xb472]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x288a) [0x44fa]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x872b0]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= ERROR SUMMARY: 33 errors

please advice.

what options do you use with cuda-memcheck?

Dolf

Hi Dolf,

Sorry about that. I only looked at the first kernel. This error is coming from the third one.

I traced the problem to the automatic arrays beta and gamma. In general you should avoid using automatics in kernels. They require every thread to “malloc” memory, which can be quite slow. However, it should still work.

If you lower the nxmax and nymax values down to 32, then it seems to pass. This indicates to me that the device malloc is failing for larger values and is returning null pointers (as shown in the cuda-memcheck output).

I don’t think that you’re running out of memory since if you use fixed size arrays, then the code passes as well. I haven’t worked with automatics in CUDA Fortran kernels myself so don’t know if there are other limitations and the engineer that worked on this is out on vacation.

Hence, I went ahead and opened a problem report (TPR#21851) and sent it to him to look at when he’s back.

In the meantime, the work around would be to make beta and gamma be fixed size arrays.

Please give this a try and let me know if it works for you.

Thanks,
Mat

!=============================================================
attributes (global) subroutine reyneq_p3_kernel(nx,ny,nxmax,nymax,p,xref,yref, &
aw,ap,ae,ar,as,an,aj,bj,cj,dj)
!=============================================================
use cudafor
implicit none
integer, value :: nx,ny,nxmax,nymax
integer :: i,j,k,km1,nxm1,nym1,jm1,jp1,im1,ip1,nxm2,nym2,n
real(8) :: p(nxmax,nymax),xref(nxmax),yref(nxmax),aw(nxmax,nymax),ap(nxmax,nymax), &
ae(nxmax,nymax),ar(nxmax,nymax),as(nxmax,nymax),an(nxmax,nymax),aj(nxmax),bj(nxmax), &
cj(nxmax),dj(nxmax)
real(8) :: beta(402),gamma(402)

as for the gamma & beta I changed them to allocatables.

initial values:

nxmax = 192
nymax = 192
nx = nxmax
ny = nymax
threads%x = 32 
threads%y = 16



!****************************************
module reyneq_Dev
!****************************************
	!use Dyn_sizes_module
	implicit none

	real(8), device, allocatable :: awDev(:,:),apDev(:,:),aeDev(:,:),arDev(:,:),asDev(:,:),anDev(:,:),&
		aw1Dev(:,:),ap1Dev(:,:),ae1Dev(:,:),as1Dev(:,:),an1Dev(:,:),ar1Dev(:,:),dp1Dev(:,:), &
        aw2Dev(:,:),ap2Dev(:,:),ae2Dev(:,:),as2Dev(:,:),an2Dev(:,:),ar2Dev(:,:),dp2Dev(:,:), &
        aw3Dev(:,:),ap3Dev(:,:),ae3Dev(:,:),as3Dev(:,:),an3Dev(:,:),ar3Dev(:,:),dp3Dev(:,:), &
        aw4Dev(:,:),ap4Dev(:,:),ae4Dev(:,:),as4Dev(:,:),an4Dev(:,:),ar4Dev(:,:),dp4Dev(:,:), &
		poldDev(:,:),ajDev(:),bjDev(:),cjDev(:),djDev(:),betaDev(:),gammaDev(:)
	real(8), device :: qnDev(1001),pnDev(1001)
	real(8), device :: siDev,dtDev,omega0Dev,akdDev,aknDev

end module reyneq_Dev



!=============================================================
attributes (global) subroutine reyneq_p3_kernel(nx,ny,nxmax,nymax,p,xref,yref, &
	aw,ap,ae,ar,as,an,aj,bj,cj,dj,beta,gamma)
!=============================================================
   
   implicit none
   integer, value :: nx,ny,nxmax,nymax
   integer :: i,j,k,km1,nxm1,nym1,jm1,jp1,im1,ip1,nxm2,nym2,n
   real(8) :: p(nxmax,nymax),xref(nxmax),yref(nxmax),aw(nxmax,nymax),ap(nxmax,nymax), &
	ae(nxmax,nymax),ar(nxmax,nymax),as(nxmax,nymax),an(nxmax,nymax),aj(nxmax),bj(nxmax), &
	cj(nxmax),dj(nxmax),beta(nxmax),gamma(nxmax)
   real(8) :: ymyp,xmxp

but I am still getting errors. by the way, each time I run mem-check I get different set of errors. I don’t know which one is the permanent and which one is transient. Why do I get different errors every time? why not getting errors at all sometimes?

errors:

C:\Users\Dolf\Desktop\test3>cuda-memcheck --log-file errors.dat test3.exe
  reyneq_p3 kernel error
 unspecified launch failure

0: DEALLOCATE: unspecified launch failure

more errors (cmd)(different values each execution)(no clue why!):

C:\Users\Dolf\Desktop\test3>cuda-memcheck --log-file errors.dat test3.exe
 grid           13           13            1
 thread           32           32            0
  reyneq_p3 kernel error
 an illegal memory access was encountered


C:\Users\Dolf\Desktop\test3>cuda-memcheck --log-file errors.dat test3.exe
 grid           13           13            1
 thread           32           32            0
  reyneq_p3 kernel error
 unspecified launch failure


C:\Users\Dolf\Desktop\test3>cuda-memcheck --log-file errors.dat test3.exe
 grid           13           26            1
 thread           32           16            0
  reyneq_p1 kernel error
 unspecified launch failure

             4
  reyneq_p2 kernel error
 unspecified launch failure

  reyneq_p3 kernel error
 unspecified launch failure


C:\Users\Dolf\Desktop\test3>cuda-memcheck --log-file errors.dat test3.exe
 grid           13           26            1
 thread           32           16            0
  reyneq_p3 kernel error
 unspecified launch failure


C:\Users\Dolf\Desktop\test3>cuda-memcheck --log-file errors.dat test3.exe
 grid           13           26            1
 thread           32           16            0
  reyneq_p1 kernel error
 misaligned address

            74
  reyneq_p2 kernel error
 misaligned address

  reyneq_p3 kernel error
 misaligned address



========= CUDA-MEMCHECK
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (7,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (6,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (5,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (4,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (3,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (2,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (1,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Invalid __global__ read of size 8
=========     at 0x000006b0 in kernels_reyneq_p3_kernel_
=========     by thread (0,5,0) in block (4,4,0)
=========     Address 0x501600741778 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b92) [0x4832]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaThreadSynchronize + 0xf5) [0x19585]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (cudathreadsynchronize_ + 0x12) [0xc8a2]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b97) [0x4837]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaFree + 0xfd) [0x1d28d]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (_dev_dealloc03 + 0x85) [0x7115]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (pgf90_dev_mod_dealloc03 + 0x5d) [0x74ed]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2c52) [0x48f2]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88a98]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= ERROR SUMMARY: 10 errors

Regards,
Dolf

as for the gamma & beta I changed them to allocatables.

Don’t gamma and beta need to be thread private? If so, you’ll need to add a second dimension to these arrays (one element per thread).

As for the new errors you’re seeing, I’ll need to investigate. Can you please send the updated source to PGI Customer Service (trs@pgroup.com) and ask them to sent it to me? I want to make sure I’m looking at the same source as you.

Thanks,
Mat

I just sent the code to trs@pgroup.com.

Don’t gamma and beta need to be thread private? If so, you’ll need to add a second dimension to these arrays (one element per thread).

Yes, they need to be private. what is the difference between kernel (reyneq_p3_kernel) private and thread private?

what lines I need to add to make beta and gamma private?

Thanks,
Dolf

Hi Dolf,

Yes, Dave sent me the code. I’ll look at it in a few.

By “thread private”, I mean that every thread needs needs it’s own copy.

what lines I need to add to make beta and gamma private?

Declaring them locally in the kernel but obviously there’s something going wrong with automatics there. You could make the arrays fixed size but will loose flexibility.

Another way is to “manually privatize” the array. By this I mean add another dimension to a global device array where the size of the dimension is equal to the number of threads. Pass this array to the kernel then have every thread only access it’s own “private” array.

  • Mat

Hi Dolf,

For good or bad, I’m not seeing this error. I did get an “out-of-resources” error since 32 x 32 threads was too big for my card so I reduced them to 16x16. But after that, the code ran fine for me on both Linux and Windows using 15.7 and 15.4.

% pgf90 -Mcuda common.f90 kernels.f90 test3.f90 -V15.7
common.f90:
kernels.f90:
test3.f90:
% cuda-memcheck a.out
========= CUDA-MEMCHECK
 grid           32           32            1
 thread           16           16            0
========= ERROR SUMMARY: 0 errors
  • Mat

Hi Mat,
I am using latest PGI compiler PVF 15.7 with MS visual studio 2010.
I have windows 8.1. GeForce 680.

here is what I get when I compile/run test3.exe:

C:\Users\Dolf\Desktop\test3>cuda-memcheck.exe --log-file errors.dat test3.exe
 grid           32           32            1
 thread           16           16            0
  reyneq_p1 kernel error
 misaligned address

            74
  reyneq_p2 kernel error
 misaligned address

  reyneq_p3 kernel error
 misaligned address

0: DEALLOCATE: misaligned address

errors.dat contents:

========= CUDA-MEMCHECK
========= Misaligned Shared or Local Address
=========     at 0x000914e0 in kernels_reyneq_p1_kernel_
=========     by thread (0,6,0) in block (30,6,0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuLaunchKernel + 0x19b) [0x1c45b]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll [0x381a]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x105) [0x1c565]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p1_kernel_ + 0x435) [0x17b5]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2794) [0x4434]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaThreadSynchronize + 0xf5) [0x19585]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (cudathreadsynchronize_ + 0x12) [0xc792]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2799) [0x4439]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaGetLastError + 0xf2) [0x19b92]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (cudagetlasterror_ + 0x12) [0xc772]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x3bfd) [0x589d]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x13a) [0x1c59a]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p2_kernel_ + 0x19f) [0x195f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2923) [0x45c3]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaThreadSynchronize + 0xf5) [0x19585]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (cudathreadsynchronize_ + 0x12) [0xc792]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2928) [0x45c8]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaLaunch + 0x13a) [0x1c59a]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (kernels_reyneq_p3_kernel_ + 0x1cf) [0x1b3f]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b07) [0x47a7]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaThreadSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaThreadSynchronize + 0xf5) [0x19585]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (cudathreadsynchronize_ + 0x12) [0xc792]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2b0c) [0x47ac]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= Program hit cudaErrorMisalignedAddress (error 74) due to "misaligned address" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xe2a22) [0x106422]
=========     Host Frame:C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common\cudart64_65.dll (cudaFree + 0xfd) [0x1d28d]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (_dev_dealloc03 + 0x85) [0x7005]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (pgf90_dev_mod_dealloc03 + 0x5d) [0x73dd]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (MAIN_ + 0x2bc6) [0x4866]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (main + 0x70) [0x10e0]
=========     Host Frame:C:\Users\Dolf\Desktop\test3\test3.exe (__tmainCRTStartup + 0x144) [0x88988]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15444]
=========
========= ERROR SUMMARY: 8 errors

strange isn’t it?
can you advice what else could cause the problem? I am really feeling frustrated.
could it be the GPU card you are using? compiler options? but from the compiler line you sent I don’t see any options that I missed.

Regards,
Dolf

Hi Mat,

here is a new update: I have been able to run just perfectly my test3.exe.
I have tried the code, same compiler settings, same code I sent you on a lenovo laptop of my colleague that has GeForce 960m. I tried multiple times and gave me 0 errors.
I came to conclusion that my card is either on fault or the default compiler settings of PVF 15.7 is not meant for GeForce GTX 680.

What do you think?

regards,
Dolf

I just tried your code on a GTX 690, which is very similar to your GTX 680, and it worked fine. Although I’m using Linux, I’d think the OS wouldn’t matter if the problem is due to the GPU architecture.

What CUDA driver version are you using? I’m using 346.46 on Linux and 340.62 on Windows.

what cuda drivers you are talking about?
if you mean GeForce driver, I am using 347.62 driver. Not sure if this is the problem if you are using GeForce 690. I might have a faulty GPU card.

Regards,
Dolf

Just installed geforce driver version 353.62. Still same problem.

Dolf

Hi Mat,

now that I am installing the latest drivers (355.60), bought GeForce GTX 960, I have a strange problem. When I run the code (test3) the screen will go black, then I get the error:

Nvidia display/kernel mode driver has stopped working and recovered succesfully.

are you familiar with this error? how can I go around that?

thanks,
Dolf

Hi Dolf,

My best guess is that you’re hitting a Watchdog timeout. Though the link you posted looks to be something completely different, so I’m not positive.

The Windows graphics driver, WDDM, will kill long running processes so that you do freeze your monitor. With GeForce and Quadro cards, this can occur even if you don’t have a monitor attached. The Tesla cards, which are designed for compute, use a different driver called TCC and does not encounter this issue. Linux doesn’t have this issue either since it uses TCC for all cards, not just Tesla.

If you do a web search on how to extend the Window’s watchdog timer you might find some ideas. However, most of the solutions involve making changes to the Windows registry which is not something I can recommend.

  • Mat