Hi, I have a GPU Kernel (part of a much larger code) which is called from CPU with different block dimensions as shown below.

The 2 block/grid sizes that I used are as follows,

Case # 1 dimblock = dim3(960,1,1) dimgrid = dim3(320,1,1)

Case # 2 dimblock = dim3(480,1,1) dimgrid = dim3(640,1,1)

I only get correct results (=CPU result) for case # 2. Case # 1 gives completely incorrect results.

I know that there are no thread bound checks in the code below and I should ideally get segmentation errors but that doesn’t happen. When I put in the thread bound checks, the situation is still the same (only case # 2 gives correct results).

For both the cases iorth=1. So when I remove the iorth=0 loop from the kernel, I get the correct results for both the cases. This seems to be a very strange behavior and I am wondering if there is something that I am missing here in the code and are there any runtime check flags?

To compile the code, I use the flags, -fast -Mr8 -Minfo=all -Mcuda=5.0 -ta=nvidia,cc20

```
c Kernel Call from CPU
call kernel_pc<<<dimGrid,dimBlock>>>(d_phi,d_rhs,d_phi1,
1 ni,nj,nk,mba,m,iorth,unrelax)
```

```
c Device kernel
attributes(global)
1 subroutine kernel_pc(phi,rhs,phi1,ni,nj,nk,mba,
1 m,iorth,unrelax)
implicit none
real,device,dimension(ni,nj,nk,mba),intent(in)::phi,rhs
real,device,dimension(ni,nj,nk,mba),intent(inout)::phi1
integer,value,intent(in) :: ni,nj,nk,mba,m,iorth
real,value,intent(in)::unrelax
integer :: i,j,k,im1,ip1,jp1,jm1,kp1,km1,nb,ii,jj,kk,
1 ie,je,ke
nb = blockidx%x
ii = threadidx%x
i = dev_i_b_blk(nb,m)+ii-1
im1=i-1; ip1=i+1
jj = threadidx%y
j = dev_j_b_blk(nb,m)+jj-1
jm1=j-1; jp1=j+1
kk = threadidx%z
k = dev_k_b_blk(nb,m)+kk-1
km1=k-1; kp1=k+1
if(iorth.eq.0)then
phi1(i,j,k,m) = (1. - unrelax) * phi(i,j,k,m)
1 + unrelax * ( rhs(i,j,k,m) +
1 ( dev_ap(1,i,j,k,m)*phi(i,jp1,k,m)
1 +dev_ap(2,i,j,k,m)*phi(i,jm1,k,m)
1 +dev_ap(3,i,j,k,m)*phi(ip1,j,k,m)
1 +dev_ap(4,i,j,k,m)*phi(im1,j,k,m)
1 +dev_ap(5,i,j,k,m)*phi(i,j,kp1,m)
1 +dev_ap(6,i,j,k,m)*phi(i,j,km1,m)
1 +dev_ap(7,i,j,k,m)*phi(ip1,jp1,k,m)
1 +dev_ap(8,i,j,k,m)*phi(im1,jp1,k,m)
1 +dev_ap(9,i,j,k,m)*phi(ip1,jm1,k,m)
1 ) * dev_sps(i,j,k,m)
1 ) / dev_ap(19,i,j,k,m)
call syncthreads()
endif
if(iorth.eq.1)then
phi1(i,j,k,m) = (1. - unrelax) * phi(i,j,k,m)
1 + unrelax * ( rhs(i,j,k,m) +
1 ( dev_ap(1,i,j,k,m)*phi(i,jp1,k,m)
1 +dev_ap(2,i,j,k,m)*phi(i,jm1,k,m)
1 +dev_ap(3,i,j,k,m)*phi(ip1,j,k,m)
1 +dev_ap(4,i,j,k,m)*phi(im1,j,k,m)
1 +dev_ap(5,i,j,k,m)*phi(i,j,kp1,m)
1 +dev_ap(6,i,j,k,m)*phi(i,j,km1,m)
1 ) * dev_sps(i,j,k,m)
1 ) / dev_ap(19,i,j,k,m)
call syncthreads()
endif
end subroutine
```