block sizes / if branches give inconsistent results

Hi, I have a GPU kernel (part of a much larger code) which is called from the CPU with different block dimensions, as shown below.

The two block/grid configurations I used are as follows:
Case #1: dimblock = dim3(960,1,1), dimgrid = dim3(320,1,1)
Case #2: dimblock = dim3(480,1,1), dimgrid = dim3(640,1,1)
Note that both cases launch the same total number of threads (960 × 320 = 480 × 640 = 307,200), and both block sizes are within the 1024-thread-per-block limit for cc2.0.

I only get correct results (matching the CPU result) for Case #2; Case #1 gives completely incorrect results.

I know that there are no thread bounds checks in the code below, so I would expect out-of-bounds accesses to crash, but that doesn’t happen. Even when I put in thread bounds checks, the situation is the same (only Case #2 gives correct results).

For both cases iorth=1, so the iorth.eq.0 branch should never be taken; yet when I remove that branch from the kernel, I get correct results for both cases. This seems like very strange behavior. Is there something I am missing in the code, and are there any runtime check flags?
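For reference, the bounds check I tried looks roughly like the sketch below, placed right after i, j, k are computed in the kernel shown further down. The limits are my assumption (the declared extents ni, nj, nk, allowing for the ±1 stencil reach); the real interior range comes from the dev_*_b_blk tables. Note that an early return means not every thread reaches syncthreads, which is harmless here only because the kernel uses no shared memory.

c Minimal sketch of a thread bounds guard (assumed limits:
c the declared array extents, allowing for the +/-1 stencil)
      if(ip1.gt.ni .or. jp1.gt.nj .or. kp1.gt.nk) return
      if(im1.lt.1  .or. jm1.lt.1  .or. km1.lt.1 ) return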

To compile the code, I use the flags -fast -Mr8 -Minfo=all -Mcuda=5.0 -ta=nvidia,cc20.

c Kernel Call from CPU
      call kernel_pc<<<dimGrid,dimBlock>>>(d_phi,d_rhs,d_phi1,
     1         ni,nj,nk,mba,m,iorth,unrelax)
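
One runtime check I could add right after this launch is to query the CUDA error state. Below is only an illustrative sketch using the cudafor error API; istat is an assumed local integer that does not exist in the actual code.

c Hedged sketch of a post-launch error check (assumes
c "use cudafor" and an integer variable istat)
      istat = cudaGetLastError()
      if(istat.ne.cudaSuccess) print*,cudaGetErrorString(istat)
      istat = cudaDeviceSynchronize()
      if(istat.ne.cudaSuccess) print*,cudaGetErrorString(istat)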



c Device kernel
      attributes(global)
     1 subroutine kernel_pc(phi,rhs,phi1,ni,nj,nk,mba,
     1                                m,iorth,unrelax)

      implicit none
      real,device,dimension(ni,nj,nk,mba),intent(in)::phi,rhs
      real,device,dimension(ni,nj,nk,mba),intent(inout)::phi1
      integer,value,intent(in) :: ni,nj,nk,mba,m,iorth
      real,value,intent(in)::unrelax
      integer :: i,j,k,im1,ip1,jp1,jm1,kp1,km1,nb,ii,jj,kk,
     1           ie,je,ke

c Map this thread to a grid point via the per-block index
c tables (dev_*_b_blk, assumed device module data). Note that
c with a 1-D block such as dim3(960,1,1), threadidx%y and
c threadidx%z are always 1, so jj and kk never vary.
      nb = blockidx%x
      ii = threadidx%x
      i = dev_i_b_blk(nb,m)+ii-1
      im1=i-1; ip1=i+1
      jj = threadidx%y
      j = dev_j_b_blk(nb,m)+jj-1
      jm1=j-1; jp1=j+1
      kk = threadidx%z
      k = dev_k_b_blk(nb,m)+kk-1
      km1=k-1; kp1=k+1

c Update including the cross-derivative neighbor terms
c (presumably the non-orthogonal-grid case)
      if(iorth.eq.0)then
      phi1(i,j,k,m) = (1. - unrelax) * phi(i,j,k,m)
     1              + unrelax * ( rhs(i,j,k,m) +
     1               ( dev_ap(1,i,j,k,m)*phi(i,jp1,k,m)
     1                +dev_ap(2,i,j,k,m)*phi(i,jm1,k,m)
     1                +dev_ap(3,i,j,k,m)*phi(ip1,j,k,m)
     1                +dev_ap(4,i,j,k,m)*phi(im1,j,k,m)
     1                +dev_ap(5,i,j,k,m)*phi(i,j,kp1,m)
     1                +dev_ap(6,i,j,k,m)*phi(i,j,km1,m)
     1                +dev_ap(7,i,j,k,m)*phi(ip1,jp1,k,m)
     1                +dev_ap(8,i,j,k,m)*phi(im1,jp1,k,m)
     1                +dev_ap(9,i,j,k,m)*phi(ip1,jm1,k,m)
     1               ) * dev_sps(i,j,k,m)
     1              ) / dev_ap(19,i,j,k,m)
c No shared memory is used, so this syncthreads has no effect;
c it is safe only because iorth is uniform across the block
      call syncthreads()
      endif

c Update with the six face-neighbor terms only
c (presumably the orthogonal-grid case)
      if(iorth.eq.1)then
      phi1(i,j,k,m) = (1. - unrelax) * phi(i,j,k,m)
     1              + unrelax * ( rhs(i,j,k,m) +
     1               ( dev_ap(1,i,j,k,m)*phi(i,jp1,k,m)
     1                +dev_ap(2,i,j,k,m)*phi(i,jm1,k,m)
     1                +dev_ap(3,i,j,k,m)*phi(ip1,j,k,m)
     1                +dev_ap(4,i,j,k,m)*phi(im1,j,k,m)
     1                +dev_ap(5,i,j,k,m)*phi(i,j,kp1,m)
     1                +dev_ap(6,i,j,k,m)*phi(i,j,km1,m)
     1               ) * dev_sps(i,j,k,m)
     1              ) / dev_ap(19,i,j,k,m)
      call syncthreads()
      endif
      end subroutine kernel_pc

Hi amitamritkar,

My first guess would be that it’s an optimization issue. What happens if you compile with “-O0” instead of “-fast”?
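
For example, keeping your other options and just swapping the optimization level (the driver name and file name below are only placeholders for your actual build line):

      pgfortran -O0 -Mr8 -Minfo=all -Mcuda=5.0 -ta=nvidia,cc20 mysolver.CUF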

  • Mat

Mat,

Even with the -O0 optimization flag I don’t get correct results (Case #1 still gives incorrect numbers).
Any other suggestions?

Thanks,
Amit

Ok, then it’s a problem with your code and not optimization. Exactly what, I’m not sure.

Can you send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

Note that I’ll be attending the NVIDIA GTC conference this week so it may be a few days before I can investigate.

  • Mat