Accuracy problem or calculation error?

How can I do double-precision computation of sqrt() on the GPU? I moved the loop onto the GPU with CUDA, and the results come out slightly different. I do not know whether the difference is caused by an accuracy problem (sqrt) or by a mistake in my computation.
The original code:

dimension qj0(kdim,idim-1,5,4),fnu(0:jdim,0:kdim,0:idim)
      do i=1,idim-1
        do k=1,kdim-1
          tt=gamma*qj0(k,i,5,1)/qj0(k,i,1,1)
          fnu(0,k,i)=c2bp*tt*sqrt(tt)/(c2b+tt)
          tt=gamma*qj0(k,i,5,3)/qj0(k,i,1,3)
          fnu(jdim,k,i)=c2bp*tt*sqrt(tt)/(c2b+tt)
        enddo
      enddo
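
As an aside, the same loop could also be offloaded without a hand-written kernel by using a CUF kernel directive. This is only a sketch, assuming device-resident copies qj0_d and fnu_d of the two arrays already exist; the compiler picks the launch configuration and privatizes the scalar tt per thread:

!$cuf kernel do(2) <<<*,*>>>
      do i=1,idim-1
        do k=1,kdim-1
          tt=gamma*qj0_d(k,i,5,1)/qj0_d(k,i,1,1)
          fnu_d(0,k,i)=c2bp*tt*sqrt(tt)/(c2b+tt)
          tt=gamma*qj0_d(k,i,5,3)/qj0_d(k,i,1,3)
          fnu_d(jdim,k,i)=c2bp*tt*sqrt(tt)/(c2b+tt)
        enddo
      enddo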

The modified (host) code:

dimension qj0(kdim,idim-1,5,4),fnu(0:jdim,0:kdim,0:idim)
      type(dim3) :: grid,tBlock
      real,device :: fnu_d(0:jdim,0:kdim,0:idim)
      real,device :: qj0_d(kdim,idim-1,5,4)
c     one thread per (k,i) iteration of the original loop nest
      tBlock=dim3(16,32,1)
      grid=dim3(ceiling(real(kdim-1)/tBlock%x),
     +ceiling(real(idim-1)/tBlock%y),1)
c     copy the inputs to the device, launch the kernel, copy the result back
      istat=cudaMemcpy(qj0_d,qj0,size(qj0),cudaMemcpyHostToDevice)
      istat=cudaMemcpy(fnu_d,fnu,size(fnu),cudaMemcpyHostToDevice)
      call do434<<<grid,tBlock>>>(fnu_d,qj0_d,gamma,c2bp,c2b,jdim)
      istat=cudaMemcpy(fnu,fnu_d,size(fnu_d),cudaMemcpyDeviceToHost)

The device code:

          attributes(global) subroutine do434(a,b,d,e,f,g)
          ! a=fnu_d, b=qj0_d, d=gamma, e=c2bp, f=c2b, g=jdim
          implicit none
            real::a(:,:,:)
            real::b(:,:,:,:)
            real,value::d,e,f
            integer,value::g
            integer::i,j,n(2)
            real :: c

            ! global thread indices (CUDA Fortran indices are 1-based)
            i=blockDim%x*(blockIdx%x-1)+threadIdx%x
            j=blockDim%y*(blockIdx%y-1)+threadIdx%y
            ! interior extents of fnu_d: kdim-1 and idim-1
            n(1)=size(a,2)-2
            n(2)=size(a,3)-2
            if(i<=n(1) .and. j<=n(2)) then
              c=d*b(i,j,5,1)/b(i,j,1,1)
              a(1,i,j)=e*c*sqrt(c)/(f+c)
              c=d*b(i,j,5,3)/b(i,j,1,3)
              a(g+1,i,j)=e*c*sqrt(c)/(f+c)
            end if
          end subroutine do434

I compiled the code with the following options:

mpif90 -DP3D_SINGLE -DPG -DDBLE_PRECSN  -g  -r8  -acc -ta=tesla -Mfixed -Minfo=accel -Mcuda -c twoeqn.F

The compared result: [screenshot of the CPU/GPU output comparison omitted]

Hi xll_bit,

sqrt is not expected to be bit-for-bit identical between the GPU and CPU, so some variation is expected. However, there are some things you can do to try to get them closer.

On the GPU, FMA operations are used by default and depending upon your CPU, they may or may not be used. Try adding “-Mnofma” to disable FMA operations to see if that gets you closer.

Also try adding “-Kieee” which tells the compiler to adhere to IEEE 754.

Hope this helps,
Mat
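
To see why FMA changes results at all: a fused multiply-add rounds a*b+c once, while separate multiply and add round twice. Below is a minimal standalone sketch, not from this thread; it assumes a Fortran 2018 compiler that provides the IEEE_FMA intrinsic, and that FMA code generation is disabled (e.g. with -Mnofma) so the contrast is visible:

      program fma_demo
      use, intrinsic :: ieee_arithmetic, only: ieee_fma
      implicit none
      real(8) :: a, b, c
      a = 1.0d0 + 2.0d0**(-30)
      b = 1.0d0 - 2.0d0**(-30)
      c = -1.0d0
c     a*b is exactly 1 - 2**(-60); rounded separately it becomes 1.0d0,
c     so the subsequent add gives exactly 0.0d0
      print *, 'two roundings: ', a*b + c
c     fused multiply-add rounds once and keeps the -2**(-60) residual
      print *, 'one rounding:  ', ieee_fma(a, b, c)
      end program fma_demo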

Thank you! I have tried the -Mnofma and -Kieee options but did not get a better result. Maybe something else is wrong; I will keep looking.

I have found the mistake. I should change a(1,i,j) to a(1,i+1,j+1) and a(g+1,i,j) to a(g+1,i+1,j+1) in the kernel function, because the bounds of the array “a” (fnu_d) on the CPU side are (0:jdim,0:kdim,0:idim), while the assumed-shape dummy argument in the kernel is indexed starting at 1.
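
The two corrected assignments in the kernel (only these lines change):

c             shift the fnu indices by +1: the dummy "a" starts at 1,
c             while fnu starts at 0 on the host
              a(1,i+1,j+1)=e*c*sqrt(c)/(f+c)
              a(g+1,i+1,j+1)=e*c*sqrt(c)/(f+c)
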
Besides, another question: what is the lifetime of the array “fnu_d” on the GPU? Does it exist only within this subroutine, and how can I destroy the array “fnu_d”?

What is the lifetime of the array “fnu_d” on the GPU?

From what you show, it looks like fnu_d is an automatic array, so it will be implicitly allocated upon entry into the subroutine and deallocated upon exit. Its lifetime will match the lifetime of the subroutine.

Though since you don’t show what arguments are being passed in, it’s also possible that “fnu_d” is an argument. In that case I’m not sure what the lifetime of the variable is; it would be determined somewhere higher up in the program, depending on whether this is a static, automatic, or allocatable array and whether it’s a local or module variable.

Does it exist only within this subroutine

Sorry, but I’m not clear on what you’re asking.

I think you are asking whether “fnu_d” is in scope, which it would be, since there’s a declaration for it. I’m not positive whether it’s an argument or a local automatic array, but either way it would be in scope.

Or are you asking if the array is allocated on the GPU? Again, this will depend on what “fnu_d” is, but if it is an automatic, then yes, it will be implicitly allocated upon entry into the subroutine.

and how can I destroy the array “fnu_d”?

If an automatic, it will be implicitly deallocated upon exit of the subroutine.

If instead you want explicit control of its allocation and deallocation, you’ll want to change this to be an allocatable array.

-Mat

I declare the array “fnu_d” in following way:

      subroutine two(fnu,qj0,...)
      dimension qj0(kdim,idim-1,5,4),fnu(0:jdim,0:kdim,0:idim)
      type(dim3) :: grid,tBlock
      real,device :: fnu_d(0:jdim,0:kdim,0:idim)
      real,device :: qj0_d(kdim,idim-1,5,4)
      tBlock=dim3(16,32,1)
      ... ...
       end subroutine

What I know is: if I declare the array “fnu_d” as “real,device,allocatable :: fnu_d(:,:,:)”, I can request memory on the GPU with “cudaMalloc” or “allocate” and release it with “cudaFree” or “deallocate”. But how can I release the memory of fnu_d on the GPU before the end of the subroutine when fnu_d is declared as “real,device :: fnu_d(0:jdim,0:kdim,0:idim)”?
I think the answer is in your earlier response:

and how can I destroy the array “fnu_d”?
If an automatic, it will be implicitly deallocated upon exit of the subroutine.

If instead you want explicit control of its allocation and deallocation, you’ll want to change this to be an allocatable array.

Am I right?

Am I right?

Correct. If you want explicit control of when the device memory is allocated and deallocated, you should use an allocatable array instead of an automatic one.
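
A minimal sketch of that approach (the subroutine name and arguments here are placeholders, not from this thread):

      subroutine two_alloc(jdim,kdim,idim)
      use cudafor
      implicit none
      integer :: jdim,kdim,idim,istat
      real,device,allocatable :: fnu_d(:,:,:)
c     device memory is allocated here, under explicit control
      allocate(fnu_d(0:jdim,0:kdim,0:idim),stat=istat)
c     ... memcpys and kernel launch using fnu_d ...
c     and released here, before the subroutine returns
      deallocate(fnu_d)
      end subroutine two_alloc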