Device kernel error (are maths operations the problem?)

Hello

I have a question about using the log and atan functions inside a device kernel subroutine.

Is it possible to use these two mathematical functions in a kernel, the same way they can be used in normal subroutines processed by the CPU?

In this code I am interested in a(i,j), which I return to the host in the CPU subroutine. However, the kernel is not working and I suspect the maths operations (log, atan, abs).

This is how I called the device:

       call CoefficientDevice<<<1 ,dim3(20,20,1)>>>( adev, ndev, xdev, ydev, bcdev, nodedev, dnormdev )
       istat = cudathreadsynchronize() 

       print *, 'Device Done'
       pause

       print *, 'changing back to a from adev'
       pause
       call system_clock( count=c33 )
       a = adev
       write(6666,*) a
       pause

and unfortunately I get this error when ‘changing back to a from adev’:
0: copyout Memcpy (host=0x88253f0, dev=0x200000, size=1600) FAILED: 30(unknown error)

My GPU is a GT 525 with 1 GB of RAM.

I would really appreciate the help

Ahmed

Hi Ahmed,

Why do you think “log” and “atan2” are the problem? It’s possible, but I would think that something else is going on.

Note that the Memcpy error is most likely due to the kernel error. Unless you add error checking after your kernel, the error won’t become evident until the next time the device is accessed (such as a memcpy).

I don’t see anything obviously wrong in your code, but I would start with:

al   = sqrt((y(1,node(2,j))-y(1,node(1,j)))**2 +(y(2,node(2,j))-y(2,node(1,j)))**2)

Are values of “node” always between 1 and n?
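If you want to rule that out quickly, a simple host-side check before node is copied to the device would do it (just a sketch; it assumes node is your 2 x n host array):

   do j = 1, n
      do i = 1, 2
         if (node(i,j) < 1 .or. node(i,j) > n) then
            print *, 'bad node index at (', i, ',', j, '): ', node(i,j)
         end if
      end do
   end do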

  • Mat

Hey Mat,

Thank you for the reply.

I have just read that “log” and “atan2” are computable on the GPU. I wonder if there is a fault with the way I am using these functions.

Are values of “node” always between 1 and n?

Yes. It is an array of size 2 x n, all of its values are integers, and it is transferred from the host to the GPU.

Is there a problem with calling such functions or are there restrictions to my case? Does abs() work on the GPU too?

For your information, I compiled both with and without -Mcuda=fastmath, and both builds give the same error (while running the exe file).

  • Ahmed

Is there a problem with calling such functions or are there restrictions to my case? Does abs() work on the GPU too?

These intrinsics are supported in CUDA Fortran for the data types listed in Chapter 3 of the CUDA Fortran Programming Guide. If there were a problem with how you’re using these intrinsics, I would expect the results to include NaNs rather than a kernel error.
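For example, a kernel along these lines (a minimal sketch with made-up names, not your code) is perfectly legal:

   ! Minimal sketch: log, atan2 and abs used in device code on double-precision data.
   attributes(global) subroutine math_test( res, x, y, n )
      real(8), device :: res(n), x(n), y(n)
      integer, value  :: n
      integer :: i
      i = threadIdx%x
      if (i <= n) res(i) = log(abs(x(i))) + atan2(y(i), x(i))
   end subroutine math_test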

What is the return code from your kernel?

call CoefficientDevice<<<1 ,dim3(20,20,1)>>>( adev, ndev, xdev, ydev, bcdev, nodedev, dnormdev )
print *, cudaGetErrorString(cudaGetLastError())

I did just notice that you have a syncthreads which, because of the if statement, won’t get executed by all threads. This can cause problems. What happens if you comment it out? (It appears to be extraneous.)
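To illustrate what I mean, here is a generic sketch of the hazardous pattern (made-up names, not your kernel):

   ! Hypothetical kernel: the barrier sits inside a branch that not every thread takes.
   attributes(global) subroutine sync_bad( a, n )
      real, device :: a(n)
      integer, value :: n
      real, shared :: tmp(256)
      integer :: i
      i = threadIdx%x
      if (i <= n) then
         tmp(i) = a(i)
         call syncthreads()   ! threads with i > n never reach this barrier
         a(i) = tmp(i) + tmp(max(i-1,1))
      end if
   end subroutine sync_bad

Moving the call to syncthreads() outside the if block (or removing it entirely if the threads don’t actually share data) avoids the hazard.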
  • Mat

Hey Mat

What is the return code from your kernel?

Code:
call CoefficientDevice<<<1 ,dim3(20,20,1)>>>( adev, ndev, xdev, ydev, bcdev, nodedev, dnormdev )
print *, cudaGetErrorString(cudaGetLastError())

I tried it and it printed out ‘no error’. Any ideas?

  • Ahmed

I tried it and it printed out ‘no error’. Any ideas?

Ok, then let’s focus on the copy. How are “a” and “adev” declared? Are they allocatables? If so, how are they allocated?

Minor note: “cudathreadsynchronize” is deprecated. You should use “cudaDeviceSynchronize” instead. They mean the same thing, so I highly doubt it’s causing your error, but it’s better to update the code just in case.
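That is, simply replace the synchronize call:

   istat = cudaDeviceSynchronize()   ! instead of istat = cudathreadsynchronize()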

Hey Mat
Here are the allocatables and allocates:

      program
      use cudafor
      implicit real*8 (a-h,o-z)
....
      allocatable :: a(:,:),u(:),x(:,:),y(:,:),node(:,:),bc(:,:),dnorm(:,:)
....
      allocate (a(n,n),u(n),x(2,n),y(2,n),node(2,n),bc(2,n))
      allocate (dnorm(2,n))
....
	call coefficient(a,n,x,y,bc,node,dnorm)
....
      end program

    subroutine coefficient(a,n,x,y,bc,node,dnorm)
    use cudafor

       dimension a(n,n),x(2,n),y(2,n),bc(2,n),node(2,n),dnorm(2,n),a_coeff(n,n)
 
! a(n,n) is the same as a_coeff(n,n)
! allocatable device arrays

       real, device, allocatable, dimension(:,:) :: adev,xdev,ydev,dnormdev,bcdev
       integer, device, allocatable, dimension(:,:) :: nodedev
       integer, device :: ndev
....
       allocate( adev(n,n),xdev(2,n),ydev(2,n),dnormdev(2,n) )
       allocate( bcdev(2,n),nodedev(2,n) )
....
    end subroutine coefficient

with:
n = 20, and ‘a’ should have 400 elements in it to be computed on the device

Is it ok to have the subroutine ‘coefficient’ and the device subroutine just pasted after the program without being placed in a module? (although I doubt this is the problem)

  • Ahmed

Is it ok to have the subroutine ‘coefficient’ and the device subroutine just pasted after the program without being placed in a module? (although I doubt this is the problem)

That depends. Do you have explicit interfaces? If not, this will be problematic.

Explicit interfaces are required to pass allocatable arrays and to call CUDA Fortran kernels. Interfaces are implicit when the routines are defined in a module, so I recommend using modules. Otherwise you need to provide an explicit interface yourself.
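For example, the overall layout could look something like this skeleton (the names follow your code; the bodies are just placeholders):

   module coefficient_mod
      use cudafor
   contains

      attributes(global) subroutine CoefficientDevice( adev, ndev, xdev, ydev, bcdev, nodedev, dnormdev )
         ! ... your existing kernel declarations and body ...
      end subroutine CoefficientDevice

      subroutine coefficient( a, n, x, y, bc, node, dnorm )
         ! ... your existing host code: allocate the device arrays, copy, launch the kernel ...
      end subroutine coefficient

   end module coefficient_mod

Then a simple “use coefficient_mod” in the main program makes both interfaces explicit.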

  • Mat

Dear Mat

Please have a look at the whole code. I have minimized it so it includes the parts that call the ‘coefficient’ subroutine from a module, and I have added the device code to that module too. I have made it simple for you to read:

https://www.dropbox.com/sh/hcyvs9moi2dbjta/0yAbMSeRP5

I have also attached an input file to go with the executable.

I am lost and this would be of tremendous help.

I thank you dearly.

  • Ahmed

Hi Ahmed,

Let me send you a link where you can upload the file using NVIDIA’s secure FTP. Unfortunately, our IT won’t let us access Dropbox.

  • Mat

Uploaded

  • Ahmed

Great. It looks like you have a number of errors in your program that are unrelated to CUDA Fortran. Adding “implicit none” would have helped here, since a few of the errors were mismatched types and the use of an undefined variable as the size of an allocatable. The one CUDA Fortran error I saw was that you were passing a device scalar by value to a kernel; only host variables should be passed by value.
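For reference, one way to handle the scalar (a sketch using your variable names) is to give the kernel’s dummy argument the value attribute and pass the host integer n directly, rather than declaring and passing the device scalar ndev:

   ! Kernel side (sketch): take the scalar by value.
   attributes(global) subroutine CoefficientDevice( adev, n, xdev, ydev, bcdev, nodedev, dnormdev )
      integer, value :: n
      ! ... rest of the declarations and body unchanged ...
   end subroutine CoefficientDevice

   ! Host side (sketch): pass the host integer n, not the device scalar ndev.
   call CoefficientDevice<<<1, dim3(20,20,1)>>>( adev, n, xdev, ydev, bcdev, nodedev, dnormdev )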

I’ve sent you the changes I made and was able to run the program to completion. There are probably additional issues, but I’ll let you take it from here.

Hope this helps,
Mat

Dear Mat

I really appreciate you going to such lengths with me. I owe you a cup of coffee.

Thank you

  • Ahmed

Dear Mat

I got it to work! You are always helpful, one of a kind.
It is 3:40 am here and I am extremely happy with my working code, thanks to you!

  • Ahmed

That’s great, Ahmed. Now get some sleep!

  • Mat