The output is wrong! It seems the GPU doesn't do the work

Hi everybody

I am taking my first steps with CUDA Fortran, but I have run into a problem with the following simple code, incTest.cuf:

module simpleOps_m
contains
  attributes(global) subroutine inc(a,b)
    implicit none
    integer :: a(:)
    integer, value :: b
    integer :: i
    ! One thread per element; this indexing is only valid for a single-block launch
    i = threadIdx%x
    a(i) = a(i) + b

  end subroutine inc
end module simpleOps_m

program incTest
  use cudafor
  use simpleOps_m
  implicit none
  integer :: b, n=256
  integer, allocatable :: a(:)
  integer, allocatable, device :: a_d(:)

  allocate (a(n), a_d(n))
  a=1
  b=3
  
  a_d=a
  call inc<<<1,n>>>(a_d,b)
  a=a_d
 
  if (all(a==4)) &
     write(*,*) 'Test Passed'
  deallocate (a,a_d)

end program incTest

I compiled the code with pgf90 -Minfo -o exe incTest.cuf and got the following output:
inctest:
33, all reduction inlined

When I ran it, however, I didn’t get the message ‘Test Passed’. Before that, I tested an even simpler code, copydat.cuf, and it worked:

program copyData
  use cudafor
  implicit none
  integer, parameter :: n=256
  real :: a(n), b(n)
  real, device :: a_d(n),b_d(n)

  a=1.0
  a_d=a
  b_d=a_d
  b= b_d

  if (all(a==b)) &
    write(*,*) 'Test passed'
end program copyData

What test can I run to find out what the problem is?

Thanks for your time

Hi oscar_ml,

Most likely your kernel is failing for some reason. CUDA won’t throw an error on its own, so it’s best to add error checking after each kernel launch (see below).

What device are you using? I suspect that your device might be newer or older than the default targets for your compiler version. For example, with PGI 18.4 we target cc35, cc50, and cc60, since we default to using CUDA 8.0. Later compiler versions also add cc70 and use CUDA 9.0.

You may need to set the target compute capability for your device, such as “-Mcuda=cc70” for a V100.

Note that in our 18.10 release, we’ll default to targeting the device that’s detected on the compiling system.
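
If you are not sure which compute capability your card reports, a small query program along the following lines can print it. This is only a minimal sketch: the program name queryDevice and the variable names are made up for illustration, and it relies on cudaGetDeviceCount and cudaGetDeviceProperties from the cudafor module. Since it launches no kernels, it should run regardless of which -Mcuda=cc target it was built with.

```fortran
program queryDevice
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, nDevices, i

  ! Ask the runtime how many CUDA devices are visible
  istat = cudaGetDeviceCount(nDevices)
  if (istat /= cudaSuccess) then
    print *, 'Error: ', trim(cudaGetErrorString(istat))
    stop 1
  end if

  ! CUDA device numbers start at 0
  do i = 0, nDevices-1
    istat = cudaGetDeviceProperties(prop, i)
    if (istat /= cudaSuccess) cycle
    write(*,'(a,i0,2a)') 'Device ', i, ': ', trim(prop%name)
    write(*,'(a,i0,a,i0)') '  compute capability: ', prop%major, '.', prop%minor
  end do
end program queryDevice
```

A reported compute capability of 7.0 would correspond to -Mcuda=cc70, 3.5 to -Mcuda=cc35, and so on.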

% cat test.cuf
module simpleOps_m
contains
  attributes(global) subroutine inc(a,b)
    implicit none
    integer :: a(:)
    integer, value :: b
    integer :: i
    ! One thread per element; this indexing is only valid for a single-block launch
    i = threadIdx%x
    a(i) = a(i) + b

  end subroutine inc
end module simpleOps_m

program incTest
  use cudafor
  use simpleOps_m
  implicit none
  integer :: b, n=256, rc
  integer, allocatable :: a(:)
  integer, allocatable, device :: a_d(:)

  allocate (a(n), a_d(n))
  a=1
  b=3

  a_d=a
  call inc<<<1,n>>>(a_d,b)
  rc = cudaGetLastError()
  if (rc.ne.0) then
   print *, "Error: ", cudaGetErrorString(rc)
  endif
  a=a_d

  if (all(a==4)) &
     write(*,*) 'Test Passed'
  deallocate (a,a_d)

end program incTest
% pgf90 test.cuf -V18.4 -Mcuda=cc35
% a.out
 Error:
 invalid device function
% pgf90 test.cuf -V18.4 -Mcuda=cc70
% a.out
 Test Passed
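
Note that cudaGetLastError right after the launch reports launch failures such as the one above, while errors that occur while the kernel is running only show up once the host synchronizes with the device. A minimal sketch of a reusable check that covers both cases might look like this; the module name cudaCheck_m and the routine checkKernel are hypothetical names chosen for illustration, not part of cudafor.

```fortran
module cudaCheck_m
  use cudafor
  implicit none
contains
  ! Call right after a kernel launch, e.g.
  !   call inc<<<1,n>>>(a_d,b)
  !   call checkKernel('inc')
  subroutine checkKernel(label)
    character(*), intent(in) :: label
    integer :: istat

    ! Launch-time errors (for example "invalid device function")
    istat = cudaGetLastError()
    if (istat /= cudaSuccess) then
      print *, trim(label), ' launch error: ', trim(cudaGetErrorString(istat))
      stop 1
    end if

    ! Wait for the kernel to finish so run-time errors are reported too
    istat = cudaDeviceSynchronize()
    if (istat /= cudaSuccess) then
      print *, trim(label), ' execution error: ', trim(cudaGetErrorString(istat))
      stop 1
    end if
  end subroutine checkKernel
end module cudaCheck_m
```

The synchronization makes every launch blocking, so a check like this is probably best kept for debugging builds.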

Hope this helps,
Mat

Hi Mat,

Many thanks for your reply. You were right. I verified my compiler version

% pgf90 --version
pgf90 18.4-0 64-bit target on x86-64 Linux -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.

and my device, which is a Quadro K600. Then I compiled the code with the flag -Mcuda=cc30 and got the right answer:

% pgf90 test.cuf -V18.4 -Mcuda=cc30
% a.out
 Test Passed

It was a silly mistake. Next time I will come with a more challenging question ;)

BTW: Do you have a link to a heat example using CUDA Fortran? It would be very useful for writing my app.

We do have a Laplace example as part of our OpenACC training, but not one for CUDA Fortran.

In Chapter 8 of the example sources for Parallel Programming with OpenACC, there are examples of a 2D heat program in both OpenACC and CUDA C.

They might at least show the algorithm and code structure, which you could then translate into CUDA Fortran.
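
As a rough starting point, here is a minimal sketch of what a Jacobi-style 2D heat iteration might look like in CUDA Fortran. It is not taken from the book or from the OpenACC training material; the names (heat_m, jacobiStep, nx, ny, nsteps), the grid size, the 16x16 thread block, and the fixed hot edge are all assumptions made for illustration.

```fortran
module heat_m
contains
  ! One Jacobi sweep: each interior point becomes the average of its
  ! four neighbours. Boundary points are left untouched.
  attributes(global) subroutine jacobiStep(tNew, tOld, nx, ny)
    implicit none
    integer, value :: nx, ny
    real :: tNew(nx,ny)
    real, intent(in) :: tOld(nx,ny)
    integer :: i, j

    ! Global 2D index of this thread (Fortran indices start at 1)
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y

    if (i > 1 .and. i < nx .and. j > 1 .and. j < ny) then
      tNew(i,j) = 0.25*(tOld(i-1,j) + tOld(i+1,j) + tOld(i,j-1) + tOld(i,j+1))
    end if
  end subroutine jacobiStep
end module heat_m

program heat2d
  use cudafor
  use heat_m
  implicit none
  integer, parameter :: nx = 64, ny = 64, nsteps = 100   ! assumed sizes
  real :: t(nx,ny)
  real, device :: tOld_d(nx,ny), tNew_d(nx,ny)
  type(dim3) :: grid, tBlock
  integer :: step, istat

  t = 0.0
  t(1,:) = 100.0        ! assumed boundary condition: one hot edge
  tOld_d = t            ! both buffers start with the same boundary values
  tNew_d = t

  tBlock = dim3(16,16,1)
  grid   = dim3((nx+15)/16, (ny+15)/16, 1)

  ! Ping-pong between the two buffers instead of copying each step
  do step = 1, nsteps
    if (mod(step,2) == 1) then
      call jacobiStep<<<grid,tBlock>>>(tNew_d, tOld_d, nx, ny)
    else
      call jacobiStep<<<grid,tBlock>>>(tOld_d, tNew_d, nx, ny)
    end if
  end do

  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'Launch error: ', trim(cudaGetErrorString(istat))
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'Execution error: ', trim(cudaGetErrorString(istat))

  t = tOld_d            ! nsteps is even, so the last sweep wrote tOld_d
  print *, 'T(2,ny/2) = ', t(2,ny/2)
end program heat2d
```

A real heat solver would normally add a convergence test and a physically meaningful time step; the sketch only shows the kernel indexing and the two-buffer structure.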