CUDA Fortran module data

I’m still having problems with device-resident module data.
OK, it’s late Friday afternoon. Am I doing something wrong here…

With 10.4.0 this fails:

pgfortran -DFAILS -g -r8 -Mextend -Mcuda=cc13,keepptx -c isolate.F
/tmp/pgcudaforLYthVwCYC_fd.gpu(27): error: identifier “_anymodule_17” is undefined

1 error detected in the compilation of “/tmp/pgnvd2buhIG25C5l2.nv0”.
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (isolate.F: 66)
PGF90/x86-64 Linux 10.4-0: compilation aborted

Thanks in advance, Sarah

      module anymodule

      use cudafor

      integer, parameter :: Ncol=3, NS=512, Nx2=2, Nyzt=128, ND=4

#ifdef FAILS
      integer, device, allocatable, dimension(:) :: Leo
#endif

      end module anymodule

! -------------------------------------------------------------------------------
#ifdef FAILS
      attributes(global) SUBROUTINE mult(V2x,U_eo1)
#else
      attributes(global) SUBROUTINE mult(V2x,U_eo1,Leo )
#endif

      use anymodule
      complex*16, device, dimension(Ncol,Ncol,NS,2) :: U_eo1
      complex*16 Vt2_1, Vt2_2

#ifndef FAILS
      integer, device, dimension(Nyzt) :: Leo
#endif

C------------Variables--------------------------------------
      complex*16, device, dimension(Ncol,ND,Nx2,Nyzt)  :: V2x

      integer tidy
!
      ic = threadidx%x          ! 1..3
      tidy = threadidx%y
      iyzt = threadidx%y + (blockidx%x-1) * blockdim%y  ! 1..Nyzt

      ieo = 1

       if ( iyzt <= Nyzt ) then
!      DO 1000 iyzt = 1, Nyzt

       nn = ( 1-Leo(iyzt) )*(Nx2-1) + 1

       iv = Nx2 + (iyzt-1)*Nx2

!        DO 1200 ic = 1, Ncol
         Vt2_1 = 
     &                U_eo1(1,ic,iv,ieo)*2.0
     &              + U_eo1(2,ic,iv,ieo)*3.0
     &              + U_eo1(3,ic,iv,ieo)*4.0
         Vt2_2 =
     &                U_eo1(1,ic,iv,ieo)*0.5
     &              + U_eo1(2,ic,iv,ieo)*1.5
     &              + U_eo1(3,ic,iv,ieo)*3.5

         V2x(ic,1,Nx2,iyzt)= + Vt2_1
         V2x(ic,2,Nx2,iyzt)= + Vt2_2
         V2x(ic,3,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_2),-DBLE(Vt2_2))
         V2x(ic,4,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_1),-DBLE(Vt2_1))
! 1200   CONTINUE
C
! 1000 CONTINUE
      endif

      return
      end subroutine mult

Hi Sarah,

Device code can only access a module’s device data if it’s contained in the same module. So in this case, you just need to make mult a contained subroutine within “anymodule”.

      module anymodule

      use cudafor

      integer, parameter :: Ncol=3, NS=512, Nx2=2, Nyzt=128, ND=4
      integer, device, allocatable, dimension(:) :: Leo

      contains

! -------------------------------------------------------------------------------
      attributes(global) SUBROUTINE mult(V2x,U_eo1)

      complex*16, device, dimension(Ncol,Ncol,NS,2) :: U_eo1
      complex*16 Vt2_1, Vt2_2

C------------Variables--------------------------------------
      complex*16, device, dimension(Ncol,ND,Nx2,Nyzt)  :: V2x

      integer tidy
!
      ic = threadidx%x          ! 1..3
      tidy = threadidx%y
      iyzt = threadidx%y + (blockidx%x-1) * blockdim%y  ! 1..Nyzt

      ieo = 1

       if ( iyzt <= Nyzt ) then
!      DO 1000 iyzt = 1, Nyzt

       nn = ( 1-Leo(iyzt) )*(Nx2-1) + 1

       iv = Nx2 + (iyzt-1)*Nx2

!        DO 1200 ic = 1, Ncol
         Vt2_1 =
     &                U_eo1(1,ic,iv,ieo)*2.0
     &              + U_eo1(2,ic,iv,ieo)*3.0
     &              + U_eo1(3,ic,iv,ieo)*4.0
         Vt2_2 =
     &                U_eo1(1,ic,iv,ieo)*0.5
     &              + U_eo1(2,ic,iv,ieo)*1.5
     &              + U_eo1(3,ic,iv,ieo)*3.5

         V2x(ic,1,Nx2,iyzt)= + Vt2_1
         V2x(ic,2,Nx2,iyzt)= + Vt2_2
         V2x(ic,3,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_2),-DBLE(Vt2_2))
         V2x(ic,4,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_1),-DBLE(Vt2_1))
! 1200   CONTINUE
C
! 1000 CONTINUE
      endif

      return
      end subroutine mult

      end module anymodule
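
For completeness, here is roughly what a host-side driver might look like (a sketch only: the launch shape and the Leo/U_eo1 initialization below are assumptions, not taken from your code):

      program testmult
      use anymodule
      complex*16, device :: V2x(Ncol,ND,Nx2,Nyzt)
      complex*16, device :: U_eo1(Ncol,Ncol,NS,2)
      integer :: Leo_h(Nyzt)
      type(dim3) :: grid, tBlock

! Allocate the device-resident module array, then copy host data in
      allocate( Leo(Nyzt) )
      Leo_h = 0
      Leo = Leo_h

! U_eo1 is left uninitialized here; a real driver would fill it.
! One block of Ncol x Nyzt threads covers ic=1..3 and iyzt=1..128
      tBlock = dim3( Ncol, Nyzt, 1 )
      grid = dim3( 1, 1, 1 )
      call mult<<<grid,tBlock>>>( V2x, U_eo1 )

      deallocate( Leo )
      end program testmult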

Hope this helps,
Mat

Device code can only access a module’s device data if it’s contained in the same module. So in this case, you just need to make mult a contained subroutine within “anymodule”

It seems to be more than that. Even with NO device data, it seems that a global (kernel) routine must be in a module.

Is this the case? It is not documented as far as I can see. CUDA Fortran sec. 3.1.4 does state that a device & host routine must be in a module.

Putting the global routines in a module will unfortunately enforce parameter type checking, and I am having problems porting (for performance) a Fortran 77-style code that aliases arrays as formal parameters.

For example, deleting the module/end-module definition and the use statements from the example code below gives the same sort of unsatisfied-external message:

/tmp/pgfortran_tOg63owqMhb.o: In function `mmul_':
/home/users/saraha/./matmul.CUF:121: undefined reference to `mmul_kernel_'

! start the module containing the matrix multiply kernel
!module mmul_mod
!    use cudafor
!    contains

! mmul_kernel computes A*B into C where A is NxM, B is MxL, C is then NxL

    attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
       use cudafor
       real,device :: A(N,M), B(M,L), C(N,L)
       integer, value :: N, M, L
       integer :: i, j, kb, k, tx, ty

! submatrices are declared to be in CUDA shared memory

       real, shared :: Asub(16,16), Bsub(16,16)

! the value of C(i,j) being computed, a temporary scalar

       real :: Cij

! Start execution, first get my thread indices

       tx = threadidx%x
       ty = threadidx%y

! This thread computes C(i,j) = sum(A(i,:) * B(:,j))

       i = (blockidx%x-1) * 16 + tx
       j = (blockidx%y-1) * 16 + ty

       Cij = 0.0

! Do the k loop in chunks of 16, the block size

       do kb = 1, M, 16

! Fill the submatrices; each of 16x16 threads in the thread block
! loads one element of Asub and Bsub

          Asub(tx,ty) = A(i,kb+ty-1)
          Bsub(tx,ty) = B(kb+tx-1,j)

! Wait until all elements are filled

          call syncthreads()

! Multiply the two submatrices; ! Each of the 16x16 threads accumulates the
! dot product for its element of C(i,j)

          do k = 1,16
             Cij = Cij + Asub(tx,k) * Bsub(k,ty)
          enddo

! Synchronize to make sure all threads are done reading the submatrices before 
! overwriting them in the next iteration of the kb loop

          call syncthreads()

       enddo

! Each of the 16x16 threads stores its element to the global C array

       C(i,j) = Cij

    end subroutine mmul_kernel


! The host routine to drive the matrix multiplication

    subroutine mmul( A, B, C )
       use cudafor

! assumed shape input arrays

       real, dimension(:,:) :: A, B, C

! Array dimensions

       integer :: N, M, L

! allocatable device arrays

       real, device, allocatable, dimension(:,:) :: Adev,Bdev,Cdev

! dim3 variables to define the grid and block shapes

       type(dim3) :: dimGrid, dimBlock
       integer :: r

! Get the array sizes

       real ctimeall, ctimekernel, flops, mflopskernel, mflopsall
       integer c1, c2, c3, c4

! Begin execution, first determine the sizes of the input arrays

       N = size( A, 1 )
       M = size( A, 2 )
       L = size( B, 2 )

! Start data xfer-inclusive timer and allocate the device arrays using 
! F90 ALLOCATE

       call system_clock( count=c1 )
       allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )

! Copy A and B to the device using F90 array assignments

       Adev = A(1:N,1:M)
       Bdev = B(1:M,1:L)

! Create the grid and block dimensions

       dimGrid = dim3( N/16, L/16, 1 )
       dimBlock = dim3( 16, 16, 1 )

! Start data xfer-exclusive timer, launch the GPU kernel, wait for completion

       call system_clock( count=c2 )
       call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
       r = cudathreadsynchronize()

! Stop data xfer-exclusive timer, copy the results back, stop data xfer-
! inclusive timer

       call system_clock( count=c3 )
       C(1:N,1:L) = Cdev
       call system_clock( count=c4 )

! Calculate inclusive/exclusive execution times, and report MFLOPS

       flops = float(N) * float(M) * float(L)
       ctimekernel = c3 - c2
       mflopskernel = flops / ctimekernel
       ctimeall = c4 - c1
       mflopsall = flops / ctimeall

!  Print out results

       print *, 'Kernel time excluding data xfer:', ctimekernel, ' microseconds'
       print *, 'Megaflops excluding data xfer:  ', mflopskernel
       print *, 'Total time including data xfer: ', ctimeall, ' microseconds' 
       print *, 'Megaflops including data xfer:  ', mflopsall

! Deallocate device arrays and exit

       deallocate( Adev, Bdev, Cdev )

    end subroutine mmul
!end module mmul_mod

! Main program to initialize arrays, invoke mmul, check results

program matmul
!   use mmul_mod
   use cudafor
   real,dimension(:,:),allocatable :: A,B,C,CC
   integer N, M, L
   integer idevice, istat

! Begin execution

   N = 512
   M = 1024
   L = 512
   idevice = 0
   print *,' arrays sized ', N, ' by ', M, ' by ', L
   allocate(A(N,M),B(M,L),C(N,L),CC(N,L))

! Initialize the A and B arrays;  zero out the C array to be computed
! on the GPU, and the CC array to be computed on the host

   do j = 1,M
      do i = 1,N
         A(i,j) = i*10 + j*1000
      enddo
   enddo
   do j = 1,L
      do i = 1,M
         B(i,j) = i-j
      enddo
   enddo
   do j = 1,L
      do i = 1,N
         CC(i,j) = 0.0
         C(i,j) = 0.0
      enddo
   enddo

! Initialize the GPU device

   istat = cudaSetDevice(idevice)

! Call matrix multiply subroutine to execute on the GPU to compute C

   print *,'calling mmul'
   call mmul( A, B, C )
   print *,' C(1,1) = ', C(1,1)
   print *,' C(2,2) = ', C(2,2)

! Perform matrix multiply on host to compute CC

   do i = 1,N
      do j = 1,L
         do k = 1,M
            CC(i,j) = CC(i,j) + A(i,k)*B(k,j)
         enddo
      enddo
   enddo

! Check for errors

   ierr = 0
   do j = 1,L
      do i = 1,N
         diff = abs(C(i,j) - CC(i,j))
         denom = CC(i,j)
         if ( denom == 0.0 ) denom = 1.0
         error = diff / denom
         if ( error > 2.0e-5 ) then
            ierr = ierr + 1
            if ( ierr <= 10 ) then
               print *, 'C(',i,',',j,') = ',C(i,j), ' should be ', CC(i,j), ' error=', error
            endif
         endif
      enddo
   enddo

   if( ierr == 0 )then
      print *, ' No errors found'
   else
      print *, ierr, ' ERRORS FOUND!!!'
   endif

end program

To answer my own question…

Assuming it’s true that kernel (“global”) subroutines must be in a module, the only way to do the array aliasing is to use the approach suggested earlier.

For instance,

! Host code: A is declared with its natural shape
complex, device :: A(2,100)
call subr(A)
...

module cudastuff
contains
attributes(global) subroutine subr_kernel(A)
complex, device :: A(200)
...
end subroutine subr_kernel
end module cudastuff

! Host wrapper: re-declares the device array with the flattened shape
subroutine subr(A)
use cudastuff
complex, device :: A(200)
! nblock/nthread: grid and block sizes, assumed defined elsewhere
call subr_kernel<<<nblock,nthread>>>(A)
return
end subroutine subr
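
The reshaping happens at the call to subr, which has an implicit interface: ordinary Fortran sequence association reinterprets the complex A(2,100) actual argument as the 200-element vector subr expects. The kernel launch inside subr then matches the module interface exactly, so the type checking enforced by the module is satisfied.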

I know I’m bumping an old thread, but in case people don’t read the PGI CUDA Fortran user guide closely enough, subprograms with the “global” or “device” attribute are all considered “device” subprograms. Thus, a “global” subroutine must be contained in a module, since it is a device subprogram.

This was not the point of my comments, and I’m aware of the limitation on subprograms and modules. Documenting a limitation does not make it a feature.

See Account Login | PGI for encouraging news on mitigating this problem.


Hi sseyler,

I know I’m bumping an old thread, but in case people don’t read the PGI CUDA Fortran user guide closely enough, subprograms with the “global” or “device” attribute are all considered “device” subprograms. Thus, a “global” subroutine must be contained in a module, since it is a device subprogram.

This is not exactly correct, in that “global” routines do not have to be contained within a module. Only when calling a “device” routine do the “global” and “device” routines have to reside in the same module.

However, in order to call a “global” routine from host code, the “global” routine is required to have an explicit interface. Hence it is recommended to use modules, since modules implicitly define interfaces.
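
To illustrate (a minimal sketch with made-up names, not from the code in this thread): the kernel below lives outside any module, and the interface block in the main program supplies the explicit interface that a module would otherwise provide.

! Kernel defined outside any module
attributes(global) subroutine scale_kernel( x, n )
   use cudafor
   real, device :: x(n)
   integer, value :: n
   integer :: i
   i = (blockidx%x-1) * blockdim%x + threadidx%x
   if ( i <= n ) x(i) = 2.0 * x(i)
end subroutine scale_kernel

program scale_test
   use cudafor

! Explicit interface for the external kernel
   interface
      attributes(global) subroutine scale_kernel( x, n )
         real, device :: x(n)
         integer, value :: n
      end subroutine scale_kernel
   end interface

   real, device :: xd(256)
   real :: xh(256)

   xh = 1.0
   xd = xh
   call scale_kernel<<<1,256>>>( xd, 256 )
   xh = xd
   print *, xh(1)   ! expect 2.0
end program scale_test

With the interface in scope, the chevron launch compiles from host code; without it, the compiler has no way of knowing that scale_kernel is a “global” routine.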

See Account Login | PGI for encouraging news on mitigating this problem.

Yes, this will be very nice. Allowing access to device data in external modules was our most requested feature.

- Mat