Modified version of the CUFFT example

I’ve been looking at modifying the CUFFT example in /local/software/pgi/11.9/linux86-64/2011/cuda/cudaFortranSDK/. Whilst I can get the original version to compile with no issues I am getting the following complaints at compilation time for a slightly modified version:

[kaw2e11@UOS-205126 isolated_modified_cufft]$ make
pgf90  -fast  -c precision_m.cuf
pgf90  -fast  -c cufft_m.cuf
pgf90 -Mcuda=3.2  -fast  -c fourier_gpu_m.F90
pgf90 -Mcuda=3.2  -fast  -o cufftTest cufftTest.F90 fourier_gpu_m.o precision_m.o -lcufft 
cufftTest.F90:
fourier_gpu_m.o: In function `fourier_gpu_m_fourier_gpu_':
/home/kaw2e11/PROJECTS/GPU/cudaFortranSDK/isolated_modified_cufft/./fourier_gpu_m.F90:18: undefined reference to `cufftexec_'
/home/kaw2e11/PROJECTS/GPU/cudaFortranSDK/isolated_modified_cufft/./fourier_gpu_m.F90:18: undefined reference to `cufftexec_'
fourier_gpu_m.o: In function `.C2_283':
fourier_gpu_m.F90:(.data+0x14): undefined reference to `cufft_m_'
make: *** [cufftTest] Error 2



program cufftTest
  use precision_m
  use fourier_gpu_m

  implicit none

  complex(fp_kind), allocatable :: a(:),b(:)
  integer i
  integer :: n=8

  ! allocate arrays on the host
  allocate(a(n), b(n))

  !initialize arrays on host
  do i = 1, n
    a(i) = cmplx(cos((i-1) * atan2(0.0,-1.0) / n), 0.0)
  end do

  ! Print initial array
  print *, "Array A:"
  write (*,"(8('(',f6.3,',',f6.3,')',1x))") a

  call fourier_gpu(n,a,b)

  ! Copy results back to host
  print *, "Inverse B"
  write (*,"(8('(',f6.3,',',f6.3,')',1x))") b

  ! Scale
  b = b / n
  print *, "Scaled B"
  write (*,"(8('(',f6.3,',',f6.3,')',1x))") b

  !release memory on the host
  deallocate(a, b)

end program cufftTest



module fourier_gpu_m

public fourier_gpu

contains 

subroutine fourier_gpu(n, a, b)
  use precision_m
  use cufft_m

  implicit none
  integer n
  complex(fp_kind) :: a(n),b(n)
  complex(fp_kind), device, allocatable :: a_d(:), b_d(:)
  integer :: plan, planType

  ! Allocate arrays on the device
  allocate(a_d(n), b_d(n))

  ! Copy a array to device
  a_d = a

  ! Set planType to either single or double precision
  if (fp_kind == singlePrecision) then
     planType = CUFFT_C2C
  else
     planType = CUFFT_Z2Z
  endif

  ! Initialize the plan and execute the FFTs.

  call cufftPlan1D(plan,n,planType,1)
  call cufftExec(plan,planType,a_d,b_d,CUFFT_FORWARD)
  call cufftExec(plan,planType,b_d,b_d,CUFFT_INVERSE)

  ! Copy results back to host
  b = b_d

  ! Release memory on the device
  deallocate(a_d, b_d)

  ! Destroy the plan
  call cufftDestroy(plan)

end subroutine fourier_gpu

end module fourier_gpu_m



# All these examples can run with various pgfortran options.  -fast is fine.
F90FLAGS = -fast 

OBJS = cufftTest
all: $(OBJS)

# cufftTest
cufftTest: cufftTest.F90 fourier_gpu_m.o precision_m.o
        pgf90 -Mcuda=3.2 $(CUDAFLAGS) $(F90FLAGS) -o $@ $^ -lcufft 

fourier_gpu_m.o: fourier_gpu_m.F90 cufft_m.o precision_m.o 
        pgf90 -Mcuda=3.2 $(CUDAFLAGS) $(F90FLAGS) -c $<

cufft_m.o: cufft_m.cuf precision_m.o
        pgf90 $(CUDAFLAGS) $(F90FLAGS) -c $<

# auxiliary modules
precision_m.o: precision_m.cuf
        pgf90 $(CUDAFLAGS) $(F90FLAGS) -c $<

#Clean up
clean:
        rm -rf *.o *.mod $(OBJS) *~

Any suggestions?

Cheers,

Karl

Hi KarlW,

You just need to add the cufft_m object to the link.

# All these examples can run with various pgfortran options.  -fast is fine.
F90FLAGS = -fast

OBJS = cufftTest
all: $(OBJS)

# cufftTest
cufftTest: cufftTest.F90 fourier_gpu_m.o precision_m.o cufft_m.o
        pgf90 -Mcuda=3.2 $(CUDAFLAGS) $(F90FLAGS) -o $@ $^ -lcufft

fourier_gpu_m.o: fourier_gpu_m.F90 cufft_m.o precision_m.o
        pgf90 -Mcuda=3.2 $(CUDAFLAGS) $(F90FLAGS) -c $<

cufft_m.o: cufft_m.cuf precision_m.o
        pgf90 $(CUDAFLAGS) $(F90FLAGS) -c $<

# auxiliary modules
precision_m.o: precision_m.cuf
        pgf90 $(CUDAFLAGS) $(F90FLAGS) -c $<

#Clean up
clean:
        rm -rf *.o *.mod $(OBJS) *~



% pgf90 -Mcuda  -fast -o cufftTest cufftTest.F90 fourier_gpu_m.o precision_m.o  -lcufft
cufftTest.F90:
fourier_gpu_m.o: In function `fourier_gpu_m_fourier_gpu_':
/home/colgrove/support/karlw/2_16_12/./fourier_gpu_m.F90:18: undefined reference to `cufftexec_'
/home/colgrove/support/karlw/2_16_12/./fourier_gpu_m.F90:18: undefined reference to `cufftexec_'
fourier_gpu_m.o: In function `.C2_284':
fourier_gpu_m.F90:(.data+0x20): undefined reference to `cufft_m_'
% pgf90 -Mcuda -fast -o cufftTest cufftTest.F90 fourier_gpu_m.o precision_m.o cufft_m.o -lcufft
cufftTest.F90:
% ./cufftTest
 Array A:
( 1.000, 0.000) ( 0.924, 0.000) ( 0.707, 0.000) ( 0.383, 0.000) ( 0.000, 0.000) (-0.383, 0.000) (-0.707, 0.000) (-0.924, 0.000)
 Inverse B
( 8.000, 0.000) ( 7.391, 0.000) ( 5.657, 0.000) ( 3.061, 0.000) ( 0.000, 0.000) (-3.061, 0.000) (-5.657, 0.000) (-7.391, 0.000)
 Scaled B
( 1.000, 0.000) ( 0.924, 0.000) ( 0.707, 0.000) ( 0.383, 0.000) ( 0.000, 0.000) (-0.383, 0.000) (-0.707, 0.000) (-0.924, 0.000)

Hope this helps,
Mat

Yes, that did the trick, thank you!

I was wondering if Accelerator compiler code equivalent to the wrapper code within the CUDA Fortran SDK example (cufft_m.cuf) might be available anywhere?

Equivalent code for OpenACC would also be great but I know that is still only in its early stages…

Cheers,

Karl

Hi KarlW,

If you don’t mind having a CUDA Fortran device allocatable array, you can use the cufft_m.cuf example to handle CUFFT interface and then use the device array in an accelerator region. The PGI Accelerator model/OpenACC and CUDA Fortran are interoperable.

  • Mat

Hi Mat,

That is what I am doing at the moment. Whilst I know there is no functional issue with doing things this way, I was hoping to minimize the number of file types to fit in with the programming guidelines for the software I am working on.

For future reference, can you comment on the compatibility of OpenACC with CUFFT at all?

Many thanks,

Karl

Hi KarlW,

Whilst I know there is no functional issue with doing things this way, I was hoping to minimize the number of file types to fit in with the programming guidelines for the software I am working on.

The other method would be to call the host CUFFT routines. This means extra data movement since your accelerator regions wont share memory with CUFFT, but it would make the program a bit more portable.

Personally, I would use conditional compilation (i.e. pre-processor directives) to work around the portability problems. You’ll need to do this already since you have CUFFT, so it shouldn’t much more work to add in the device attribute in places. Of course, I don’t know your application so please do what’s best for your project.

For future reference, can you comment on the compatibility of OpenACC with CUFFT at all?

There is nothing specific in the OpenACC standard, so compatibility with CUDA libraries will be implantation dependent. However, PGI’s OpenACC implementation will be fully compatible with CUDA libraries including CUFFT.

  • Mat

Hi Karl,

Another one of our application engineers had an example call CUBLAS which you should be able to adapt to CUFFT. Note that due to a compiler issue, this example needs to be compiled with PGI version 12.2 or later.

  • Mat
module abc

contains

subroutine gpu_sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
integer :: m, n, k, lda, ldb, ldc
real, dimension(lda,k) :: a
real, dimension(ldb,n) :: b
real, dimension(ldc,n) :: c
real :: alpha, beta
character*1 :: transa, transb
!$acc reflected(a,b,c)

interface

   subroutine Sgemm_gpu(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc) bind(c,name='cublasSgemm')
   integer, value :: m, n, k, lda, ldb, ldc
   real, device, dimension(lda,k) :: a
   real, device, dimension(ldb,n) :: b
   real, device, dimension(ldc,n) :: c
   real, value :: alpha, beta
   character*1, value :: transa, transb
   end subroutine Sgemm_gpu

   subroutine Init_gpu() bind(c,name='cublasInit')
   end subroutine Init_gpu

end interface



call Init_gpu()


call Sgemm_gpu(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)


end subroutine gpu_sgemm

end module abc





program test
use abc
real*4, dimension(100,100) :: a, b, c, e
integer :: m, n, k, lda, ldb, ldc
character*1 :: transa, transb



call random_number(a)
call random_number(b)
c = -9.9e0
e = -9.9e0

m = 100
n = 100
k = 100
lda = 100
ldb = 100
ldc = 100
alpha = 1.0e0
beta = 0.0e0
transa = 'n'
transb = 'n'


call sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)


print *, "Host Results:  "
print *, "Sum(c): ", sum(c)
print *, "----------------------"



!$acc data region copyin(a,b) copyout(e)

call gpu_sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, e, ldc)

!$acc end data region


print *, "Accelerator GPU Results:  "
print *, "Sum(e): ", sum(e)
print *, "----------------------"


end