Linking CUDA Fortran compiled code with ifort

wiersma · May 16, 2013, 9:15pm

Hi all,

Please bear with me if this is obvious - I wouldn’t describe myself as an expert in compilers or with Fortran by any stretch of the imagination.

As part of evaluating whether CUDA would suit my needs, I tried dropping one of the matrix multiplication routines (say, the one here: http://geco.mines.edu/software/pg10/gpu/pgicudaforug.pdf) into my code suite to see what happens. I’m having difficulty compiling, though.

Normally my code is arranged as one module per file, so I would have a file named mod1.f90:

module mod1
! blah blah blah
end module mod1

And then test.f90 containing:

program test
use mod1
end program test

Which I compile using

ifort -c mod1.f90
ifort -c test.f90
ifort -o test mod1.o test.o

Simple, right? Now suppose I have a CUDA Fortran module in a file named, say, cmod1.cuf:

module cmod1
use cudafor
! blah blah blah
end module cmod1

I try compiling:

pgf90 -c cmod1.cuf
ifort -c test.f90
ifort -o test cmod1.o test.o

But I get

test.f90(4): error #7013: This module file was not generated by any release of this compiler. [cmod1]
use cmod1
-------^

when trying to compile test.f90. So ifort doesn’t like pgf90-compiled modules? Do I have to compile everything with pgf90? That would be a problem, because our suite uses some ifort-specific things.

(Sorry if this was incredibly verbose, I just wanted to be clear :)).

MatColgrove · May 17, 2013, 12:19am

Hi wiersma,

When trying to compile test.f90. So ifort doesn’t like pgf90 compiled modules? Do I have to compile everything with pgf90, because our suite uses some ifort specific things.

Several F90 features are not compatible between Fortran compilers. Module files (.mod) are compiler-specific, and so are the descriptors used to pass assumed-shape and allocatable arrays.

What you’ll need to do is put the CUDA Fortran routines into a library and then have the main Fortran program call the routines via an F77-style or C wrapper.
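To illustrate the F77-style wrapper idea, here is a minimal sketch. The wrapper name mmul_wrapper is hypothetical, and the module body is a host-only stand-in (a plain C = A + B) so that the example runs with any Fortran compiler. In a real build, the module and the wrapper would both be compiled with pgf90, and only the calling program with ifort. It assumes both compilers use the same default external naming (lowercase with a trailing underscore on Linux).

```fortran
! Stand-in for the CUDA Fortran module. In the real build this would live in
! cmod1.cuf, be compiled with pgf90, and launch a kernel instead.
module cmod1
contains
    subroutine mmul(A, B, C, N, M)
        implicit none
        integer, intent(in) :: N, M
        real, intent(in)    :: A(N,M), B(N,M)
        real, intent(out)   :: C(N,M)
        C = A + B   ! placeholder for the device computation
    end subroutine mmul
end module cmod1

! F77-style wrapper (hypothetical name): an external subroutine, not a module
! procedure, with explicit-shape arguments only. It is compiled by the same
! compiler as cmod1, so the "use" statement is safe here.
subroutine mmul_wrapper(A, B, C, N, M)
    use cmod1, only: mmul
    implicit none
    integer :: N, M
    real :: A(N,M), B(N,M), C(N,M)
    call mmul(A, B, C, N, M)
end subroutine mmul_wrapper

! Caller side: no "use" statement, so no .mod file from the other compiler
! is needed and no array descriptor crosses the compiler boundary.
program test
    implicit none
    real :: A(10,10), B(10,10), C(10,10)
    external :: mmul_wrapper
    A = 1.0
    B = 2.0
    call mmul_wrapper(A, B, C, 10, 10)
    print *, C(1,1)
end program test
```

The point of the design is that the caller only ever sees a plain external symbol and arrays passed by address with their sizes, which is the part of the calling convention the two compilers are most likely to share. As the rest of this thread shows, the final link step may still need to go through the PGI driver.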

Hope this helps,
Mat

wiersma · May 21, 2013, 8:45pm

Hi all,

Thanks for the reply, but now I’m having difficulty setting up a library.

So now I have a pretty basic example:

cmod1.cuf:

module cmod1
use cudafor

contains

subroutine mmul( A, B, C )
    real, dimension(:,:) :: A, B, C
end subroutine mmul
end module cmod1

test.f90:

program test

    implicit none
    real  :: A(10,10), B(10,10), C(10,10)
    A = 0.
    B = 0.
    C = 0.
    call mmul(A,B,C)
	
end program test

The commands:

pgf90 -c cmod1.cuf
ar rcvs libcmod1.a cmod1.o
ifort -c test.f90
ifort test.o -L. -lcmod1 -o test

give:

test.o: In function `MAIN__':
test.f90:(.text+0x750): undefined reference to `mmul_'

Sorry if this is basic library stuff - I’ve tried multiple configurations, but can’t seem to figure this out.

MatColgrove · May 22, 2013, 6:49pm

Hi wiersma,

So I had never actually tried calling CUDA Fortran code from Intel-compiled code. It turned out to be a bit trickier than I thought. We do some magic during the initialization of the program that’s necessary to get the device working properly. Hence, you have to link with the PGI driver to use CUDA Fortran. Here are the steps I took:

% cat prog.f90 
! just call the main test program
program testp
   call test() 
end program testp

% cat test.f90 
! change the Intel main program to a subroutine
subroutine test
    implicit none
    real  :: A(10,10), B(10,10), C(10,10)
    A = 10.1
    B = 0.11
    C = 0.
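    ! pgf90 exports the module procedure mmul in module cmod1 under the
    ! module-qualified name, so call it as cmod1_mmul (no "use cmod1" here)
    call cmod1_mmul(A,B,C,10,10)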
    print *, A(1,2), B(2,4), C(3,6)
   
end subroutine test 

% cat cmod1.cuf 
! Here's a basic CUDA Fortran kernel

module cmod1
use cudafor

contains

attributes(global) subroutine kernel( A, B, C, N, M )
 real, device :: A(N,M), B(N,M), C(N,M)
 integer, value :: N, M
 integer :: i, j, tx, ty
 ! Get the thread indices
 tx = threadidx%x
 ty = threadidx%y
 i = (blockidx%x-1) * blockdim%x + tx
 j = (blockidx%y-1) * blockdim%y + ty
 if (i .le. N .and. j .le. M) then
   C(i,j) = A(i,j)+ B(i,j)
 endif
end subroutine kernel

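! Pass the arrays as explicit-shape (or assumed-size) arguments, not
! assumed-shape: array descriptors are not compatible between compilers
subroutine mmul( A, B, C, N, M)
    implicit none
    integer :: N, M, ierr
    real, dimension(N,M) :: A, B, C
    real, device, allocatable, dimension(:,:) :: Ad, Bd, Cd
    type(dim3) :: dimGrid, dimBlock

    allocate(Ad(N,M), Bd(N,M), Cd(N,M))
    Ad=A
    Bd=B
    ! round up so the grid covers the whole array (N/16 would be 0 for N=10)
    dimGrid = dim3( (N+15)/16, (M+15)/16, 1 )
    dimBlock = dim3( 16, 16, 1 )
    call kernel<<<dimGrid,dimBlock>>>( Ad, Bd, Cd, N, M )
    ! launch status check; an assumed reconstruction, since the post shows
    ! an N/M/ierr line in the output but not the statement that printed it
    ierr = cudaGetLastError()
    print *, 'N=', N, ' M=', M, ' ierr=', ierr
    C=Cd
    deallocate(Ad, Bd, Cd)

end subroutine mmul
end module cmod1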

% pgf90 -c cmod1.cuf 
% ifort -c test.f90 
% pgf90 -c prog.f90 
% pgf90 -Mcuda prog.o test.o cmod1.o -L/opt/intel/composerxe-2013.3.163/compiler/lib/intel64/ -lifport -lifcore -limf
% a.out
 N=           10  M=           10  ierr=            0
   10.10000      0.1100000       10.21000

You can also wrap up “cmod1.o” into a library.
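One detail worth checking in the launch configuration: the bounds test in the kernel (i .le. N .and. j .le. M) only helps if the grid covers the whole array, so the number of blocks per dimension has to be rounded up. With truncating integer division, a 10x10 problem with 16x16 blocks gives zero blocks in each dimension. A minimal host-only sketch of the arithmetic follows; the program and variable names are illustrative only.

```fortran
program grid_size
    implicit none
    integer, parameter :: tile = 16      ! threads per block in one dimension
    integer :: n, truncated, rounded_up

    n = 10
    truncated  = n / tile                ! integer division truncates toward zero
    rounded_up = (n + tile - 1) / tile   ! smallest block count with blocks*tile >= n

    print *, 'truncated =', truncated, ' rounded up =', rounded_up
end program grid_size
```

For n = 10 the truncated count is 0 and the rounded-up count is 1. The rounded-up form launches a few idle threads in the last block, which the kernel's bounds test already skips.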

Mat

wiersma · May 27, 2013, 9:09pm

Works great! Thanks a bunch and I’m glad to know I wasn’t missing something completely obvious :).