Runtime error when using MPI, CUDA Fortran and CULA together

Might this be a pgfortran bug? Could anyone help me?

I ran into a problem when trying to use MPI, CUDA Fortran and the CULA library together on multiple GPUs.

A runtime error occurs when I use the PGI compiler (version 10.6) pgfortran and mpif90 to compile an MPI + CUDA Fortran code that calls some CULA routines and contains a kernel function.

After getting the runtime error, I wrote a simple test code to isolate the problem. There are only four files.

main.f : the main Fortran file, which initializes the MPI environment and gets the MPI rank and processor name, then calls a subroutine to set up CULA.

more_mpi.f : declares some MPI-related variables in a module

cluster.cuf : contains the subroutines that set up the device in different ways (init_cuda, init_cula), a status-checking routine (check_status), and a subroutine that tests a CULA routine (cluster).

acm_dev.cuf : declares some GPU device variables in a module


Problem 1:

I built the code from a pure Fortran version and then extended it with CUDA Fortran. The problem comes from acm_dev.cuf. If I compile acm_dev.cuf and link its object file into the final executable, mpi_cudafor, the "runtime error ( 36 )" message is printed when init_cula() is called from the main program (running on 4 processes), even though the variables in acm_dev are never used anywhere; no program unit or subroutine has "use acm_dev".

 cpuid:             3
 GPU device             3  will be selected
 Selecting Device FROM CULA
 cpuid:             0
 GPU device             0  will be selected
 Selecting Device FROM CULA
 runtime error (           36 )
 runtime error (           36 )
 cpuid:             1
 GPU device             1  will be selected
 Selecting Device FROM CULA
 runtime error (           36 )
 cpuid:             2
 GPU device             2  will be selected
 Selecting Device FROM CULA
 runtime error (           36 )

However, if I don’t link the object file (acm_dev.o) into the final executable, I don’t get the error when the program calls CULA_SELECTDEVICE().

The only difference is whether acm_dev.o is linked into the executable or not.

Problem 2:

If I still link acm_dev.o into the executable but call init_cuda() instead of init_cula(), the error does not occur.


Conclusion from the last two situations:

cudasetdevice() works fine.

CULA_SELECTDEVICE() does not work here.

Problem 3:

If the variables in acm_dev.cuf are declared inside a subroutine instead of in a module as before, the error does not occur. It really confuses me.
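
For reference, the variant that avoids the error looks roughly like the sketch below (the subroutine name test_local_dev and the array sizes are only illustrative; the declarations are copied from acm_dev.cuf):

subroutine test_local_dev()
   use cudafor
   ! same kind and device declarations as in acm_dev.cuf, but local to a
   ! subroutine instead of at module scope, so no module-level device data
   ! exists when the program starts
   integer, parameter :: b4 = selected_real_kind(4)
   complex(b4), device, allocatable :: c_dev(:,:), b_dev(:,:)

   allocate(c_dev(3,3), b_dev(3,3))   ! sizes here are arbitrary
   ! ... work with the device arrays ...
   deallocate(c_dev, b_dev)
end subroutine test_local_dev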

main.f

program main
   use more_mpi
   use cula_module
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
   call mpi_comm_size(mpi_comm_world,numprocs,ierr) 
   call mpi_get_processor_name(processor_name,namelen,ierr)
   
   ! call init_cuda()
   call init_cula()
   ! call cluster()
   
   call mpi_finalize(ierr)
end program

more_mpi.f

module more_mpi
   include 'mpif.h'
   integer :: ierr,cpuid,numprocs,namelen !mpi
   character(len=100) processor_name
end module

cluster.cuf

module cula_module
    use cudafor 
    use more_mpi
! ***** some CULA related variables *****
    INTEGER CULA_STATUS   
    INTEGER DEVICE_ID
    CHARACTER(len=100) BUF
    INTEGER BUF_SIZE
    PARAMETER (BUF_SIZE=100) 
! using DEVICE_INFO_BUF_SIZE = 100000 here caused a segmentation fault
  
    EXTERNAL CULA_SELECTDEVICE
    EXTERNAL CULA_INITIALIZE
    EXTERNAL cula_device_cgesvd
    EXTERNAL CULA_SHUTDOWN
    EXTERNAL CULA_GETDEVICEINFO
    EXTERNAL CULA_GETEXECUTINGDEVICE

  
    INTEGER CULA_SELECTDEVICE
    INTEGER CULA_INITIALIZE
    INTEGER cula_device_cgesvd
    INTEGER CULA_GETDEVICEINFO
    INTEGER CULA_GETEXECUTINGDEVICE
    
    INTEGER CULA_CGESV !cula
    
    integer :: gpuid,numdevices !gpu
    integer :: info
    type(cudadeviceprop) :: prop

    contains
    subroutine init_cuda()
                  
       info=cudaGetDeviceCount(numdevices)
       gpuid=mod(cpuid,numdevices)
       ! gpuid=1
       write(*,*) 'cpuid: ', cpuid
       write(*,*) 'GPU device ', gpuid, ' will be selected'
       info=cudasetdevice(gpuid)
       info=cudagetdeviceProperties(prop,gpuid)
       write(*,"(a9,i2,a12)") "There are",numdevices,"GPU device!"
       write (*,"(a21,i2,a4,i1,a4,a30)"), "Hello world! process ",cpuid," of ",numnodes," on ",processor_name
       write (*,"(a6,i2)") "GPU id",gpuid
       write (*,"(a12,a20)") "Device name ",prop%name
    end subroutine init_cuda   

    subroutine init_cula()
            
       gpuid=cpuid
       ! gpuid=1
       write(*,*) 'cpuid: ', cpuid
       write(*,*) 'GPU device ', gpuid, ' will be selected'

          
       WRITE(*,*) 'Selecting Device FROM CULA'
       CULA_STATUS = CULA_SELECTDEVICE(cpuid)
       CALL CHECK_STATUS(CULA_STATUS)

       WRITE(*,*) 'Initializing CULA'
       CULA_STATUS = CULA_INITIALIZE()
       CALL CHECK_STATUS(CULA_STATUS)

       ! info=cudasetdevice(gpuid)
       
       WRITE(*,*) 'Getting Device ID FROM CULA'
       CULA_STATUS = CULA_GETEXECUTINGDEVICE(DEVICE_ID)
       CALL CHECK_STATUS(CULA_STATUS)
       WRITE(*,*) "Device ID: ",DEVICE_ID 
       
       WRITE(*,*) 'Getting Device Info FROM CULA'
       CULA_STATUS = CULA_GETDEVICEINFO(DEVICE_ID, BUF, BUF_SIZE)
       CALL CHECK_STATUS(CULA_STATUS)
       WRITE(*,*) "BUF: ",BUF
       
    end subroutine init_cula     
    
    
    
    subroutine cluster()
   
       complex :: u(3,3),vt(4,4),a(3,4)
       real :: s(3)
       real :: start,finish
       complex,allocatable,device :: ad(:,:)
       integer :: pitch_ad
       complex,device :: ud(3,3),vtd(4,4)
       real,device :: sd(3)
    
       info=cudaGetDeviceCount(numdevices)
       gpuid=mod(cpuid,numdevices)
       ! gpuid=1
      ! write(*,*) 'cpuid: ', cpuid
      ! write(*,*) 'gpuid: ', gpuid
      ! info=cudasetdevice(gpuid)
      ! info=cudagetdeviceProperties(prop,gpuid)
      ! write(*,"(a9,i2,a12)") "There are",numdevices,"GPU device!"
      ! write (*,"(a21,i2,a4,i1,a4,a30)"), "Hello world! process ",cpuid," of ",numprocs," on ",processor_name
      ! write (*,"(a6,i2)") "GPU id",gpuid
      ! write (*,"(a12,a20)") "Device name ",prop%name
       
       
       m=3
       n=4
       lda=3
       ldu=3
       ldvt=4
       a=reshape((/ (5.91,-5.69), (-3.15,-4.08), (-4.89, 4.20), &
                    ( 7.09, 2.72), (-1.89, 3.27), ( 4.10,-6.70), &
                    ( 7.78,-4.06), ( 4.57,-2.07), ( 3.28,-3.84), &
                    (-0.79,-7.21), (-3.88,-3.30), ( 3.84, 1.19) /), (/3,4/))
       info=cudamallocpitch(ad,pitch_ad,n,m)
       info=cudamemcpy2d(ad,pitch_ad,a,n*4,n*4,m,cudamemcpyhosttodevice)
    
       info = cula_selectdevice(cpuid)
       call check_status(info)
       !Initialize CULA 
       info=cula_initialize()
       call check_status(info)
       call cpu_time(start)
       info=cula_device_cgesvd('a','a', M, N, ad, LDA, sd,ud, LDU,vtd, LDVT)
       call check_status(info)
       call cpu_time(finish)
       info=cudamemcpy(s,sd,3,cudamemcpydevicetohost)
       write(*,*) s
       write(*,*) "GPU time=",finish-start,"s"
       call cula_shutdown()
       info=cudafree(ad)
       info=cudafree(sd)
    end subroutine cluster
    
    subroutine check_status(culastatus)
       integer culastatus
       integer info
       integer cula_geterrorinfo
    
       info = cula_geterrorinfo()
       if (culastatus .ne. 0) then
          if (culastatus .eq. 7) then
             !culaargumenterror
             write(*,*) 'invalid value for parameter ', info
          else if (culastatus .eq. 8) then
             !culadataerror
             write(*,*) 'data error (', info ,')'
          else if (culastatus .eq. 9) then
             !culablaserror
             write(*,*) 'blas error (', info ,')'
          else if (culastatus .eq. 10) then
             !cularuntimeerror
             write(*,*) 'runtime error (', info ,')'
          else
             !others
             call cula_getstatusstring(culastatus)
          endif
          stop 1
       end if
    end subroutine check_status

end module cula_module

acm_dev.cuf

module acm_dev
    use cudafor
    integer, parameter:: b4 = selected_real_kind(4)
     complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
     complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)
     complex(b4), device, allocatable :: epsm1_dev, cn_dev
     
     complex(b4), device, allocatable :: base_dev(:,:) ! constant
     complex(b4), device, allocatable :: material_dev(:) ! constant
     complex(b4), device, allocatable :: ei_dev(:) ! constant
     
     integer, device, allocatable :: gene_dev(:,:)
                    ! integer, device, allocatable :: vector_dev(:) ! should be shared memory declared in a device subprogram
     integer, device, allocatable :: nbox_dev ! might not be needed
end module acm_dev

makefile:

.SUFFIXES: .cuf .o

L1= main.o cluster.o more_mpi.o acm_dev.o

PGFOR=pgfortran
PF90= mpif90

LINK1=  /opt/pgi/linux86-64/11.5/lib/libcudafor.a

#Change to -Mmpi2 for MPICH2
#MPI=-Mmpi
#add cuf
#CUDA=-ta=nvidia -Mcuda
CUDA=
#lib
CULALIB=-L${CULA_LIB_PATH_64} -lcula -lcula_pgfortran -llapack -lblas
#include
CULAINC= -I${CULA_INC_PATH}
#free format
PGFLAGS = -Mfree -O3
#MPICH include
MPICHINCLUDES=-I/opt/pgi/linux86-64/10.6/mpi/mpich/include/
#MPICH lib
MPICHLIBPATH64=-L/opt/pgi/linux86-64/10.6/mpi/mpich/lib/

mpi_cudafor: $(L1)
	$(PF90) $(PGFLAGS) $(L1) $(CULAINC) $(CULALIB)  $(LINK1) -o mpi_cudafor

.f.o:
	$(PF90) $(PGFLAGS) -c $(CULAINC) $(CULALIB) $<

.cuf.o:
	$(PGFOR) $(PGFLAGS) $(CUDA) $(CULAINC) $(CULALIB) -c $<	



main.o: main.f cluster.o more_mpi.o

cluster.o: cluster.cuf more_mpi.o

more_mpi.o: more_mpi.f 

acm_dev.o: acm_dev.cuf

clean:
	rm -f *.o *.mod mpi_cudafor
del:
	rm -f *edu

which mpif90 pgf90 pgfortran

/opt/lib/openmpi/1.4.2/pgi/10.6/bin/mpif90
/opt/pgi/linux86-64/10.6/bin/pgf90
/opt/pgi/linux86-64/10.6/bin/pgfortran

run the job:

mpiexec -np 4 ./mpi_cudafor

Hi bsb3166,

The 10.6 release was the first CUDA Fortran version to allow module device variables. In this first implementation, the device context was created at the start of the program, which made it impossible to change the device afterwards (and this is the cause of your errors). In the second implementation, released in 10.8, device context creation is delayed until first use.
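
Roughly speaking, with 10.8 and later an ordering like the sketch below should work (this is only an illustration, not tested against your code; the allocation simply stands in for the first use of device data): each rank selects its device before any module device data is touched, so the context gets created on the correct GPU.

program select_device_first
   use cudafor
   use more_mpi     ! ierr, cpuid, mpi_comm_world, ...
   use acm_dev      ! module device variables are fine with 10.8+
   integer :: istat

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,cpuid,ierr)

   ! bind this rank to a GPU before any device allocation or kernel launch
   ! (CULA_SELECTDEVICE before CULA_INITIALIZE should play the same role)
   istat = cudaSetDevice(cpuid)

   allocate(c_dev(3,3))   ! first use of device data creates the context here
   ! ... CUDA Fortran / CULA work ...
   deallocate(c_dev)

   call mpi_finalize(ierr)
end program select_device_first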

Please try using 10.8 or later. If possible, I’d recommend you use the latest version, 11.7, since we’ve added a lot of enhancements in the last year.

Best Regards,
Mat

Thank you so much. I’ll try 11.5.