If I use multiple GPUs, can I set device variables globally?

Am I right?

If I try to use multiple GPUs in an MPI environment (each CPU has access to one GPU), device memory should be allocated after cudasetdevice(gpuid) or cula_selectdevice(gpuid). Otherwise, copying data from the CPUs to the GPUs would result in an error, because before the device id is specified, all device memory is allocated on device 0 by default.

If, in my project, I want to use MPI, multiple GPUs, CULA device routines, and CUDA Fortran kernel functions, can I put all the device variables in a module to make them global variables, and then use this module in other subroutines?

Thank you very much.

Hi bsb3166,

I may not fully understand the question but I’ll do my best to answer.

If I try to use multiple GPUs in an MPI environment (each CPU has access to one GPU), device memory should be allocated after cudasetdevice(gpuid) or cula_selectdevice(gpuid). Otherwise, copying data from the CPUs to the GPUs would result in an error, because before the device id is specified, all device memory is allocated on device 0 by default.

In 10.8 or later, the device context is created at the time of first use. So after you set the device number via cudasetdevice, the device will be associated the first time you allocate an array, copy data to the device, or launch a kernel. Note that you shouldn’t use cula_selectdevice since it will attempt to create a different context. For details on using CULA from CUDA Fortran, the following two articles may be helpful.

Using GPU-enabled Math Libraries with PGI Fortran
Using the CULA GPU-enabled LAPACK Library with CUDA Fortran
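
As a rough sketch of that ordering (untested, and the names here are just placeholders), each MPI rank would do something like this:

program set_device_first
    use cudafor
    implicit none
    include 'mpif.h'
    integer :: ierr, myid, numdev, gpuid, istat
    real, device, allocatable :: a_dev(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

    ! pick a device for this rank before touching any device memory
    istat = cudaGetDeviceCount(numdev)
    gpuid = mod(myid, numdev)
    istat = cudaSetDevice(gpuid)

    ! the context is created on GPU 'gpuid' at the first device operation
    allocate( a_dev(1024) )
    a_dev = 0.0

    deallocate( a_dev )
    call MPI_FINALIZE(ierr)
end program set_device_first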

If, in my project, I want to use MPI, multiple GPUs, CULA device routines, and CUDA Fortran kernel functions, can I put all the device variables in a module to make them global variables, and then use this module in other subroutines?

You can have device module data that is accessible to the host and all device subroutines within the same module. If you have a release 11.4 or later and a device with compute capability 2.0 or higher, device module data is also accessible by device routines in other modules.
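
For example, a minimal sketch might look something like this (the names are made up):

module dev_data
    use cudafor
    ! device module data: visible to host code that uses this module and,
    ! within the same module, to device routines as well
    real, device, allocatable :: x_dev(:)
contains
    attributes(global) subroutine scale_kernel(s, n)
        real,    value :: s
        integer, value :: n
        integer :: i
        i = threadIdx%x + (blockIdx%x-1)*blockDim%x
        if (i <= n) x_dev(i) = s * x_dev(i)
    end subroutine scale_kernel
end module dev_data

From host code that uses this module you would allocate x_dev, copy your data into it, and launch scale_kernel.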

Keep in mind, though, that this data is not global across MPI processes. Each MPI process will have its own copy of the data, so it's up to your program to keep the data coherent. That said, this is normal for MPI programming.

Hope this helps,
Mat

Dear Mat,

I may not fully understand the question but I’ll do my best to answer.



In 10.8 or later, the device context is created at the time of first use. So after you set the device number via cudasetdevice, the device will be associated the first time you allocate an array, copy data to the device, or launch a kernel. Note that you shouldn’t use cula_selectdevice since it will attempt to create a different context. For details on using CULA from CUDA Fortran, the following two articles may be helpful.

The problem I ran into is with a simple test code that uses MPI + CUDA Fortran + CULA routines.

I run the code on 4 CPUs, each connected to its own GPU over PCIe.

This version gives a segmentation fault (can't find the memory address):

real,device :: sd(3)

cula_status = cula_selectdevice(gpuid)
call check_status(cula_status)

cula_status = cula_initialize()
call check_status(cula_status)

info=cula_device_cgesvd('a','a', M, N, ad, LDA, sd,ud, LDU,vtd, LDVT)

This version works fine:

real,allocatable,device :: sd(:)

cula_status = cula_selectdevice(gpuid)
call check_status(cula_status)

allocate( sd(3) )

cula_status = cula_initialize()
call check_status(cula_status)

info=cula_device_cgesvd('a','a', M, N, ad, LDA, sd,ud, LDU,vtd, LDVT)

The difference is that in the first version I declare sd(3) with a fixed size (not allocatable) before I call cula_selectdevice(), while in the second version I declare sd as allocatable and only allocate the memory after calling the select-device routine.

My guess is that the segmentation fault comes from real,device :: sd(3): it tries to set up the device memory at declaration time (every GPU is supposed to get its own device array sd(3)), but at that point no device has been selected for the 4 CPUs yet.

This issue makes me think about another problem in my project, which uses the same MPI + CUDA Fortran + CULA combination and also ends in a segmentation fault.

I have several device variables declared in a module as allocatable.

module acm_dev
    use numz, only : b4
    use cudafor
    
     complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
     complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)
     
     complex(b4), device, allocatable :: base_dev(:,:) ! constant
     complex(b4), device, allocatable :: material_dev(:) ! constant
     complex(b4), device, allocatable :: ei_dev(:) ! constant
     
     integer, device, allocatable :: gene_dev(:,:)

end module acm_dev

and I have a subroutine that allocates them and then copies the data from CPU to GPU.

subroutine laser_sub_1( )
        use acm_dev
        use cudafor
        allocate( c_dev(nmax,nmax), b_dev(nmax,1) )
        allocate( eps_dev(nmax), cnray_dev(nmax), ei_dev(nmax) )     
        allocate( base_dev(nmax,nmax), material_dev(0:3) ) 
        allocate( gene_dev(gene_size,pop_size) )

        c_dev(1:nmax,1:nmax) = c(1:nmax,1:nmax)
        b_dev = b(1:nmax,1:1) 
        eps_dev  = eps(1:nmax)
        cnray_dev = cnray(1:nmax)
        ei_dev = ei(1:nmax)  
        base_dev = base(1:nmax,1:nmax)
        material_dev = material(0:3)
        gene_dev = gene(1:gene_size,1:pop_size)
end subroutine laser_sub_1

I only declare the subroutine laser_sub_1; I never call it. Even so, it seems to affect my program's execution. The program runs on 4 cores with 4 GPUs.

When those assignment statements that copy data from CPU to GPU are present, the program runs about halfway and then crashes with a segmentation fault, address not mapped.

When I comment out those assignment statements, the program runs perfectly.

My project extends a CPU version of a PGA code to a GPU version to boost performance, and the CPU version has no segmentation faults, so I guess this might be a similar issue to the simple case I described above.

It still really confuses me: there should be no difference if I just put several lines there and never use them, but in practice there is.

If I change my subroutine

from

module acm_dev
    use numz, only : b4
    use cudafor
    
     complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
     complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)
     
     complex(b4), device, allocatable :: base_dev(:,:) ! constant
     complex(b4), device, allocatable :: material_dev(:) ! constant
     complex(b4), device, allocatable :: ei_dev(:) ! constant
     
     integer, device, allocatable :: gene_dev(:,:)

end module acm_dev

Module acm_dev is in a separate file. These are global allocatable device variables.

subroutine laser_sub_1( )
        use acm_dev
        use cudafor
        allocate( c_dev(nmax,nmax), b_dev(nmax,1) )
        allocate( eps_dev(nmax), cnray_dev(nmax), ei_dev(nmax) )     
        allocate( base_dev(nmax,nmax), material_dev(0:3) ) 
        allocate( gene_dev(gene_size,pop_size) )

        c_dev(1:nmax,1:nmax) = c(1:nmax,1:nmax)
        b_dev = b(1:nmax,1:1) 
        eps_dev  = eps(1:nmax)
        cnray_dev = cnray(1:nmax)
        ei_dev = ei(1:nmax)  
        base_dev = base(1:nmax,1:nmax)
        material_dev = material(0:3)
        gene_dev = gene(1:gene_size,1:pop_size)
end subroutine laser_sub_1

to

     subroutine laser_sub_1( )
        use numz, only : b4
        use cudafor

        complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
        complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)

        complex(b4), device, allocatable :: base_dev(:,:) ! constant
        complex(b4), device, allocatable :: material_dev(:) ! constant
        complex(b4), device, allocatable :: ei_dev(:) ! constant

        integer, device, allocatable :: gene_dev(:,:)

        ! select the device for this rank before any device allocation
        info=cudasetdevice(gpuid)

        allocate( c_dev(nmax,nmax), b_dev(nmax,1) )
        allocate( eps_dev(nmax), cnray_dev(nmax), ei_dev(nmax) )
        allocate( base_dev(nmax,nmax), material_dev(0:3) )
        allocate( gene_dev(gene_size,pop_size) )
        c_dev(1:nmax,1:nmax) = c(1:nmax,1:nmax)
        b_dev = b(1:nmax,1:1) 
        eps_dev  = eps(1:nmax)
        cnray_dev = cnray(1:nmax)
        ei_dev = ei(1:nmax)  
        base_dev = base(1:nmax,1:nmax)
        material_dev = material(0:3)
        gene_dev = gene(1:gene_size,1:pop_size)
        
     end subroutine laser_sub_1

my program works fine again.

The difference is that the later subroutine follows the steps in order: first declare the allocatable device variables, then set the device for each CPU, then allocate the memory, then copy from CPU to GPU. (The code is MPI code and, again, this subroutine is only declared; I never call it.)

Thank you, Mat. This thread has gotten quite long; thank you for your patience.

Chong

Hi Chong,

cula_status = cula_selectdevice(gpuid)

I haven’t done much with CULA myself, but I don’t think you want to use this function. I’m thinking it creates a separate context, performs its own initialization, and either wipes out or prevents ours from being performed. I’m not sure, but I’d try just using cudaSetDevice and the CULA CUDA Fortran module (i.e. ‘use cula’) as shown in the articles I posted above.
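
In other words, something along these lines (an untested sketch, assuming the ‘cula’ interface module from the CULA Fortran bindings):

    use cudafor
    use cula                          ! CULA's Fortran interface module

    integer :: istat, cula_status

    istat = cudaSetDevice(gpuid)      ! bind this MPI rank to its GPU first
    cula_status = cula_initialize()   ! then initialize CULA on that device
    call check_status(cula_status)    ! your existing error check

    ! ... allocate device arrays and call the cula_device_* routines ...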

Module acm_dev is in a separate file. These are global allocatable device variables.

What type of GPUs do you have? In order to support this feature, you need the unified memory model, which is available only on a Fermi card (i.e. compute capability 2.0).

  • Mat

I posted on the CULA forum to ask whether the way below

cuda_status = cudasetdevice(gpuid)
call check_status_cuda(cuda_status )

cula_status = cula_initialize()
call check_status(cula_status)

is OK for running CULA on multiple GPUs. A CULA engineer replied that it should work, and said that culaSelectDevice is merely a pass-through for cudaSetDevice, so that you are not forced to invoke the CUDA toolkit if you only need the host interface in CULA.

So I tried again using cudasetdevice(gpuid) instead of cula_selectdevice(gpuid), but the same problem still occurred.

This raises the question: does cula_initialize() create a separate context as well?

Does cula_initialize() create a separate context as well?

No, this should be fine.

Maybe something else is happening. Are you trying to call the CULA host routines with device data, or the CULA device routines with host data? Can you write up a simple example that shows your error?

To test cula_initialize, I took the following CULA example program and added some basic MPI calls. Granted, I’m only running on a single node with two GPUs, but it seems to work for me.

% cat test_cula.cuf 
       module cula_test
      
            use cudafor
            contains
                          
            ! gpu error reporting routine
            subroutine check_status(status)
            
                integer status
                integer info
                integer cula_geterrorinfo
                info = cula_geterrorinfo()
                if (status .ne. 0) then
                    if (status .eq. 7) then
                        write(*,*) 'invalid value for parameter ', info
                    else if (status .eq. 8) then
                        write(*,*) 'data error (', info ,')'
                    else if (status .eq. 9) then
                        write(*,*) 'blas error (', info ,')'
                    else if (status .eq. 10) then
                        write(*,*) 'runtime error (', info ,')'
                    else
                        call cula_getstatusstring(status)
                    endif
                    stop 1
                end if
                
            end subroutine
            
            ! cpu test (baseline)
            subroutine do_cpu_test(n,nrhs,ain,bin)
                
                ! input
                real,dimension(:,:) :: ain,bin
                
                ! allocations
                real,dimension(:,:),allocatable :: a,b,ans
                integer,dimension(:),allocatable :: ipiv
                integer n,nrhs
                integer c1,c2,cr,cm
                real norm
                
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
                a = ain
                b = bin                
                
                ! start test
                call system_clock( c1, cr, cm )
                print *, 'starting cpu test...'
                               
                ! call lapack solver
                call sgesv(n,nrhs,a,n,ipiv,b,n,info) 
                
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
                
                ! check answer
                ans = bin;
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
                
                ! cleanup
                deallocate(a,b,ipiv,ans)
                
            end subroutine do_cpu_test
            
            ! cula test (host interface)
            subroutine do_cula_host_test(n,nrhs,ain,bin)
                
                ! input
                real,dimension(:,:) :: ain,bin
                
                ! allocations (all on host)
                real,dimension(:,:),allocatable :: a,b,ans
                integer,dimension(:),allocatable :: ipiv
                integer n,nrhs,status
                integer c1,c2,cr,cm
                real norm
                
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
                a = ain
                b = bin                
                
                ! start test
                call system_clock( c1,cr,cm )
                print *, 'starting cula (host interface) test...'
                               
                ! call cula solver (host interface)
                status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)
                call check_status(status)
                
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
                
                ! check answer
                ans = bin;
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
                
                ! cleanup
                deallocate(a,b,ipiv,ans)
                
            end subroutine do_cula_host_test
                       
            ! cula test (device interface)
            subroutine do_cula_device_test(n,nrhs,ain,bin)
            
                ! input
                real,dimension(:,:) :: ain,bin
                
                ! allocations (all on host)
                real,dimension(:,:),allocatable :: a,b,ans
                integer n,nrhs,status
                integer c1,c2,cr,cm
                real norm
                
                ! gpu memory
                real,device,dimension(:,:),allocatable :: a_dev,b_dev
                integer,device,dimension(:),allocatable :: ipiv_dev
                
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ans(n,nrhs) )
                a(1:n,1:n) = ain
                b(1:n,1:nrhs) = bin                
                
                ! allocate gpu memory
                allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )
                
                ! start test
                call system_clock( c1,cr,cm )
                print *, 'starting cula (device interface) test...'
                
                ! copy memory to gpu
                a_dev = a
                b_dev = b
                
                ! call cula solver (device interface)
                status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)
                
                ! copy answer to host
                b = b_dev
                
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
                
                ! check answer
                ans(1:n,1:nrhs) = bin;
                call sgemm('n','n',n,nrhs,n,1.,ain,n,b,n,-1.,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
                
                ! cleanup
                deallocate(a,b,ans)
                deallocate(a_dev,b_dev,ipiv_dev)
                
            end subroutine do_cula_device_test
            
        end module cula_test
        
        ! main program

        program cula

            use cula_test
            include 'mpif.h'
 
            real error,eps
            
            ! Host memory
            real,dimension(:,:),allocatable :: a, b
            integer n, info, i, j, status
  	    integer :: ierr, myid, numprocs, rc, mydev, numdev

            call MPI_INIT( ierr )
            call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
            call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
   
            ierr = cudaGetDeviceCount(numdev)
	    mydev = mod(myid,numdev) 
 	
            print *, "Process ", myid, " of ", numprocs, " took GPU: ", mydev
            ierr = cudaSetDevice(mydev)

            n = 10000
            nrhs = 1

            print *,'cula + pgfortran test (matrix solve)'
            print *,'  array size: ', n, ' by ', n
            print *,'  right hand sides: ', nrhs
            print *,''
            allocate( a(n,n), b(n,nrhs) )
                                   
            ! intialize a and b
            call random_number(a)
            call random_number(b)
            
            ! Make sure a() isn't singular
            do i=1,n
                a(i,i) = 10. * a(i,i) + 10.
            enddo
                       
            ! initialize cula
            status = cula_initialize()
            call check_status(status)
            
            ! do cpu test (baseline)
            call do_cpu_test(n,nrhs,a,b)
                                           
            ! do gpu test (host interface)
            call do_cula_host_test(n,nrhs,a,b)
            
            ! do gpu test (device interface)
            call do_cula_device_test(n,nrhs,a,b)

            call MPI_FINALIZE(rc)
             
        end program cula

% pgf90 -Mmpi -V11.7 -fast -o test_cula.out test_cula.cuf -L/usr/local/cula/lib64 -lcula_pgfortran -lcula -lcublas -L /usr/pgi/linux86-64/2011/acml/4.4.0/lib -lacml
% mpirun -machinefile machines.LINUX -np 4 test_cula.out
 Process             0  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33464.90     ms
   gflops:    19.72215    
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    15007.96     ms
   gflops:    43.97665    
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    1733.560     ms
   gflops:    380.7195    
   error:   4.5987219E-03
 
 Process             3  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33225.06     ms
   gflops:    19.86452    
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    6606.524     ms
   gflops:    99.90125    
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    8241.342     ms
   gflops:    80.08405    
   error:   4.5987219E-03
 
 Process             1  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33354.10     ms
   gflops:    19.78767    
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    15385.24     ms
   gflops:    42.89825    
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    1710.793     ms
   gflops:    385.7860    
   error:   4.5987219E-03
 
 Process             2  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33443.48     ms
   gflops:    19.73479    
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    4807.774     ms
   gflops:    137.2777    
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    8166.234     ms
   gflops:    80.82061    
   error:   4.5987219E-03

Thanks a lot for your test code.

But I found a problem in the test results, both in yours and in mine.

The performance of cula_sgesv and cula_device_sgesv on one CPU with one GPU should be around 100 Gflops and 350 Gflops, respectively.

You would expect that when the code runs on multiple cores with multiple GPUs, the performance of the CULA host and device routines would stay the same as when they are executed on one core with one GPU.

However, from your test result, I see that the performance of the host CULA routine on CPUs 0, 1, 2, 3 is 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It should stay the same.

Do you think it's a problem with the CULA routines or with the devices?

I also ran some tests on CULA with code similar to yours and found the same problem, with an even worse performance drop.

I also tested similar routines in MAGMA, and there was no performance drop when running on multiple GPUs.

Below are my test results on 4 GPUs compared with a run on 1 CPU with 1 GPU.

My test code is slightly different from yours: I put the MPI initialization module in a separate file, compile the MPI files with mpif90 (pgf90 with the OpenMPI implementation), and compile the .cuf code with pgfortran.

1 CPU + 1 GPU:

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    36437.05     ms
   gflops:    18.11343
   error:   5.0552888E-03

 starting cula (host interface) test...
   runtime:    4183.666     ms
   gflops:    157.7564
   error:   1.2658446E-03

 starting cula (device interface) test...
   runtime:    3822.783     ms
   gflops:    172.6491
   error:   1.2658446E-03

4 CPUs + 4 GPUs:

Process             0  of             4  took GPU:             0
 Process             3  of             4  took GPU:             3
 Process             2  of             4  took GPU:             2
 Process             1  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
 starting cpu test...
 starting cpu test...
 starting cpu test...
   runtime:    35684.69     ms
   gflops:    18.49533
   error:   5.0552888E-03

   runtime:    35932.07     ms
   gflops:    18.36799
   runtime:    35987.90     ms
   gflops:    18.33949
   error:   5.0552888E-03

   error:   5.0552888E-03

   runtime:    37404.25     ms
   gflops:    17.64506
   error:   5.0552888E-03

 starting cula (host interface) test...
 starting cula (host interface) test...
 starting cula (host interface) test...
 starting cula (host interface) test...
   runtime:    7521.451     ms
   gflops:    87.74902
   error:   1.2658446E-03

   runtime:    170592.8     ms
   gflops:    3.868863
   error:   1.2658446E-03

   runtime:    178027.4     ms
   runtime:    178026.6     ms

   gflops:    3.707310
   gflops:    3.707294
   error:   1.2658446E-03

   error:   1.2658446E-03

 starting cula (device interface) test...
 starting cula (device interface) test...
 starting cula (device interface) test...
 starting cula (device interface) test...
   runtime:    241970.9     ms
   gflops:    2.727601
   error:   1.2658446E-03

   runtime:    242817.9     ms
   gflops:    2.718086
   error:   1.2658446E-03

   runtime:    245411.0     ms
   gflops:    2.689365
   error:   1.2658446E-03

   runtime:    245853.4     ms
   gflops:    2.684527
   error:   1.2658446E-03

Compared with 157.75 and 172.64 Gflops on a single CPU with a single GPU, the performance on 4 GPUs drops to about 3 Gflops and 2 Gflops for the host and device CULA sgesv routines.

I'm working on a project that will use these CULA routines with MPI and multiple GPUs. With this issue, there is definitely no advantage in using the CULA routines instead of the same routines in MKL (the MKL routine I tested gets about 18 Gflops).

Do you have any experience with this? I'd really appreciate any advice you can give.

My test code:
test_cula.cuf

module cula_test 
      
            use cudafor 
            contains 
                          
            ! gpu error reporting routine 
            subroutine check_status(status) 
            
                integer status 
                integer info 
                integer cula_geterrorinfo 
                info = cula_geterrorinfo() 
                if (status .ne. 0) then 
                    if (status .eq. 7) then 
                        write(*,*) 'invalid value for parameter ', info 
                    else if (status .eq. 8) then 
                        write(*,*) 'data error (', info ,')' 
                    else if (status .eq. 9) then 
                        write(*,*) 'blas error (', info ,')' 
                    else if (status .eq. 10) then 
                        write(*,*) 'runtime error (', info ,')' 
                    else 
                        call cula_getstatusstring(status) 
                    endif 
                    stop 1 
                end if 
                
            end subroutine 
            
            ! cpu test (baseline) 
            subroutine do_cpu_test(n,nrhs,ain,bin) 
                
                ! input 
                real,dimension(:,:) :: ain,bin 
                
                ! allocations 
                real,dimension(:,:),allocatable :: a,b,ans 
                integer,dimension(:),allocatable :: ipiv 
                integer n,nrhs 
                integer c1,c2,cr,cm 
                real norm 
                
                ! back up input for reconstruction test 
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) ) 
                a = ain 
                b = bin                
                
                ! start test 
                call system_clock( c1, cr, cm ) 
                print *, 'starting cpu test...' 
                                
                ! call lapack solver 
                call sgesv(n,nrhs,a,n,ipiv,b,n,info) 
                
                ! stop test 
                call system_clock( count=c2 ) 
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms' 
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9) 
                
                ! check answer 
                ans = bin; 
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n) 
                norm = slange('1',n,nrhs,ans,n,work) / real(n) 
                print *, '  error:', norm 
                print *, '' 
                
                ! cleanup 
                deallocate(a,b,ipiv,ans) 
                
            end subroutine do_cpu_test 
            
            ! cula test (host interface) 
            subroutine do_cula_host_test(n,nrhs,ain,bin) 
                
                ! input 
                real,dimension(:,:) :: ain,bin 
                
                ! allocations (all on host) 
                real,dimension(:,:),allocatable :: a,b,ans 
                integer,dimension(:),allocatable :: ipiv 
                integer n,nrhs,status 
                integer c1,c2,cr,cm 
                real norm 
                
                ! back up input for reconstruction test 
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) ) 
                a = ain 
                b = bin                
                
                ! start test 
                call system_clock( c1,cr,cm ) 
                print *, 'starting cula (host interface) test...' 
                                
                ! call cula solver (host interface) 
                status = cula_sgesv(n,nrhs,a,n,ipiv,b,n) 
                call check_status(status) 
                
                ! stop test 
                call system_clock( count=c2 ) 
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms' 
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9) 
                
                ! check answer 
                ans = bin; 
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n) 
                norm = slange('1',n,nrhs,ans,n,work) / real(n) 
                print *, '  error:', norm 
                print *, '' 
                
                ! cleanup 
                deallocate(a,b,ipiv,ans) 
                
            end subroutine do_cula_host_test 
                        
            ! cula test (device interface) 
            subroutine do_cula_device_test(n,nrhs,ain,bin) 
            
                ! input 
                real,dimension(:,:) :: ain,bin 
                
                ! allocations (all on host) 
                real,dimension(:,:),allocatable :: a,b,ans 
                integer n,nrhs,status 
                integer c1,c2,cr,cm 
                real norm 
                
                ! gpu memory 
                real,device,dimension(:,:),allocatable :: a_dev,b_dev 
                integer,device,dimension(:),allocatable :: ipiv_dev 
                
                ! back up input for reconstruction test 
                allocate( a(n,n), b(n,nrhs), ans(n,nrhs) ) 
                a(1:n,1:n) = ain 
                b(1:n,1:nrhs) = bin                
                
                ! allocate gpu memory 
                allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) ) 
                
                ! start test 
                call system_clock( c1,cr,cm ) 
                print *, 'starting cula (device interface) test...' 
                
                ! copy memory to gpu 
                a_dev = a 
                b_dev = b 
                
                ! call cula solver (device interface) 
                status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n) 
                
                ! copy answer to host 
                b = b_dev 
                
                ! stop test 
                call system_clock( count=c2 ) 
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms' 
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9) 
                
                ! check answer 
                ans(1:n,1:nrhs) = bin; 
                call sgemm('n','n',n,nrhs,n,1.,ain,n,b,n,-1.,ans,n) 
                norm = slange('1',n,nrhs,ans,n,work) / real(n) 
                print *, '  error:', norm 
                print *, '' 
                
                ! cleanup 
                deallocate(a,b,ans) 
                deallocate(a_dev,b_dev,ipiv_dev) 
                
            end subroutine do_cula_device_test 
            
        end module cula_test 
        
        ! main program 

        program cula 

            use cula_test 
            use more_mpi

            
            ! Host memory 
            real,dimension(:,:),allocatable :: a, b 
            integer n, info, i, j, status 
            integer :: rc, mydev, numdev 

            
            call init
    
            ierr = cudaGetDeviceCount(numdev) 
            mydev =  mod(cpuid,numdev) 
     
            print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev 
            !ierr = cudaSetDevice(mydev) 
            ierr = cula_selectdevice(mydev)
            call check_status(ierr)

            n = 10000 
            nrhs = 1 

            print *,'cula + pgfortran test (matrix solve)' 
            print *,'  array size: ', n, ' by ', n 
            print *,'  right hand sides: ', nrhs 
            print *,'' 
            allocate( a(n,n), b(n,nrhs) ) 
                                    
            ! intialize a and b 
            call random_number(a) 
            call random_number(b) 
            
            ! Make sure a() isn't singular 
            do i=1,n 
                a(i,i) = 10. * a(i,i) + 10. 
            enddo 
            
            call MPI_Barrier(mpi_comm_world,ierr)
            
            ! initialize cula 
            status = cula_initialize() 
            call check_status(status) 
            
            ! do cpu test (baseline) 
            call do_cpu_test(n,nrhs,a,b) 
            call MPI_Barrier(mpi_comm_world,ierr)
                                
            ! do gpu test (host interface) 
            call do_cula_host_test(n,nrhs,a,b) 
            call MPI_Barrier(mpi_comm_world,ierr)
            
            ! do gpu test (device interface) 
            call do_cula_device_test(n,nrhs,a,b) 
            call MPI_Barrier(mpi_comm_world,ierr)

            call MPI_FINALIZE(rc) 
              
        end program cula

init.f

subroutine init
   use more_mpi
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
   call mpi_comm_size(mpi_comm_world,numprocs,ierr) 
   call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init

more_mpi.f

module more_mpi
   include 'mpif.h'
   integer :: ierr,cpuid,numprocs,namelen !mpi
   character(len=100) processor_name
end module

makefile

.SUFFIXES: .cuf .o

L1= test_cula.o more_mpi.o init.o
L2= 

CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}


CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}

CUDALIBS= -lcudart -lcuda


GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart 

PGFLAGS= -Mfree -O3

#CUDA= -ta=nvidia -Mcuda
CUDA=


SOPT= 
LINK1=  /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       -lpthread

#LINK_CU=  /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU=  /opt/pgi/linux86-64/11.5/lib/libcudafor.a



PF90= mpif90

PGFOR= pgfortran


PGA_EX= cula_test_mgpus

darwin: $(L1) $(L2)
	$(PF90) $(SOPT) $(PGFLAGS) $(L1) $(L2) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)

.f.o:
	$(PF90) $(SOPT) $(PGFLAGS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -c $<

.cuf.o:
	$(PGFOR) $(SOPT) $(PGFLAGS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -c $<

test_cula.o: test_cula.cuf init.o more_mpi.o

more_mpi.o: more_mpi.f

init.o: init.f more_mpi.o


clean:
	/bin/rm -f *o *mod $(L1) $(L2) $(PGA_EX)
	
del:
	rm -f *.mio.mines.edu *.000 *.001 *.002 *.003

move:
	mv *.edu ./result

However, from your test result, I see that the performance of the host CULA routine on CPUs 0, 1, 2, 3 is 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It should stay the same.

This is because I was running 4 MPI processes but only have 2 GPUs. When I run with only 2 MPI processes, I get the expected 350 Gflops.
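
If it helps, you can make the test warn about this case by checking the rank count against the device count up front (a rough sketch using the same variables as in my test program):

            ierr = cudaGetDeviceCount(numdev)
            if (numprocs > numdev) then
               ! two or more ranks will share a GPU, so per-rank Gflops will drop
               if (myid == 0) print *, 'Warning: ', numprocs, ' ranks sharing ', &
                                       numdev, ' GPUs'
            end if
            mydev = mod(myid, numdev)
            ierr  = cudaSetDevice(mydev)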

With your test code, I get similar results: there is a slight slowdown going from 1 to 2 MPI processes, and a much larger slowdown going to 4, since I only have 2 GPUs. Why you're seeing such a large slowdown I'm not sure. Granted, I'm using MPICH rather than OpenMPI, but I doubt that's the cause.

  • Mat
% mpirun -machinefile machines.LINUX -np 1 test_cula.out
 Process             0  of             1  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    32975.53     ms
   gflops:    20.01484
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    1804.901     ms
   gflops:    365.6710
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    1691.823     ms
   gflops:    390.1117
   error:   4.5987219E-03

% mpirun -machinefile machines.LINUX -np 2 test_cula.out
 Process             0  of             2  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    32948.79     ms
   gflops:    20.03108
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    1997.535     ms
   gflops:    330.4072
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    1791.703     ms
   gflops:    368.3646
   error:   4.5987219E-03

 Process             1  of             2  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33058.83     ms
   gflops:    19.96441
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    1940.003     ms
   gflops:    340.2057
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    1760.786     ms
   gflops:    374.8326
   error:   4.5987219E-03

% mpirun -machinefile machines.LINUX -np 4 test_cula.out
 Process             0  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33447.90     ms
   gflops:    19.73218
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    7527.005     ms
   gflops:    87.68427
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    4662.672     ms
   gflops:    141.5497
   error:   4.5987219E-03

 Process             2  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33480.27     ms
   gflops:    19.71311
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    7160.245     ms
   gflops:    92.17562
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    3568.168     ms
   gflops:    184.9689
   error:   4.5987219E-03

 Process             1  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33291.96     ms
   gflops:    19.82460
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    6522.568     ms
   gflops:    101.1871
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    6788.867     ms
   gflops:    97.21799
   error:   4.5987219E-03

 Process             3  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33557.15     ms
   gflops:    19.66794
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    8348.127     ms
   gflops:    79.05965
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    6695.790     ms
   gflops:    98.56940
   error:   4.5987219E-03

Thank you for your reply.

Is the mpirun you use the one in the PGI compiler subdirectory, “/opt/pgi/linux86-64/2010/mpi/mpich/bin/mpirun”, or an MPICH you built elsewhere?

I tried to do it the same way you did, but I get some errors when I run it.

pgf90 -Mmpi -fast -o test_cula test_cula.cuf -L/opt/pgi/linux86-64/2010/mpi/mpich/lib/ -L/opt/development/gpu/current/cula/lib64 -lcula_pgfortran -lcula -lcublas -L/opt/pgi/linux86-64/2011/acml/4.4.0/lib -lacml



-bash-3.2$ cat run3
/opt/pgi/linux86-64/2010/mpi/mpich/bin/mpirun -np 3 ./test_cula
-bash-3.2$ ./run3
p1_22929:  p4_error: interrupt SIGFPE: 8
p2_22997:  p4_error: interrupt SIGFPE: 8
p0_22918:  p4_error: interrupt SIGx: 13
p0_22918: (7.636719) net_send: could not write to fd=4, errno = 32

Do you know how to solve this problem?

Thank you in advance.