If I try to use multiple GPUs in an MPI environment (each CPU has access to one GPU), device memory should be allocated after cudaSetDevice(gpuid) or cula_selectdevice(gpuid). Otherwise, copying data from the CPUs to the GPUs results in an error, because before the device ID is specified, all device memory is allocated on device 0 by default.
If, in my project, I want to use MPI, multiple GPUs, CULA device routines, and CUDA Fortran kernel functions, can I put all the device variables in a module so that they are global, and then use that module in other subroutines?
I may not fully understand the question but I’ll do my best to answer.
If I try to use multiple GPUs in an MPI environment (each CPU has access to one GPU), device memory should be allocated after cudaSetDevice(gpuid) or cula_selectdevice(gpuid). Otherwise, copying data from the CPUs to the GPUs results in an error, because before the device ID is specified, all device memory is allocated on device 0 by default.
In release 10.8 or later, the device context is created at the time of first use. So after you set the device number via cudaSetDevice, the device will be associated the first time you allocate an array, copy data to the device, or launch a kernel. Note that you shouldn’t use cula_selectdevice, since it will attempt to create a different context. For details on using CULA from CUDA Fortran, the following two articles may be helpful.
If, in my project, I want to use MPI, multiple GPUs, CULA device routines, and CUDA Fortran kernel functions, can I put all the device variables in a module so that they are global, and then use that module in other subroutines?
You can have device module data that is accessible to the host and to all device subroutines within the same module. If you are using release 11.4 or later and a device with compute capability 2.0 or higher, device module data is also accessible from device routines in other modules.
Keep in mind, though, that this data is not global across MPI processes. Each MPI process has its own copy of the data, so it is up to your program to keep the data coherent; this is normal for MPI programming.
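For illustration, a minimal sketch of that layout (the module, array, and subroutine names here are hypothetical, and error checking is omitted): the rank's device is selected before anything touches the GPU, the module's device array is then allocated on that device, and both host code and kernels that use the module see the same per-process copy.

module dev_data_sketch                        ! hypothetical module holding the global device data
  use cudafor
  real, device, allocatable :: d_field(:)     ! one copy per MPI process
contains
  attributes(global) subroutine scale_field(alpha, n)
    ! kernel in the same module: it can reference d_field directly
    real,    value :: alpha
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) d_field(i) = alpha * d_field(i)
  end subroutine scale_field
end module dev_data_sketch

subroutine setup_device_data(myid, n)
  ! host code in another file can 'use' the module and see the same data
  use cudafor
  use dev_data_sketch
  integer :: myid, n, numdev, istat
  istat = cudaGetDeviceCount(numdev)
  istat = cudaSetDevice(mod(myid, numdev))    ! select this rank's GPU first ...
  allocate(d_field(n))                        ! ... so the allocation lands on it
  d_field = 1.0                               ! implicit host-to-device copy
  call scale_field<<<(n+255)/256, 256>>>(2.0, n)
end subroutine setup_device_data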
In release 10.8 or later, the device context is created at the time of first use. So after you set the device number via cudaSetDevice, the device will be associated the first time you allocate an array, copy data to the device, or launch a kernel. Note that you shouldn’t use cula_selectdevice, since it will attempt to create a different context. For details on using CULA from CUDA Fortran, the following two articles may be helpful.
The problem I ran into is with a simple test code that combines MPI, CUDA Fortran, and CULA routines.
I run code like this on 4 CPUs, each connected to one of 4 GPUs over PCIe, in two slightly different versions.
The difference is that in one version I declare sd(3) with a fixed dimension (not allocatable) before calling cula_selectdevice(), while in the other I declare sd() as allocatable before calling cula_selectdevice() and allocate the memory only after the select-device call.
My guess is that the segmentation fault comes from real,device :: sd(3) trying to allocate memory before a device has been selected for each of the 4 CPUs (every GPU is supposed to get its own copy of the device array sd(3)).
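To make the comparison concrete, here is a minimal sketch of the two variants as I read the description (gpuid and s_host stand in for values coming from the surrounding MPI/host code; cudaSetDevice is shown, though the original test used cula_selectdevice):

! variant 1: fixed-size device array. Its storage comes into existence on
! entry to the subroutine, before any device has been selected, so every
! rank's sd initially binds to the default device 0.
subroutine variant_static(gpuid, s_host)
  use cudafor
  integer :: gpuid, istat
  real    :: s_host(3)
  real, device :: sd(3)              ! created before the device is chosen
  istat = cudaSetDevice(gpuid)
  sd = s_host                        ! copy may now target the wrong device
end subroutine variant_static

! variant 2: allocatable device array. The device is selected first and the
! allocation follows, so the array lands on the intended GPU.
subroutine variant_alloc(gpuid, s_host)
  use cudafor
  integer :: gpuid, istat
  real    :: s_host(3)
  real, device, allocatable :: sd(:)
  istat = cudaSetDevice(gpuid)       ! pick the device first
  allocate(sd(3))                    ! allocated on that device
  sd = s_host
  deallocate(sd)
end subroutine variant_alloc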
This made me think of another issue in my project, which also combines MPI, CUDA Fortran, and CULA and also ends in a segmentation fault.
I have several device variables declared in a module as allocatable.
I only declare the subroutine laser_sub_1 and never call it, yet it seems to affect my program’s execution. The program runs on 4 cores with 4 GPUs.
When the assignment statements that copy data from the GPU back to the CPU are present, the program runs about halfway and then crashes with a segmentation fault (memory address not mapped).
When I comment out those assignment statements, the program runs perfectly.
My project extends a CPU-only PGA code to a GPU version to boost performance, and the CPU version has no segmentation faults, so I suspect this is a similar issue to the simple case I described above.
It still really confuses me, though: a few lines that are never used should make no difference, yet in practice they do.
The difference is that this later subroutine follows the steps of first declaring allocatable device variables, then setting the device for each CPU, then allocating the memory, then copying from CPU to GPU. (The code is MPI code, and again, the subroutine just sits there; I never call it.)
Thank you, Mat. This thread is getting quite long; thank you for your patience.
I haven’t done much with CULA myself, but I don’t think you want to use this function. I suspect it creates a separate context, performs its own initialization, and either wipes out or prevents ours from being performed. I’m not sure, but I’d try just using cudaSetDevice and the CULA CUDA Fortran module (i.e., ‘use cula’), as shown in the articles I posted above.
Module acm_dev is in a separate file and holds the global allocatable device variables.
What type of GPUs do you have? In order to support this feature, you need the unified memory model, which is available only on a Fermi card (i.e., compute capability 2.0).
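If it helps, the compute capability can be queried at runtime from CUDA Fortran; a minimal sketch using the cudafor module:

program check_cc
  use cudafor
  type(cudaDeviceProp) :: prop
  integer :: istat, ndev, i
  istat = cudaGetDeviceCount(ndev)
  do i = 0, ndev - 1
    istat = cudaGetDeviceProperties(prop, i)
    ! Fermi-class devices report a major compute capability of 2 or higher
    print *, 'device', i, ': compute capability', prop%major, '.', prop%minor
  end do
end program check_cc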
OK, on running CULA on multiple GPUs: a CULA engineer replied that it should work, and said that culaSelectDevice is merely a pass-through for cudaSetDevice, so that you are not forced to invoke the CUDA toolkit if you only need the host interface of CULA.
So I tried again using cudaSetDevice(gpuid) instead of cula_selectdevice(gpuid), but the same problem occurred.
This raises the question: does cula_initialize() create a separate context as well?
Does cula_initialize() create a separate context as well?
No, this should be fine.
Maybe something else is happening. Are you trying to call the CULA host routines with device data, or the CULA device routines with host data? Can you write up a simple example that shows your error?
To test cula_initialize, I took the following CULA example program and added some basic MPI calls. Granted, I’m only running on a single node with two GPUs, but it seems to work for me.
% cat test_cula.cuf
module cula_test
use cudafor
contains
! gpu error reporting routine
subroutine check_status(status)
integer status
integer info
integer cula_geterrorinfo
info = cula_geterrorinfo()
if (status .ne. 0) then
if (status .eq. 7) then
write(*,*) 'invalid value for parameter ', info
else if (status .eq. 8) then
write(*,*) 'data error (', info ,')'
else if (status .eq. 9) then
write(*,*) 'blas error (', info ,')'
else if (status .eq. 10) then
write(*,*) 'runtime error (', info ,')'
else
call cula_getstatusstring(status)
endif
stop 1
end if
end subroutine
! cpu test (baseline)
subroutine do_cpu_test(n,nrhs,ain,bin)
! input
real,dimension(:,:) :: ain,bin
! allocations
real,dimension(:,:),allocatable :: a,b,ans
integer,dimension(:),allocatable :: ipiv
integer n,nrhs
integer c1,c2,cr,cm
real norm
! back up input for reconstruction test
allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
a = ain
b = bin
! start test
call system_clock( c1, cr, cm )
print *, 'starting cpu test...'
! call lapack solver
call sgesv(n,nrhs,a,n,ipiv,b,n,info)
! stop test
call system_clock( count=c2 )
print *, ' runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
print *, ' gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
! check answer
ans = bin;
call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
norm = slange('1',n,nrhs,ans,n,work) / real(n)
print *, ' error:', norm
print *, ''
! cleanup
deallocate(a,b,ipiv,ans)
end subroutine do_cpu_test
! cula test (host interface)
subroutine do_cula_host_test(n,nrhs,ain,bin)
! input
real,dimension(:,:) :: ain,bin
! allocations (all on host)
real,dimension(:,:),allocatable :: a,b,ans
integer,dimension(:),allocatable :: ipiv
integer n,nrhs,status
integer c1,c2,cr,cm
real norm
! back up input for reconstruction test
allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
a = ain
b = bin
! start test
call system_clock( c1,cr,cm )
print *, 'starting cula (host interface) test...'
! call cula solver (host interface)
status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)
call check_status(status)
! stop test
call system_clock( count=c2 )
print *, ' runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
print *, ' gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
! check answer
ans = bin;
call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
norm = slange('1',n,nrhs,ans,n,work) / real(n)
print *, ' error:', norm
print *, ''
! cleanup
deallocate(a,b,ipiv,ans)
end subroutine do_cula_host_test
! cula test (device interface)
subroutine do_cula_device_test(n,nrhs,ain,bin)
! input
real,dimension(:,:) :: ain,bin
! allocations (all on host)
real,dimension(:,:),allocatable :: a,b,ans
integer n,nrhs,status
integer c1,c2,cr,cm
real norm
! gpu memory
real,device,dimension(:,:),allocatable :: a_dev,b_dev
integer,device,dimension(:),allocatable :: ipiv_dev
! back up input for reconstruction test
allocate( a(n,n), b(n,nrhs), ans(n,nrhs) )
a(1:n,1:n) = ain
b(1:n,1:nrhs) = bin
! allocate gpu memory
allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )
! start test
call system_clock( c1,cr,cm )
print *, 'starting cula (device interface) test...'
! copy memory to gpu
a_dev = a
b_dev = b
! call cula solver (device interface)
status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)
! copy answer to host
b = b_dev
! stop test
call system_clock( count=c2 )
print *, ' runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
print *, ' gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
! check answer
ans(1:n,1:nrhs) = bin;
call sgemm('n','n',n,nrhs,n,1.,ain,n,b,n,-1.,ans,n)
norm = slange('1',n,nrhs,ans,n,work) / real(n)
print *, ' error:', norm
print *, ''
! cleanup
deallocate(a,b,ans)
deallocate(a_dev,b_dev,ipiv_dev)
end subroutine do_cula_device_test
end module cula_test
! main program
program cula
use cula_test
include 'mpif.h'
real error,eps
! Host memory
real,dimension(:,:),allocatable :: a, b
integer n, info, i, j, status
integer :: ierr, myid, numprocs, rc, mydev, numdev
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
ierr = cudaGetDeviceCount(numdev)
mydev = mod(myid,numdev)
print *, "Process ", myid, " of ", numprocs, " took GPU: ", mydev
ierr = cudaSetDevice(mydev)
n = 10000
nrhs = 1
print *,'cula + pgfortran test (matrix solve)'
print *,' array size: ', n, ' by ', n
print *,' right hand sides: ', nrhs
print *,''
allocate( a(n,n), b(n,nrhs) )
! intialize a and b
call random_number(a)
call random_number(b)
! Make sure a() isn't singular
do i=1,n
a(i,i) = 10. * a(i,i) + 10.
enddo
! initialize cula
status = cula_initialize()
call check_status(status)
! do cpu test (baseline)
call do_cpu_test(n,nrhs,a,b)
! do gpu test (host interface)
call do_cula_host_test(n,nrhs,a,b)
! do gpu test (device interface)
call do_cula_device_test(n,nrhs,a,b)
call MPI_FINALIZE(rc)
end program cula
% pgf90 -Mmpi -V11.7 -fast -o test_cula.out test_cula.cuf -L/usr/local/cula/lib64 -lcula_pgfortran -lcula -lcublas -L /usr/pgi/linux86-64/2011/acml/4.4.0/lib -lacml
% mpirun -machinefile machines.LINUX -np 4 test_cula.out
Process 0 of 4 took GPU: 0
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33464.90 ms
gflops: 19.72215
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 15007.96 ms
gflops: 43.97665
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 1733.560 ms
gflops: 380.7195
error: 4.5987219E-03
Process 3 of 4 took GPU: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33225.06 ms
gflops: 19.86452
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 6606.524 ms
gflops: 99.90125
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 8241.342 ms
gflops: 80.08405
error: 4.5987219E-03
Process 1 of 4 took GPU: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33354.10 ms
gflops: 19.78767
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 15385.24 ms
gflops: 42.89825
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 1710.793 ms
gflops: 385.7860
error: 4.5987219E-03
Process 2 of 4 took GPU: 0
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33443.48 ms
gflops: 19.73479
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 4807.774 ms
gflops: 137.2777
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 8166.234 ms
gflops: 80.82061
error: 4.5987219E-03
But I found a problem in the test results, both in yours and in mine.
The performance of cula_sgesv and cula_device_sgesv on one CPU with one GPU should be around 100 Gflops and 350 Gflops, respectively.
One would expect that when running on multiple cores with multiple GPUs, the performance of the CULA host and device routines would stay the same as when they are executed on one core with one GPU.
However, in your test results I see that the host CULA routine on CPUs 0, 1, 2, and 3 achieves 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It is supposed to stay the same.
Do you think the problem is in the CULA routines or in the devices?
I also ran some tests on CULA with code similar to yours and found the same problem, with an even worse performance drop.
I also tested similar routines from MAGMA, and there was no performance drop when running on multiple GPUs.
Below are my test results on 4 GPUs, compared with a run on 1 CPU with 1 GPU.
My test code is slightly different from yours: I put the MPI initialization in a separate file. I compile the MPI files with mpif90 (pgf90 with an OpenMPI implementation) and the .cuf code with pgfortran.
1 CPU + 1 GPU:
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 36437.05 ms
gflops: 18.11343
error: 5.0552888E-03
starting cula (host interface) test...
runtime: 4183.666 ms
gflops: 157.7564
error: 1.2658446E-03
starting cula (device interface) test...
runtime: 3822.783 ms
gflops: 172.6491
error: 1.2658446E-03
4 CPUs + 4 GPUs:
Process 0 of 4 took GPU: 0
Process 3 of 4 took GPU: 3
Process 2 of 4 took GPU: 2
Process 1 of 4 took GPU: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
starting cpu test...
starting cpu test...
starting cpu test...
runtime: 35684.69 ms
gflops: 18.49533
error: 5.0552888E-03
runtime: 35932.07 ms
gflops: 18.36799
runtime: 35987.90 ms
gflops: 18.33949
error: 5.0552888E-03
error: 5.0552888E-03
runtime: 37404.25 ms
gflops: 17.64506
error: 5.0552888E-03
starting cula (host interface) test...
starting cula (host interface) test...
starting cula (host interface) test...
starting cula (host interface) test...
runtime: 7521.451 ms
gflops: 87.74902
error: 1.2658446E-03
runtime: 170592.8 ms
gflops: 3.868863
error: 1.2658446E-03
runtime: 178027.4 ms
runtime: 178026.6 ms
gflops: 3.707310
gflops: 3.707294
error: 1.2658446E-03
error: 1.2658446E-03
starting cula (device interface) test...
starting cula (device interface) test...
starting cula (device interface) test...
starting cula (device interface) test...
runtime: 241970.9 ms
gflops: 2.727601
error: 1.2658446E-03
runtime: 242817.9 ms
gflops: 2.718086
error: 1.2658446E-03
runtime: 245411.0 ms
gflops: 2.689365
error: 1.2658446E-03
runtime: 245853.4 ms
gflops: 2.684527
error: 1.2658446E-03
Compared with 157.75 and 172.64 Gflops on a single CPU with a single GPU, on 4 GPUs the performance drops to roughly 3 Gflops and 2 Gflops for the host and device CULA sgesv routines, respectively.
I’m working on a project that will use these CULA routines with MPI and multiple GPUs. With this issue, there is definitely no advantage to using the CULA routines instead of the same routines in MKL (the MKL routine I tested runs at about 18 Gflops).
Do you have any experience with this? I would really appreciate any advice you can give.
My test code:
test_cula.cuf
! module cula_test is identical to the listing in the earlier post above
! main program
program cula
use cula_test
use more_mpi
! Host memory
real,dimension(:,:),allocatable :: a, b
integer n, info, i, j, status
integer :: rc, mydev, numdev
call init
ierr = cudaGetDeviceCount(numdev)
mydev = mod(cpuid,numdev)
print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
!ierr = cudaSetDevice(mydev)
ierr = cula_selectdevice(mydev)
call check_status(ierr)
n = 10000
nrhs = 1
print *,'cula + pgfortran test (matrix solve)'
print *,' array size: ', n, ' by ', n
print *,' right hand sides: ', nrhs
print *,''
allocate( a(n,n), b(n,nrhs) )
! intialize a and b
call random_number(a)
call random_number(b)
! Make sure a() isn't singular
do i=1,n
a(i,i) = 10. * a(i,i) + 10.
enddo
call MPI_Barrier(mpi_comm_world,ierr)
! initialize cula
status = cula_initialize()
call check_status(status)
! do cpu test (baseline)
call do_cpu_test(n,nrhs,a,b)
call MPI_Barrier(mpi_comm_world,ierr)
! do gpu test (host interface)
call do_cula_host_test(n,nrhs,a,b)
call MPI_Barrier(mpi_comm_world,ierr)
! do gpu test (device interface)
call do_cula_device_test(n,nrhs,a,b)
call MPI_Barrier(mpi_comm_world,ierr)
call MPI_FINALIZE(rc)
end program cula
init.f
subroutine init
use more_mpi
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
call mpi_comm_size(mpi_comm_world,numprocs,ierr)
call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init
more_mpi.f
module more_mpi
include 'mpif.h'
integer :: ierr,cpuid,numprocs,namelen !mpi
character(len=100) processor_name
end module
However, in your test results I see that the host CULA routine on CPUs 0, 1, 2, and 3 achieves 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It is supposed to stay the same.
This is because I was running 4 MPI processes but only have 2 GPUs. When I run with only 2 MPI processes, I get the expected 350 Gflops.
With your test code I get similar results: there is a slight slowdown going from 1 to 2 MPI processes, and a much larger slowdown going to 4, since I only have 2 GPUs. Why you’re seeing such a large slowdown I’m not sure. Granted, I’m using MPICH rather than OpenMPI, but I doubt that’s the cause.
Mat
% mpirun -machinefile machines.LINUX -np 1 test_cula.out
Process 0 of 1 took GPU: 0
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 32975.53 ms
gflops: 20.01484
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 1804.901 ms
gflops: 365.6710
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 1691.823 ms
gflops: 390.1117
error: 4.5987219E-03
% mpirun -machinefile machines.LINUX -np 2 test_cula.out
Process 0 of 2 took GPU: 0
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 32948.79 ms
gflops: 20.03108
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 1997.535 ms
gflops: 330.4072
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 1791.703 ms
gflops: 368.3646
error: 4.5987219E-03
Process 1 of 2 took GPU: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33058.83 ms
gflops: 19.96441
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 1940.003 ms
gflops: 340.2057
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 1760.786 ms
gflops: 374.8326
error: 4.5987219E-03
% mpirun -machinefile machines.LINUX -np 4 test_cula.out
Process 0 of 4 took GPU: 0
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33447.90 ms
gflops: 19.73218
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 7527.005 ms
gflops: 87.68427
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 4662.672 ms
gflops: 141.5497
error: 4.5987219E-03
Process 2 of 4 took GPU: 0
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33480.27 ms
gflops: 19.71311
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 7160.245 ms
gflops: 92.17562
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 3568.168 ms
gflops: 184.9689
error: 4.5987219E-03
Process 1 of 4 took GPU: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33291.96 ms
gflops: 19.82460
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 6522.568 ms
gflops: 101.1871
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 6788.867 ms
gflops: 97.21799
error: 4.5987219E-03
Process 3 of 4 took GPU: 1
cula + pgfortran test (matrix solve)
array size: 10000 by 10000
right hand sides: 1
starting cpu test...
runtime: 33557.15 ms
gflops: 19.66794
error: 4.8381262E-03
starting cula (host interface) test...
runtime: 8348.127 ms
gflops: 79.05965
error: 4.5987219E-03
starting cula (device interface) test...
runtime: 6695.790 ms
gflops: 98.56940
error: 4.5987219E-03
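As a closing note on Mat's diagnosis (the slowdown comes from running more MPI ranks than there are GPUs), a cheap safeguard is to compare the rank count against the device count before assigning devices. A minimal sketch with a hypothetical helper, for the single-node case only; a multi-node job would need the per-node rank count instead:

subroutine check_gpu_oversubscription(myid, numprocs)
  ! warn (or abort) when more ranks than GPUs share one node
  use cudafor
  implicit none
  include 'mpif.h'
  integer :: myid, numprocs, numdev, istat, ierr
  istat = cudaGetDeviceCount(numdev)
  if (myid == 0 .and. numprocs > numdev) then
    print *, 'warning:', numprocs, 'ranks sharing', numdev, &
             'GPUs -- expect per-rank GPU throughput to drop'
  end if
  ! or stop the job outright:
  ! if (numprocs > numdev) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
end subroutine check_gpu_oversubscription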