Hello,
I am testing the new cuBLAS function 'cublasZgetrfBatched' from CUDA Fortran.
I have a question about how to build an array of pointers with the device attribute.
This function performs the LU factorization of each Aarray[i], for i = 0, ..., batchSize-1.
According to the spec below, Aarray is an array of pointers to matrices stored in column-major format with dimensions n x n. So Aarray is an array of pointers (pointer-to-pointer), as in C.
cublasStatus_t cublasZgetrfBatched(cublasHandle_t handle,
                                   int n,
                                   cuDoubleComplex *Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);
I wrote the following simple test program as a reference.
I don't know how to set up an array (Aarray_dev) of pointers with the device attribute.
Could you please tell me how? A corrected example would be much appreciated.
(GPU card: K20 / CUDA 5.5 / PGI 13.10 / CentOS 6.4)
module Zgetr_cublas
  use cudafor
  interface
    integer(c_int) function cublas_zgetrf(h, n, Aarray, lda, ipvt, info, ibatch) &
        bind(c, name='cublasZgetrfBatched')
      use iso_c_binding
      use cublas
      type(cublasHandle)      :: h
      integer(c_int), value   :: n, lda, ibatch
      complex(c_double_complex), device, pointer :: Aarray(:)
      !complex(c_double_complex), device, pointer :: Aarray ! test for ibatch=1
      integer(c_int), device  :: ipvt(:)
      integer(c_int), device  :: info(:)
    end function cublas_zgetrf
  end interface
end module Zgetr_cublas
program test
  use cudafor
  use cublas
  use Zgetr_cublas
  use, intrinsic :: iso_c_binding
  implicit none
  !.. Parameters ..
  integer, parameter :: n = 4
  integer, parameter :: lda = n, ldb = n
  integer, parameter :: row = n, col = n
  integer, parameter :: ndmax = 1          ! = ibatch
  integer :: matrixsize
  !.. Local ..
  integer :: i, j, k, istat
  !.. Input matrix on the host ..
  complex(8), dimension(:,:,:), allocatable :: a
  !.. Device data ..
  complex(8), device, pointer, dimension(:) :: Aarray_dev
  !complex(8), device, pointer :: Aarray_dev ! test for ibatch=1
  complex(8), device, target, dimension(:,:,:), allocatable :: a_d
  !
  !.. Handle for the cuBLAS function ..
  type(cublasHandle) :: h1
  integer, parameter :: ibatch = ndmax
  integer :: ipvt(n*ibatch)
  integer :: info(ibatch)
  integer, device :: ipvt_d(n*ibatch)
  integer, device :: info_d(ibatch)
  ! Matrix size
  matrixsize = col*row
  print *, "size of variable a", sizeof(a(1,1,1)), "bytes"
  print *, "size of matrix (row x col)", matrixsize
  ! Allocate a on the host and a_d on the device
  allocate (a(row,col,ndmax))
  allocate (a_d(row,col,ndmax))
  ! Generate the input matrices
  do k = 1, ndmax
    a(1,1,k) = (-1.34d0, 2.55d0)
    a(1,2,k) = ( 0.28d0, 3.17d0)
    a(1,3,k) = (-6.39d0,-2.20d0)
    a(1,4,k) = ( 0.72d0,-0.92d0)
    a(2,1,k) = (-0.17d0,-1.41d0)
    a(2,2,k) = ( 3.31d0,-0.15d0)
    a(2,3,k) = (-0.15d0, 1.34d0)
    a(2,4,k) = ( 1.29d0, 1.38d0)
    a(3,1,k) = (-3.29d0,-2.39d0)
    a(3,2,k) = (-1.91d0, 4.42d0)
    a(3,3,k) = (-0.14d0,-1.35d0)
    a(3,4,k) = ( 1.72d0, 1.35d0)
    a(4,1,k) = ( 2.41d0, 0.39d0)
    a(4,2,k) = (-0.56d0, 1.47d0)
    a(4,3,k) = (-0.83d0,-0.69d0)
    a(4,4,k) = (-1.96d0, 0.67d0)
  end do
  !a(1,1,2) = (-2.00d0, 2.55d0)
  !a(1,1,3) = (-3.00d0, 2.55d0)
  print *, 'Input'
  do k = 1, ndmax
    print *, "Batch no. =", k
    print '(1X,8D12.3)', ((a(i,j,k), j=1,col), i=1,row)
  end do
  ! Zero a_d, then copy from host to device
  istat = cudaMemset(a_d, (0.0d0,0.0d0), matrixsize*ndmax)
  !a_d = (0.0d0,0.0d0)
  a_d = a
  ! Build an array of pointers
  Aarray_dev => a_d(1,1,:) ! runtime error: device array section argument is not sequential
  !Aarray_dev => a_d(1,1,1) ! test for ibatch=1
  ipvt_d = ipvt
  info_d = info
  ! Call cuBLAS
  istat = cublasCreate(h1)
  !h1 = cublasGetHandle()
  istat = cublas_zgetrf(h1, n, Aarray_dev, lda, ipvt_d, info_d, ibatch)
  print *, 'error code = ', istat, cudaGetErrorString( cudaGetLastError() )
  istat = cublasDestroy(h1)
  ! Clear a() so the result copy can be verified
  a = (0.0d0, 0.0d0)
  ! Copy from device to host
  a = a_d
  print *, 'result'
  ! Print details of the factorization
  do k = 1, ndmax
    print *, "Batch no. =", k
    print '(1X,8D12.3)', ((a(i,j,k), j=1,col), i=1,row)
  end do
  ! Print pivot indices
  !print *, 'Pivot indices'
  !print *, (ipvt(i), i=1,n)
  ! Reset the GPU device
  istat = cudaDeviceReset()
  print *, 'error code = ', cudaGetErrorString( cudaGetLastError() )
end program test
The compilation is successful, but a runtime error occurs as follows.
$ pgf90 -Mcuda:5.5 f.f90 -lcublas
$ a.out
size of variable a 16 bytes
size of matrix 16
Input
Batch no. = 1
-0.134D+01 0.255D+01 0.280D+00 0.317D+01 -0.639D+01 -0.220D+01 0.720D+00 -0.920D+00
-0.170D+00 -0.141D+01 0.331D+01 -0.150D+00 -0.150D+00 0.134D+01 0.129D+01 0.138D+01
-0.329D+01 -0.239D+01 -0.191D+01 0.442D+01 -0.140D+00 -0.135D+01 0.172D+01 0.135D+01
0.241D+01 0.390D+00 -0.560D+00 0.147D+01 -0.830D+00 -0.690D+00 -0.196D+01 0.670D+00
0: f.f90, line 105: device array section argument is not sequential
Line 105 is the part that sets the pointer, shown below.
! build an array of pointers
Aarray_dev => a_d(1,1,:)
This error might be due to a restriction on Fortran pointer operations on the device side.
I would like to know any other method to solve it.
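For what it's worth, one direction I have been considering (completely untested; it assumes the type(c_devptr) derived type and the c_devloc() function, which I believe the cudafor module provides) is to build the pointer array explicitly instead of using a Fortran device pointer:

```fortran
! Untested sketch: build the array of device pointers explicitly.
! Assumes type(c_devptr) and c_devloc() from the cudafor module.
type(c_devptr)         :: Aptr_h(ndmax)   ! host copy of the pointer array
type(c_devptr), device :: Aptr_d(ndmax)   ! device copy passed to cuBLAS

do k = 1, ndmax
  Aptr_h(k) = c_devloc(a_d(1,1,k))        ! device address of matrix k
end do
Aptr_d = Aptr_h                           ! copy the pointer array to the device

! The interface would then take the pointer array as, e.g.:
!   type(c_devptr), device :: Aarray(*)
```

I have no idea whether this matches how the compiler actually passes the argument, so please correct me if this is wrong.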
As a workaround, I applied the changes below to avoid the runtime error; that is, I restricted the program to batchSize = 1.
!complex(8), device, pointer, dimension(:) :: Aarray_dev
complex(8), device, pointer :: Aarray_dev ! test for batchsize=1
---
!complex(c_double_complex), device, pointer, dimension(:) :: Aarray(:)
complex(c_double_complex), device, pointer :: Aarray
---
!Aarray_dev => a_d(1,1,:) ! runtime error: device array section argument is not sequential
Aarray_dev => a_d(1,1,1) ! test for batchsize=1
The execution looks normal, but the cuBLAS function is never actually called. I can see this with the profiler nvprof: the symbol "cublasZgetrfBatched" does not appear in the profiling list. I don't know why.
$ pgf90 -Mcuda:5.5 f.f90 -lcublas
$ nvprof a.out
==26712== NVPROF is profiling process 26712, command: a.out
size of variable a 16 bytes
size of matrix 16
Input
Batch no. = 1
-0.134D+01 0.255D+01 0.280D+00 0.317D+01 -0.639D+01 -0.220D+01 0.720D+00 -0.920D+00
-0.170D+00 -0.141D+01 0.331D+01 -0.150D+00 -0.150D+00 0.134D+01 0.129D+01 0.138D+01
-0.329D+01 -0.239D+01 -0.191D+01 0.442D+01 -0.140D+00 -0.135D+01 0.172D+01 0.135D+01
0.241D+01 0.390D+00 -0.560D+00 0.147D+01 -0.830D+00 -0.690D+00 -0.196D+01 0.670D+00
error code = 0
no error
result
Batch no. = 1
-0.134D+01 0.255D+01 0.280D+00 0.317D+01 -0.639D+01 -0.220D+01 0.720D+00 -0.920D+00
-0.170D+00 -0.141D+01 0.331D+01 -0.150D+00 -0.150D+00 0.134D+01 0.129D+01 0.138D+01
-0.329D+01 -0.239D+01 -0.191D+01 0.442D+01 -0.140D+00 -0.135D+01 0.172D+01 0.135D+01
0.241D+01 0.390D+00 -0.560D+00 0.147D+01 -0.830D+00 -0.690D+00 -0.196D+01 0.670D+00
error code =
no error
==26712== Profiling application: a.out
==26712== Profiling result:
Time(%) Time Calls Avg Min Max Name
59.91% 8.5120us 7 1.2160us 1.0880us 1.4720us [CUDA memcpy HtoD]
23.87% 3.3920us 1 3.3920us 3.3920us 3.3920us [CUDA memcpy DtoH]
16.22% 2.3040us 1 2.3040us 2.3040us 2.3040us __pgi_dev_cumemset_8
Thanks in advance for any help anybody could provide.
Best regards,
Kato