Using CUDA Libraries from CUDA Fortran Device Code

I want to use the cuBLAS library in device code to do a matrix multiplication. From the manual http://www.pgroup.com/doc/pgi17cudaint.pdf, I understand that the cuBLAS library is callable from CUDA Fortran device code. However, I get the following error:

PGF90-S-0155-Calls from device code to a host subroutine are allowed only in emulation mode

My code is given below:

module precision
Integer,parameter::single=kind(0.0)
integer,parameter::double=kind(0.0d0)

integer,parameter:: fp_kind=double
!integer,parameter:: fp_kind=single
end module precision

Program gemm_test
use precision
use cublas
integer,parameter:: m=9,n=9,k=1
real(fp_kind):: a_h(m,n),b_h(n,k),c_h(m,k)  !real(fp_kind):: a(m,n),b(n),c(m)
real(fp_kind),device:: a(m,n),b(n,k),c(m,k)

call solve_C <<<1,1>>>(c)
c_h=c
print*,c_h

end program

attributes(global) subroutine solve_c(c)
use precision
use cublas_device
implicit none
integer,parameter:: m=9,n=9,k=1
real:: a(m,n),b(n,k),c(m,k)

a=1
b=2

call sgemm ('N','N',m,k,n,1.0,a,m,b,n,0.0,c,k)
return
end subroutine

 subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
 character*1 :: transa, transb
 integer :: m, n, k, lda, ldb, ldc
 real(4), device, dimension(lda, *) :: a ! device or host variable
 real(4), device, dimension(ldb, *) :: b ! device or host variable
 real(4), device, dimension(ldc, *) :: c ! device or host variable
 real(4), device :: alpha, beta ! device or host variable
 end subroutine

I compile it with:

pgfortran -Mcuda=cc35 gemm.cuf -lcublas_device

Hi YuanYU,

You need to call “cublasSgemm” from the device. I updated your code below with that change as well as a few other things, such as putting the solve_c kernel into a module.

Note that you can find examples of calling cuBLAS from the device in the example directory that ships with the compilers: “$PGI/linux86-64/2017/examples/CUDA-Libraries/cuBLAS/test_blas_cuf”

Also, you can find the interface for cublasSgemm in the PGI Fortran CUDA Library Interfaces Guide’s section 2.6.1.30 (See: http://www.pgroup.com/doc/pgi17cudaint.pdf).

module precision
Integer,parameter::single=kind(0.0)
integer,parameter::double=kind(0.0d0)

!integer,parameter:: fp_kind=double
integer,parameter:: fp_kind=single
end module precision

module gemm

contains

attributes(global) subroutine solve_c(a,b,c,m,n,k)
use precision
use cublas_device
implicit none
integer, value :: m,n,k
real(fp_kind):: a(m,k),b(k,n),c(m,n)
type(cublasHandle) :: h
integer :: lda, ldb, ldc
real(fp_kind), parameter :: alpha = 1.0, beta = 0.0
integer :: rc
lda = m
ldb = k
ldc = m

!SGEMM Interface from Section 2.6.1.30 of
! http://www.pgroup.com/doc/pgi17cudaint.pdf
! integer(4) function cublasSgemm(h, transa, transb, m, n, k, alpha, a, lda, b,
!                                 ldb, beta, c, ldc)
! type(cublasHandle) :: h
! integer :: transa, transb
! integer :: m, n, k, lda, ldb, ldc
! real(4), device, dimension(lda, *) :: a
! real(4), device, dimension(ldb, *) :: b
! real(4), device, dimension(ldc, *) :: c
! real(4), device :: alpha, beta

rc = cublasCreate(h)
rc = cublasSgemm (h,CUBLAS_OP_N, CUBLAS_OP_N,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
rc = cublasDestroy(h)
return
end subroutine
end module gemm


Program gemm_test
use precision
use cublas
use gemm
integer,parameter:: m=9,n=9,k=1
real(fp_kind) :: c_h(m,n)
real(fp_kind), device:: a(m,k),b(k,n),c(m,n)
a=1
b=2
c=-1
call solve_C <<<1,1>>>(a,b,c,m,n,k)
c_h=c
print*,c_h

end program
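
A compile line like the one you already used should work here too (assuming your GPU really is a cc35-class device), for example:

pgfortran -Mcuda=cc35 gemm.cuf -lcublas_device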

Hope this helps,
Mat

Hi, Mat

Thank you for your kind help. I tested this code, but the kernel does not seem to work and the result is wrong. I can’t find the bug or see what went wrong. Please help me. ^_^

Best regards,
Yuan



What output did you get? The results should print out “2.0”.

I did switch your m, n, and k dimensions around so they matched the cublasSgemm interface names. So if you were looking for “18.0” as the answer, set n=1 and k=9.
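
For example, the only host-side change needed for that case is the parameter line (a minimal sketch; everything else stays the same):

integer,parameter:: m=9,n=1,k=9   ! each c(i,j) is now a sum of 9 products 1.0*2.0, i.e. 18.0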

If you’re getting “-1”, then that means that the cublasSgemm call failed for some reason. What device are you using? You’re compiling to target a CC35 device. If it’s a P100, then you need to compile with “cc60” instead.
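
If you want to see the actual status, here is a minimal sketch of a check you could add inside the kernel right after the call (cuBLAS returns 0, i.e. CUBLAS_STATUS_SUCCESS, on success, and CUDA Fortran allows a simple print * from device code):

rc = cublasSgemm (h,CUBLAS_OP_N,CUBLAS_OP_N,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
if (rc /= 0) print *, 'cublasSgemm returned status ', rc   ! 0 means CUBLAS_STATUS_SUCCESS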

-Mat

Hi, Mat

I get “-1” as the result, and when I compile with cc35 or cc60, this message comes out:

ptxas info    : 'device-function-maxrregcount' is a BETA feature

I run

 pgaccelinfo

and obtain:

CUDA Driver Version:           8000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May  1 15:29:16 PDT 2017

Device Number:                 0
Device Name:                   GeForce GTX 960M
Device Revision Number:        5.0
Global Memory Size:            4240965632
Number of Multiprocessors:     5
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1176 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2505 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 2097152 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc50

I have just compiled it with cc50 and the right answer comes out, but many messages appear:

ptxas info    : 'device-function-maxrregcount' is a BETA feature
(the same ptxas message is printed 10 times)

Maybe it is OK now. Thank you very much.

Best regards
Yuan


You have a Maxwell device so you need to use “-Mcuda=cc50”. Note that by default “-Mcuda” by itself will target CC30, CC35, and CC50.

Device Name: GeForce GTX 960M
Device Revision Number: 5.0

PGI Compiler Option: -ta=tesla:cc50
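
So for CUDA Fortran the compile line (same as yours, just targeting cc50) would be, for example:

pgfortran -Mcuda=cc50 gemm.cuf -lcublas_device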

OK, that’s great! Thank you very much.