Using CUDA Libraries from CUDA Fortran Device Code

I want to use the cuBLAS library in device code to do a matrix multiplication. According to the manual, the cuBLAS library is callable from CUDA Fortran device code. However, I get the following error:

PGF90-S-0155-Calls from device code to a host subroutine are allowed only in emulation mode

My code is given below:

module precision

integer,parameter:: fp_kind=double
!integer,parameter:: fp_kind=single
end module precision

Program gemm_test
use precision
use cublas
integer,parameter:: m=9,n=9,k=1
real(fp_kind):: a_h(m,n),b_h(n,k),c_h(m,k)  !real(fp_kind):: a(m,n),b(n),c(m)
real(fp_kind),device:: a(m,n),b(n,k),c(m,k)

call solve_C <<<1,1>>>(c)

end program

attributes(global) subroutine solve_c(c)
use precision
use cublas_device
implicit none
integer,parameter:: m=9,n=9,k=1
real:: a(m,n),b(n,k),c(m,k)


call sgemm ('N','N',m,k,n,1.0,a,m,b,n,0.0,c,k)
end subroutine

 subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
 character*1 :: transa, transb
 integer :: m, n, k, lda, ldb, ldc
 real(4), device, dimension(lda, *) :: a ! device or host variable
 real(4), device, dimension(ldb, *) :: b ! device or host variable
 real(4), device, dimension(ldc, *) :: c ! device or host variable
 real(4), device :: alpha, beta ! device or host variable
 end subroutine

I compile it with:

pgfortran -Mcuda=cc35 gemm.cuf -lcublas_device

Hi YuanYU,

You need to call “cublasSgemm” from the device. I updated your code below with the change as well as a few other things such as putting the solve_c call into a module.

Note that you can find examples of calling cuBLAS from the device in the example directory that ships with the compilers: “$PGI/linux86-64/2017/examples/CUDA-Libraries/cuBLAS/test_blas_cuf”

Also, you can find the interface for cublasSgemm in the PGI Fortran CUDA Library Interfaces Guide.

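module precision
! Kind parameters for single and double precision
integer,parameter:: single=kind(0.0), double=kind(0.0d0)
!integer,parameter:: fp_kind=double
integer,parameter:: fp_kind=single
end module precision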

module gemm

contains

attributes(global) subroutine solve_c(a,b,c,m,n,k)
use precision
use cublas_device
implicit none
integer, value :: m,n,k
real(fp_kind):: a(m,k),b(k,n),c(m,n)
type(cublasHandle) :: h
integer :: lda, ldb, ldc
real(fp_kind), parameter :: alpha = 1.0, beta = 0.0
integer :: rc
lda = m
ldb = k
ldc = m

! SGEMM interface from the PGI Fortran CUDA Library Interfaces Guide:
! integer(4) function cublasSgemm(h, transa, transb, m, n, k, alpha, a, lda, b,
!                                 ldb, beta, c, ldc)
! type(cublasHandle) :: h
! integer :: transa, transb
! integer :: m, n, k, lda, ldb, ldc
! real(4), device, dimension(lda, *) :: a
! real(4), device, dimension(ldb, *) :: b
! real(4), device, dimension(ldc, *) :: c
! real(4), device :: alpha, beta

rc = cublasCreate(h)
rc = cublasSgemm (h,CUBLAS_OP_N, CUBLAS_OP_N,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
rc = cublasDestroy(h)
end subroutine
end module gemm

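Program gemm_test
use precision
use cublas
use gemm
integer,parameter:: m=9,n=9,k=1
real(fp_kind) :: c_h(m,n)
real(fp_kind), device:: a(m,k),b(k,n),c(m,n)
! Assumed setup (not shown in the post): values chosen so each product term is 2.0
a = 1.0
b = 2.0
c = -1.0   ! assumed sentinel: c stays -1 if the kernel or cublasSgemm fails
call solve_c <<<1,1>>>(a,b,c,m,n,k)
c_h = c    ! device-to-host copy; completes after the kernel has finished
print *, c_h(1,1)

end program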

Hope this helps,

Hi, Mat

Thank you for your kind help. I tested this code, but the kernel does not seem to work and I get a wrong result. I can’t find the bug or work out what happened. Please help me. ^_^

Best regards.

What output did you get? The results should print out “2.0”.

I did switch your m, n, and k dimensions around so they matched the cublasSgemm interface names. So if you were looking for “18.0” as the answer, set n=1 and k=9.

If you’re getting “-1”, then that means that the cublasSgemm call failed for some reason. What device are you using? You’re compiling to target a CC35 device. If it’s a P100, then you need to compile with “cc60” instead.
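To make the “2.0” versus “18.0” remark concrete, here is a minimal host-only sketch in plain Fortran that reproduces the arithmetic with the intrinsic matmul. It assumes a is filled with 1.0 and b with 2.0; the original post does not state the initial values, so these are illustrative only. The helper name expected is hypothetical.

```fortran
program gemm_expected
  implicit none
  ! m=9, n=9, k=1: each c(i,j) is a single product term
  print *, 'k=1:', expected(9, 9, 1)
  ! m=9, n=1, k=9: each c(i,j) sums nine product terms
  print *, 'k=9:', expected(9, 1, 9)
contains
  ! Hypothetical helper: returns c(1,1) of c = a*b for the given shapes
  function expected(m, n, k) result(c11)
    integer, intent(in) :: m, n, k
    real :: c11
    real :: a(m,k), b(k,n), c(m,n)
    a = 1.0   ! assumed value, not taken from the thread
    b = 2.0   ! assumed value, not taken from the thread
    c = matmul(a, b)
    c11 = c(1,1)
  end function expected
end program gemm_expected
```

With these assumed inputs, each element of c is the sum of k terms of 1.0*2.0, which gives 2.0 for k=1 and 18.0 for k=9.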


Hi, Mat

I get “-1” as the result, and when I compile with CC35 or CC60, this message appears:

ptxas info    : 'device-function-maxrregcount' is a BETA feature

I run


and obtain:

CUDA Driver Version:           8000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May  1 15:29:16 PDT 2017

Device Number:                 0
Device Name:                   GeForce GTX 960M
Device Revision Number:        5.0
Global Memory Size:            4240965632
Number of Multiprocessors:     5
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1176 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2505 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 2097152 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc50

I have just compiled it with CC50, and the right answer comes out, but many messages appear:

ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature
ptxas info    : 'device-function-maxrregcount' is a BETA feature

Maybe it is OK now, thank you very much.

Best regards


You have a Maxwell device so you need to use “-Mcuda=cc50”. Note that by default “-Mcuda” by itself will target CC30, CC35, and CC50.

Device Name: GeForce GTX 960M
Device Revision Number: 5.0

PGI Compiler Option: -ta=tesla:cc50
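The compute capability shown in the device listing can also be queried from a program, which helps when choosing the “-Mcuda=ccXY” value. The following is a small sketch using cudaGetDeviceProperties from the cudafor module; it assumes device number 0 is the GPU of interest, and the printed flag is only a suggestion derived from the reported major and minor revision.

```fortran
program query_cc
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! Assumption: device 0 is the GPU the kernel will run on
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) then
    print *, 'cudaGetDeviceProperties failed: ', trim(cudaGetErrorString(istat))
    stop 1
  end if

  print '(a,a)', 'Device name: ', trim(prop%name)
  print '(a,i0,a,i0)', 'Compute capability: ', prop%major, '.', prop%minor
  ! Build the matching compiler flag from the major/minor revision
  print '(a,i0,i0)', 'Suggested flag: -Mcuda=cc', prop%major, prop%minor
end program query_cc
```

On the GTX 960M above, which reports revision 5.0, this should suggest cc50, matching the flag that produced the correct result.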

OK, that’s great! Thank you very much.