How to use cublasSgemm_v2?

Hi, all

I'm trying to call cuBLAS from device kernels and put together the simple example below, but the PVF 14.7 compiler gives this error message:

"Unresolved extern function ‘cublasSgemm_v2’ "

I have tried linking with cublas.lib and cublas_device.lib, but in vain. Which library should I use?
Thanks in advance.

cy

      program prog
      implicit none
      real, device :: dA(4,2), dB(4,4)
      real :: A(4,2), B(4,4)
      
      A(:,1)=1.
      A(:,2)=0.5
      
      dA=A      
      call testKernelCallCublas<<<1,16>>>(dA,4,2,dB)      
      B=dB
      
      print*,B(:,1)
      print*,B(:,2)
      print*,B(:,3)
      print*,B(:,4)

      end program prog

      attributes(global) subroutine testKernelCallCublas(AA,N,M,BB)
      use cublas_device
      implicit none
      integer, value :: N, M
      real, device :: AA(N,M), BB(N,N)
      integer, device :: istat, transa, transb
      type(cublasHandle) :: handle
      
      if( threadidx%x==1 )then
        transa=0
        transb=1
        istat= cublasSgemm_v2(handle, transa, transb,4,4,2,1.0,AA,4,AA,4,0.,BB,4)        
      end if
      
      end subroutine testKernelCallCublas


Hi Cy,

We didn’t support the V2 interfaces until the 15.9 release, and they have since become the default for devices with compute capability 3.5 or higher (i.e. there is no need to explicitly call “_v2”). You will also need a CC 3.5 device, such as a Tesla K40, to use these interfaces.

For example, using 16.10:

PGI$ cat testv2.cuf
module testBLAS
    use cublas_device
    contains

      attributes(global) subroutine testKernelCallCublas(AA,N,M,BB)
       implicit none
       integer, value :: N, M
       real, device :: AA(N,M), BB(N,N)
       integer, device :: istat, transa, transb
       type(cublasHandle) :: handle

       if( threadidx%x==1 )then
         istat = cublasCreate(handle)
         transa=0
         transb=1
         istat= cublasSgemm(handle,transa,transb,4,4,2,1.0,AA,4,AA,4,0.,BB,4)

         istat = cublasDestroy(handle)
       end if

       end subroutine testKernelCallCublas

end module testBLAS

       program prog
       use testBLAS
       implicit none
       real, device :: dA(4,2), dB(4,4)
       real :: A(4,2), B(4,4)

       A(:,1)=1.
       A(:,2)=0.5

       dA=A
       call testKernelCallCublas<<<1,16>>>(dA,4,2,dB)
       B=dB

       print*,B(:,1)
       print*,B(:,2)
       print*,B(:,3)
       print*,B(:,4)

       end program prog

PGI$ pgfortran -Mcuda=cc35 -defaultlib=cublas_device -fast testv2.cuf
PGI$ ./testv2.exe
    1.250000        1.250000        1.250000        1.250000
    1.250000        1.250000        1.250000        1.250000
    1.250000        1.250000        1.250000        1.250000
    1.250000        1.250000        1.250000        1.250000

Hope this helps,
Mat

Hi Mat,

I appreciate your kind suggestion. I have a GTX 780 Ti (CC 3.5, on Linux) and a GTX 850M (CC 5.0, in a laptop). Can I use them with the PGI Community Edition to develop CUDA code, or is a Tesla-series GPU needed?
Thank you very much.

CY

Hi Cy,

Yes, you can use both cards with the Community Edition on Linux. I wouldn’t recommend GTX cards for production runs since they don’t have error-correcting memory, but for development they should be fine. Windows can be problematic since GeForce cards are limited to the WDDM driver there (only Tesla/Quadro parts can run in TCC mode), but this isn’t an issue on Linux.
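
If you want to double-check what the CUDA runtime reports for each card, a minimal sketch along these lines (my own, untested, using the standard cudafor device-query calls) lists every visible device with its compute capability; device-side cuBLAS needs CC 3.5 or higher:

       program listdevices
       use cudafor
       implicit none
       type(cudaDeviceProp) :: prop
       integer :: istat, ndev, i

       istat = cudaGetDeviceCount(ndev)
       do i = 0, ndev-1
         istat = cudaGetDeviceProperties(prop, i)
         ! prop%name is a C-style string; cut it at the first null byte
         print "(a,i0,a,i0,a,i0)", "Device ", i, &
               ": "//prop%name(1:max(index(prop%name,char(0))-1,0))//"  CC ", &
               prop%major, ".", prop%minor
       end do
       end program listdevices

Compile with pgfortran -Mcuda and run it on each system.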

  • Mat

Hi Mat,

Thanks for your reply. Could you recommend some material on ‘error-correcting memory’? Is it the same as ECC (Error Correction Code)? And how can I figure out the effect of ECC on the results?

CY

"Is it the same as ECC (Error Correction Code)?"

Yes, I mean ECC.

"And how can I figure out the effect of ECC on the results?"

With ECC enabled, the rare but possible random bit flip is caught and corrected before it affects your results. For graphics it doesn’t matter if a few pixels are off for a few frames, but it can matter for scientific applications.
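
If you want to see whether ECC is actually enabled on a given board, here is a minimal sketch, assuming the ECCEnabled field of cudaDeviceProp is exposed by your cudafor version (GeForce parts will normally report 0):

       program checkecc
       use cudafor
       implicit none
       type(cudaDeviceProp) :: prop
       integer :: istat

       ! query device 0; ECCEnabled should be 1 when ECC is active, 0 otherwise
       istat = cudaGetDeviceProperties(prop, 0)
       if( prop%ECCEnabled /= 0 )then
         print*, "ECC is enabled on device 0"
       else
         print*, "ECC is disabled (or not supported) on device 0"
       end if
       end program checkecc

On boards that support ECC, nvidia-smi -q also reports the current ECC mode.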

  • Mat

Hi Mat,

I changed dA and dB to allocatable arrays and found something weird: after allocating the device arrays, VIRT jumps abruptly to 63,400,828 KB!

  • [root@manysplendid ~]# ps aux | grep testCublas.x
    root 1794 0.0 0.0 246412 7656 pts/5 S+ 13:37 0:00 ./testCublas.x
    root 1796 0.0 0.0 105452 932 pts/4 S+ 13:37 0:00 grep testCublas.x
    [root@manysplendid ~]# ps aux | grep testCublas.x
    root 1794 2.5 0.6 63400828 199280 pts/5 Sl+ 13:37 0:00 ./testCublas.x
    root 1800 0.0 0.0 105452 932 pts/4 S+ 13:37 0:00 grep testCublas.x
    [root@manysplendid ~]#

The PGI version is 16.10, the GPU is a GTX 780 Ti, and the OS is CentOS 6.6.
Following are the code and compile command.
Is there something wrong?
Thank you

CY

module testBLAS 
    use cublas_device 
    contains 

      attributes(global) subroutine testKernelCallCublas(AA,N,M,BB) 
       implicit none 
       integer, value :: N, M 
       real, device :: AA(N,M), BB(N,N) 
       integer, device :: istat, transa, transb 
       type(cublasHandle) :: handle 

       if( threadidx%x==1 )then 
         istat = cublasCreate(handle) 
         transa=0 
         transb=1 
         istat= cublasSgemm(handle,transa,transb,4,4,2,1.0,AA,4,AA,4,0.,BB,4) 

         istat = cublasDestroy(handle) 
       end if 

       end subroutine testKernelCallCublas 

end module testBLAS 

       program prog 
       use testBLAS 
       implicit none 
       real, allocatable, device :: dA(:,:), dB(:,:) 
       real :: A(4,2), B(4,4) 

       A(:,1)=1. 
       A(:,2)=0.5 
pause
       allocate(dA(4,2), dB(4,4))
print*, "after allocating"
pause
       dA=A 
       call testKernelCallCublas<<<1,16>>>(dA,4,2,dB) 
       B=dB 

       print*,B(:,1) 
       print*,B(:,2) 
       print*,B(:,3) 
       print*,B(:,4) 

       deallocate(dA, dB)
       end program prog



  • [root@manysplendid cublas]# pgfortran -Mcuda=cc3.5 -lcublas_device test2.cuf -o testCublas.x
    nvlink warning : SM Arch (‘sm_35’) not found in ‘/opt/pgi/linux86-64/2016/cuda/7.0/lib64/libcublas_device.a:maxwell_sgemm.asm.o’
    nvlink warning : SM Arch (‘sm_35’) not found in ‘/opt/pgi/linux86-64/2016/cuda/7.0/lib64/libcublas_device.a:maxwell_sm50_sgemm.o’
    nvlink warning : SM Arch (‘sm_35’) not found in ‘/opt/pgi/linux86-64/2016/cuda/7.0/lib64/libcublas_device.a:maxwell_sm50_ssyrk.o’
    [root@manysplendid cublas]# ./testCublas.x
    FORTRAN PAUSE: enter or d to continue>
    after allocating
    FORTRAN PAUSE: enter or d to continue>
    1.250000 1.250000 1.250000 1.250000
    1.250000 1.250000 1.250000 1.250000
    1.250000 1.250000 1.250000 1.250000
    1.250000 1.250000 1.250000 1.250000
    [root@manysplendid cublas]#

Hi CY,

Sorry, but I don’t know where the extra memory usage is coming from. My best guess is that cuBLAS adds some extra space when the CUDA context is created, which happens when you allocate your device arrays (the context is created at first use of the device).
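
One way to separate actual device-memory use from host virtual address space is to query the device directly. A minimal sketch, assuming the standard cudaMemGetInfo interface from cudafor:

       program memcheck
       use cudafor
       implicit none
       integer(kind=cuda_count_kind) :: free, total
       real, allocatable, device :: dA(:,:), dB(:,:)
       integer :: istat

       ! the first device allocation triggers context creation
       allocate(dA(4,2), dB(4,4))
       ! free/total device memory, reported in bytes
       istat = cudaMemGetInfo(free, total)
       print*, "device memory free/total (MB):", free/2**20, total/2**20
       deallocate(dA, dB)
       end program memcheck

The VIRT figure from ps includes whatever virtual address space the driver reserves when the context comes up, which can dwarf the few bytes these arrays actually occupy on the device.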

  • Mat

Hi Mat,

Thanks for your response. I removed all the code related to cuBLAS and compiled with -Mcuda.

[root@manysplendid cublas]# cat test3.f90
       program prog
       implicit none
       real, allocatable, device :: dA(:,:), dB(:,:)
       real :: A(4,2), B(4,4)

       A(:,1)=1.
       A(:,2)=0.5
pause
       allocate(dA(4,2), dB(4,4))
print*, "after allocating"
pause
       dA=A
       B=dB

       print*,B(:,1)
       print*,B(:,2)
       print*,B(:,3)
       print*,B(:,4)

       deallocate(dA, dB)
       end program prog

[root@manysplendid cublas]# pgf90 -Mcuda test3.f90 -o test3.x
[root@manysplendid cublas]# ./test3.x

However, I see the same behavior… @@

  • [root@manysplendid ~]# ps aux | grep test3.x
    root 27246 1.0 0.3 63349356 116948 pts/0 Sl+ 07:03 0:00 ./test3.x
    root 27367 0.0 0.0 105452 936 pts/2 S+ 07:04 0:00 grep test3.x
    [root@manysplendid ~]#

CY

It looks like it’s the context creation, which happens upon first use of the device; in this case that is the allocation of the arrays. If you add some device code before the allocation (for example, the sketch below), you’ll see the virtual memory show up at that point instead.
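
A minimal sketch based on your test3.f90 (untested here; cudaDeviceSynchronize is just one convenient way to force the context into existence):

       program prog
       use cudafor
       implicit none
       real, allocatable, device :: dA(:,:), dB(:,:)
       integer :: istat

       ! touch the device first, so the CUDA context (and its large
       ! virtual-memory footprint) is created here...
       istat = cudaDeviceSynchronize()
pause
       ! ...and not at the first device allocation below
       allocate(dA(4,2), dB(4,4))
print*, "after allocating"
pause
       deallocate(dA, dB)
       end program prog

With this change, the jump in VIRT should show up at the synchronize call rather than at the allocate.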

  • Mat

Hi Mat,

Yes…
I added ‘j=cudaGetDeviceCount(i)’ before allocating the device arrays, and VIRT jumped to 63,163,476 at that point. Thank you. But this didn’t happen with previous versions of the compiler… is that normal?

CY

Hi Cy,

I spot-checked PGI versions all the way back to 12.10 and see the same behavior.

I don’t have a way to roll back the CUDA driver version, so I can’t test that, but since context creation is done by the driver, different driver versions could reserve differing amounts of virtual memory.
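
If it helps to track this across driver installs, you can print the driver and runtime versions from CUDA Fortran; a minimal sketch using the standard version-management calls:

       program versions
       use cudafor
       implicit none
       integer :: istat, idriver, iruntime

       ! both values are encoded as 1000*major + 10*minor, e.g. 7000 = CUDA 7.0
       istat = cudaDriverGetVersion(idriver)
       istat = cudaRuntimeGetVersion(iruntime)
       print*, "CUDA driver version:  ", idriver
       print*, "CUDA runtime version: ", iruntime
       end program versions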

  • Mat