How to use cublasSgemm_v2?

Hi, all

I'm trying to call cuBLAS from device kernels and put together the simple example below, but the PVF 14.7 compiler gives this error message:

"Unresolved extern function ‘cublasSgemm_v2’ "

I have tried linking with cublas.lib and cublas_device.lib, but in vain. Which library should I use?
Thanks in advance.

cy

      program prog
      implicit none
      real, device :: dA(4,2), dB(4,4)
      real :: A(4,2), B(4,4)
      
      A(:,1)=1.
      A(:,2)=0.5
      
      dA=A      
      call testKernelCallCublas<<<1,16>>>(dA,4,2,dB)      
      B=dB
      
      print*,B(:,1)
      print*,B(:,2)
      print*,B(:,3)
      print*,B(:,4)

      end program prog

      attributes(global) subroutine testKernelCallCublas(AA,N,M,BB)
      use cublas_device
      implicit none
      integer, value :: N, M
      real, device :: AA(N,M), BB(N,N)
      integer, device :: istat, transa, transb
      type(cublasHandle) :: handle
      
      if( threadidx%x==1 )then
        transa=0
        transb=1
        istat= cublasSgemm_v2(handle, transa, transb,4,4,2,1.0,AA,4,AA,4,0.,BB,4)        
      end if
      
      end subroutine testKernelCallCublas


Hi Cy,

We didn’t support the V2 interfaces until the 15.9 release, and they have since become the default for devices with compute capability 3.5 or higher (i.e. there is no need to explicitly call “_v2”). You will also need a CC 3.5 device, such as a Tesla K40, to use these interfaces.

For example, using 16.10:

PGI$ cat testv2.cuf
module testBLAS
    use cublas_device
    contains

      attributes(global) subroutine testKernelCallCublas(AA,N,M,BB)
       implicit none
       integer, value :: N, M
       real, device :: AA(N,M), BB(N,N)
       integer, device :: istat, transa, transb
       type(cublasHandle) :: handle

       if( threadidx%x==1 )then
         istat = cublasCreate(handle)
         transa=0
         transb=1
         istat= cublasSgemm(handle,transa,transb,4,4,2,1.0,AA,4,AA,4,0.,BB,4)

         istat = cublasDestroy(handle)
       end if

       end subroutine testKernelCallCublas

end module testBLAS

       program prog
       use testBLAS
       implicit none
       real, device :: dA(4,2), dB(4,4)
       real :: A(4,2), B(4,4)

       A(:,1)=1.
       A(:,2)=0.5

       dA=A
       call testKernelCallCublas<<<1,16>>>(dA,4,2,dB)
       B=dB

       print*,B(:,1)
       print*,B(:,2)
       print*,B(:,3)
       print*,B(:,4)

       end program prog

PGI$ pgfortran -Mcuda=cc35 -defaultlib=cublas_device -fast testv2.cuf
PGI$ ./testv2.exe
    1.250000        1.250000        1.250000        1.250000
    1.250000        1.250000        1.250000        1.250000
    1.250000        1.250000        1.250000        1.250000
    1.250000        1.250000        1.250000        1.250000

Hope this helps,
Mat

Hi Mat,

I appreciate your kind suggestion. I have a GTX 780 Ti (CC 3.5, on Linux) and a GTX 850M (CC 5.0, in a laptop). Can I use them with the PGI Community Edition to develop CUDA code, or is a Tesla-series GPU needed?
Thank you very much.

CY

Hi Cy,

Yes, you can use both cards with the Community Edition on Linux. I wouldn’t recommend GTX cards for production runs since they don’t have error-correcting memory, but for development they should be fine. Windows can be problematic since GeForce cards are limited to the WDDM driver there (only Tesla/Quadro parts can run in TCC mode), but this isn’t an issue on Linux.
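
If you want to double-check what the CUDA runtime reports for each card, a minimal sketch along these lines (my own, untested, using the standard cudafor device-query calls) lists every visible device with its compute capability; device-side cuBLAS needs CC 3.5 or higher:

       program listdevices
       use cudafor
       implicit none
       type(cudaDeviceProp) :: prop
       integer :: istat, ndev, i

       istat = cudaGetDeviceCount(ndev)
       do i = 0, ndev-1
         istat = cudaGetDeviceProperties(prop, i)
         ! prop%name is a C-style string; cut it at the first null byte
         print "(a,i0,a,i0,a,i0)", "Device ", i, &
               ": "//prop%name(1:max(index(prop%name,char(0))-1,0))//"  CC ", &
               prop%major, ".", prop%minor
       end do
       end program listdevices

Compile with pgfortran -Mcuda and run it on each system.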

  • Mat

Hi Mat,

Thanks for your reply. Could you recommend some material on ‘error-correcting memory’? Is it the same as ECC (Error Correction Code)? And how can I figure out the effect of ECC on the results?

CY

"Is it the same as ECC (Error Correction Code)?"

Yes, I mean ECC.

"And how can I figure out the effect of ECC on the results?"

With ECC enabled, the rare but possible random bit flip is caught and corrected before it affects your results. For graphics it doesn’t matter if a few pixels are off for a few frames, but it can matter for scientific applications.
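
If you want to see whether ECC is actually enabled on a given board, here is a minimal sketch, assuming the ECCEnabled field of cudaDeviceProp is exposed by your cudafor version (GeForce parts will normally report 0):

       program checkecc
       use cudafor
       implicit none
       type(cudaDeviceProp) :: prop
       integer :: istat

       ! query device 0; ECCEnabled should be 1 when ECC is active, 0 otherwise
       istat = cudaGetDeviceProperties(prop, 0)
       if( prop%ECCEnabled /= 0 )then
         print*, "ECC is enabled on device 0"
       else
         print*, "ECC is disabled (or not supported) on device 0"
       end if
       end program checkecc

On boards that support ECC, nvidia-smi -q also reports the current ECC mode.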

  • Mat

Hi Mat,

I changed dA and dB to allocatable arrays and found something weird: after allocating the device arrays, VIRT jumps abruptly to 63,400,828 KB!

  • [root@manysplendid ~]# ps aux | grep testCublas.x
    root 1794 0.0 0.0 246412 7656 pts/5 S+ 13:37 0:00 ./testCublas.x
    root 1796 0.0 0.0 105452 932 pts/4 S+ 13:37 0:00 grep testCublas.x
    [root@manysplendid ~]# ps aux | grep testCublas.x
    root 1794 2.5 0.6 63400828 199280 pts/5 Sl+ 13:37 0:00 ./testCublas.x
    root 1800 0.0 0.0 105452 932 pts/4 S+ 13:37 0:00 grep testCublas.x
    [root@manysplendid ~]#

The PGI version is 16.10, the GPU is a GTX 780 Ti, and the OS is CentOS 6.6.
Following are the code and compile command.
Is there something wrong?
Thank you

CY

module testBLAS 
    use cublas_device 
    contains 

      attributes(global) subroutine testKernelCallCublas(AA,N,M,BB) 
       implicit none 
       integer, value :: N, M 
       real, device :: AA(N,M), BB(N,N) 
       integer, device :: istat, transa, transb 
       type(cublasHandle) :: handle 

       if( threadidx%x==1 )then 
         istat = cublasCreate(handle) 
         transa=0 
         transb=1 
         istat= cublasSgemm(handle,transa,transb,4,4,2,1.0,AA,4,AA,4,0.,BB,4) 

         istat = cublasDestroy(handle) 
       end if 

       end subroutine testKernelCallCublas 

end module testBLAS 

       program prog 
       use testBLAS 
       implicit none 
       real, allocatable, device :: dA(:,:), dB(:,:) 
       real :: A(4,2), B(4,4) 

       A(:,1)=1. 
       A(:,2)=0.5 
pause
       allocate(dA(4,2), dB(4,4))
print*, "after allocating"
pause
       dA=A 
       call testKernelCallCublas<<<1,16>>>(dA,4,2,dB) 
       B=dB 

       print*,B(:,1) 
       print*,B(:,2) 
       print*,B(:,3) 
       print*,B(:,4) 

       deallocate(dA, dB)
       end program prog



  • [root@manysplendid cublas]# pgfortran -Mcuda=cc3.5 -lcublas_device test2.cuf -o testCublas.x
    nvlink warning : SM Arch (‘sm_35’) not found in ‘/opt/pgi/linux86-64/2016/cuda/7.0/lib64/libcublas_device.a:maxwell_sgemm.asm.o’
    nvlink warning : SM Arch (‘sm_35’) not found in ‘/opt/pgi/linux86-64/2016/cuda/7.0/lib64/libcublas_device.a:maxwell_sm50_sgemm.o’
    nvlink warning : SM Arch (‘sm_35’) not found in ‘/opt/pgi/linux86-64/2016/cuda/7.0/lib64/libcublas_device.a:maxwell_sm50_ssyrk.o’
    [root@manysplendid cublas]# ./testCublas.x
    FORTRAN PAUSE: enter or d to continue>
    after allocating
    FORTRAN PAUSE: enter or d to continue>
    1.250000 1.250000 1.250000 1.250000
    1.250000 1.250000 1.250000 1.250000
    1.250000 1.250000 1.250000 1.250000
    1.250000 1.250000 1.250000 1.250000
    [root@manysplendid cublas]#

Hi CY,

Sorry, but I don’t know where the extra memory usage is coming from. My best guess is that cuBLAS adds some extra space when the CUDA context is created, which happens when you allocate your device arrays (the context is created at first use of the device).
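
One way to separate actual device-memory use from host virtual address space is to query the device directly. A minimal sketch, assuming the standard cudaMemGetInfo interface from cudafor:

       program memcheck
       use cudafor
       implicit none
       integer(kind=cuda_count_kind) :: free, total
       real, allocatable, device :: dA(:,:), dB(:,:)
       integer :: istat

       ! the first device allocation triggers context creation
       allocate(dA(4,2), dB(4,4))
       ! free/total device memory, reported in bytes
       istat = cudaMemGetInfo(free, total)
       print*, "device memory free/total (MB):", free/2**20, total/2**20
       deallocate(dA, dB)
       end program memcheck

The VIRT figure from ps includes whatever virtual address space the driver reserves when the context comes up, which can dwarf the few bytes these arrays actually occupy on the device.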

  • Mat

Hi Mat,

Thanks for your response. I removed all the code related to cuBLAS and compiled with -Mcuda.

[root@manysplendid cublas]# cat test3.f90
       program prog
       implicit none
       real, allocatable, device :: dA(:,:), dB(:,:)
       real :: A(4,2), B(4,4)

       A(:,1)=1.
       A(:,2)=0.5
pause
       allocate(dA(4,2), dB(4,4))
print*, "after allocating"
pause
       dA=A
       B=dB

       print*,B(:,1)
       print*,B(:,2)
       print*,B(:,3)
       print*,B(:,4)

       deallocate(dA, dB)
       end program prog

[root@manysplendid cublas]# pgf90 -Mcuda test3.f90 -o test3.x
[root@manysplendid cublas]# ./test3.x

However, I see the same behavior… @@

  • [root@manysplendid ~]# ps aux | grep test3.x
    root 27246 1.0 0.3 63349356 116948 pts/0 Sl+ 07:03 0:00 ./test3.x
    root 27367 0.0 0.0 105452 936 pts/2 S+ 07:04 0:00 grep test3.x
    [root@manysplendid ~]#

CY

It looks like it’s the context creation, which happens upon first use of the device; in this case that is the allocation of the arrays. If you add some device code before the allocation (for example, the sketch below), you’ll see the virtual memory show up at that point instead.
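
A minimal sketch based on your test3.f90 (untested here; cudaDeviceSynchronize is just one convenient way to force the context into existence):

       program prog
       use cudafor
       implicit none
       real, allocatable, device :: dA(:,:), dB(:,:)
       integer :: istat

       ! touch the device first, so the CUDA context (and its large
       ! virtual-memory footprint) is created here...
       istat = cudaDeviceSynchronize()
pause
       ! ...and not at the first device allocation below
       allocate(dA(4,2), dB(4,4))
print*, "after allocating"
pause
       deallocate(dA, dB)
       end program prog

With this change, the jump in VIRT should show up at the synchronize call rather than at the allocate.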

  • Mat

Hi Mat,

Yes…
I added ‘j=cudaGetDeviceCount(i)’ before allocating the device arrays, and VIRT jumped to 63,163,476 at that point. Thank you. But this didn’t happen with previous versions of the compiler… is that normal?

CY

Hi Cy,

I spot-checked PGI versions all the way back to 12.10 and see the same behavior.

I don’t have a way to roll back the CUDA driver version, so I can’t test that, but since context creation is done by the driver, different driver versions could reserve differing amounts of virtual memory.
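
If it helps to track this across driver installs, you can print the driver and runtime versions from CUDA Fortran; a minimal sketch using the standard version-management calls:

       program versions
       use cudafor
       implicit none
       integer :: istat, idriver, iruntime

       ! both values are encoded as 1000*major + 10*minor, e.g. 7000 = CUDA 7.0
       istat = cudaDriverGetVersion(idriver)
       istat = cudaRuntimeGetVersion(iruntime)
       print*, "CUDA driver version:  ", idriver
       print*, "CUDA runtime version: ", iruntime
       end program versions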

  • Mat