Loop over cuBLAS routine

Hi Mat,

Can I parallelize a do loop that contains a call to cublasDgemm with OpenACC?

For example see the code below.

      PROGRAM GPUMATMUL
      USE CUBLAS
      IMPLICIT NONE
      
      INTEGER I, J, K, NR, NZ, NN
      PARAMETER (NR=20, NZ=100, NN=1280)

      DOUBLE PRECISION WORK(1:NR-1,1:NZ-1,2,0:NN-1)
      DOUBLE PRECISION H(1:NR-1,1:NZ-1,0:NN-1)
      DOUBLE PRECISION BZN(1:NZ-1,1:NZ-1), ID
      
      DO K=0, NN-1
         DO J=1, NZ-1
            DO I=1, NR-1
               ID = DBLE(J + (NZ-1)*(I-1))
               H(I,J,K) = ID
            END DO
         END DO
      END DO
      
      BZN = 1.0D0
    
!$ACC DATA COPYIN(H,BZN) COPYOUT(WORK)
!$ACC HOST_DATA USE_DEVICE(H,BZN,WORK)
! One DGEMM per K slab; USE_DEVICE supplies the device pointers to cuBLAS
      DO K=0, NN-1
        CALL CUBLASDGEMM('N','N',NR-1,NZ-1,NZ-1,1.d0,H(1,1,K),NR-1,BZN,
     &                  NZ-1,0.D0,WORK(1,1,1,K),NR-1)
      END DO
!$ACC END HOST_DATA     
!$ACC END DATA    

      END PROGRAM

Is it possible to do all the matrix multiplications at once in one kernel using OpenACC? Or another way to do so?

On a side note, I have a question about cublasDgemm. From my reading of the OpenACC documentation, my understanding is that cublasDgemm is called on the host, which passes it the device pointers supplied by the USE_DEVICE clause (the data itself having already been copied to the GPU by the enclosing DATA region). The matrix multiplication is then performed on the device/GPU. Is this correct?

Hi mkrygier1,

Is it possible to do all the matrix multiplications at once in one kernel using OpenACC? Or another way to do so?

Now that we have the ability to link device code, it should be possible to call the device-side, “v2”, versions of the cuBLAS routines from within an OpenACC compute kernel.

I’m trying to write an example now and have been able to run the code successfully. However, while it runs without error, it returns zeros as a result. I’m not sure if it’s my mistake or a bug someplace.

I’ll continue to work on it, but my workload is pretty heavy right now, so it may be a bit before I can get back to it.
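In the meantime, if the matrices sit at a fixed stride in memory as they do in your code, another way to launch all NN multiplications with one library call is a host-side cublasDgemmStridedBatched inside your host_data region (it needs a reasonably recent CUDA/cuBLAS). A rough sketch only; I haven’t compiled this against your setup, and the handle-based interface, the CUBLAS_OP_N constant, and the stride argument kinds are assumptions you should check against your compiler’s cublas module documentation:

```fortran
! Sketch: one strided-batched GEMM instead of the K loop.
! Assumes the cublas module provides the handle-based ("v2")
! cublasDgemmStridedBatched interface; strides are in elements.
      TYPE(cublasHandle) :: HANDLE
      INTEGER :: ISTAT
      INTEGER(8) :: SA, SB, SC

      ISTAT = cublasCreate(HANDLE)
      SA = (NR-1)*(NZ-1)        ! stride between H(:,:,K) slabs
      SB = 0                    ! reuse the same BZN for every batch
      SC = 2*(NR-1)*(NZ-1)      ! WORK's third dimension is 2
!$ACC DATA COPYIN(H,BZN) COPYOUT(WORK)
!$ACC HOST_DATA USE_DEVICE(H,BZN,WORK)
      ISTAT = cublasDgemmStridedBatched(HANDLE, CUBLAS_OP_N,
     &        CUBLAS_OP_N, NR-1, NZ-1, NZ-1, 1.0D0, H, NR-1, SA,
     &        BZN, NZ-1, SB, 0.0D0, WORK, NR-1, SC, NN)
!$ACC END HOST_DATA
!$ACC END DATA
      ISTAT = cublasDestroy(HANDLE)
```

A stride of zero for BZN reuses the same B matrix for every batch entry; if your cuBLAS version doesn’t accept that, you’d need to replicate BZN instead.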

The matrix multiplication is then performed on the device/gpu. Is this correct?

Correct. The cuBLAS library will call a highly optimized CUDA kernel to perform the computation.

  • Mat

Hi Mat,

I’m trying to write an example now and have been able to run the code successfully. However, while it runs without error, it returns zeros as a result. I’m not sure if it’s my mistake or a bug someplace.

I’ll continue to work on it, but my workload is pretty heavy right now, so it may be a bit before I can get back to it.

That’s great, and I completely understand your situation! I look forward to seeing the example code when you’re finished with it.

Correct. The cuBLAS library will call a highly optimized CUDA kernel to perform the computation.

Thank you for verifying my understanding.

Thanks again!

Sincerely,
Krygier