Loop over cuBLAS routine

Hi Mat,

Can I parallelize a do loop that contains a call to cublasDgemm with OpenACC?

For example see the code below.

      PROGRAM GPUMATMUL
      USE CUBLAS
      IMPLICIT NONE
      
      INTEGER I, J, K, NR, NZ, NN
      PARAMETER (NR=20, NZ=100, NN=1280)

      DOUBLE PRECISION WORK(1:NR-1,1:NZ-1,2,0:NN-1)
      DOUBLE PRECISION H(1:NR-1,1:NZ-1,0:NN-1)
      DOUBLE PRECISION BZN(1:NZ-1,1:NZ-1), ID
      
      DO K=0, NN-1
         DO J=1, NZ-1
            DO I=1, NR-1
               ID = DBLE(J + (NZ-1)*(I-1))
               H(I,J,K) = ID
            END DO
         END DO
      END DO
      
      BZN = 1.0D0
    
!$ACC DATA COPYIN(H,BZN) COPYOUT(WORK)
!$ACC HOST_DATA USE_DEVICE(H,BZN,WORK)
! One DGEMM per K slab; USE_DEVICE supplies the device pointers to cuBLAS
      DO K=0, NN-1
        CALL CUBLASDGEMM('N','N',NR-1,NZ-1,NZ-1,1.d0,H(1,1,K),NR-1,BZN,
     &                  NZ-1,0.D0,WORK(1,1,1,K),NR-1)
      END DO
!$ACC END HOST_DATA     
!$ACC END DATA    

      END PROGRAM

Is it possible to do all the matrix multiplications at once in one kernel using OpenACC? Or another way to do so?

On a side note, I have a question about cublasDgemm. From my reading of the OpenACC documentation, my understanding is that cublasDgemm is called on the host, which passes it the device pointers supplied by the USE_DEVICE clause (the data itself having already been copied to the GPU by the enclosing DATA region). The matrix multiplication is then performed on the device/GPU. Is this correct?

Hi mkrygier1,

Is it possible to do all the matrix multiplications at once in one kernel using OpenACC? Or another way to do so?

Now that we have the ability to link device code, it should be possible to call the device-side, “v2”, versions of the cuBLAS routines from within an OpenACC compute kernel.

I’m trying to write an example now and have been able to run the code successfully. However, while it runs without error, it returns zeros as a result. I’m not sure if it’s my mistake or a bug someplace.

I’ll continue to work on it, but my workload is pretty heavy right now, so it may be a bit before I can get back to it.
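In the meantime, if the matrices sit at a fixed stride in memory as they do in your code, another way to launch all NN multiplications with one library call is a host-side cublasDgemmStridedBatched inside your host_data region (it needs a reasonably recent CUDA/cuBLAS). A rough sketch only; I haven’t compiled this against your setup, and the handle-based interface, the CUBLAS_OP_N constant, and the stride argument kinds are assumptions you should check against your compiler’s cublas module documentation:

```fortran
! Sketch: one strided-batched GEMM instead of the K loop.
! Assumes the cublas module provides the handle-based ("v2")
! cublasDgemmStridedBatched interface; strides are in elements.
      TYPE(cublasHandle) :: HANDLE
      INTEGER :: ISTAT
      INTEGER(8) :: SA, SB, SC

      ISTAT = cublasCreate(HANDLE)
      SA = (NR-1)*(NZ-1)        ! stride between H(:,:,K) slabs
      SB = 0                    ! reuse the same BZN for every batch
      SC = 2*(NR-1)*(NZ-1)      ! WORK's third dimension is 2
!$ACC DATA COPYIN(H,BZN) COPYOUT(WORK)
!$ACC HOST_DATA USE_DEVICE(H,BZN,WORK)
      ISTAT = cublasDgemmStridedBatched(HANDLE, CUBLAS_OP_N,
     &        CUBLAS_OP_N, NR-1, NZ-1, NZ-1, 1.0D0, H, NR-1, SA,
     &        BZN, NZ-1, SB, 0.0D0, WORK, NR-1, SC, NN)
!$ACC END HOST_DATA
!$ACC END DATA
      ISTAT = cublasDestroy(HANDLE)
```

A stride of zero for BZN reuses the same B matrix for every batch entry; if your cuBLAS version doesn’t accept that, you’d need to replicate BZN instead.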

The matrix multiplication is then performed on the device/gpu. Is this correct?

Correct. The cuBLAS library will call a highly optimized CUDA kernel to perform the computation.

  • Mat

Hi Mat,

I’m trying to write an example now and have been able to run the code successfully. However, while it runs without error, it returns zeros as a result. I’m not sure if it’s my mistake or a bug someplace.

I’ll continue to work on it, but my workload is pretty heavy right now, so it may be a bit before I can get back to it.

That’s great, and I completely understand your situation! I look forward to seeing the example code when you’re finished with it.

Correct. The cuBLAS library will call a highly optimized CUDA kernel to perform the computation.

Thank you for verifying my understanding.

Thanks again!

Sincerely,
Krygier