cuBLAS Dgemm "Could not Resolve Generic Procedure

yunus.altintop.2 · October 7, 2021, 11:30am

Hi,

When I use cuBLAS Dgemm routine, it gave “NVFORTRAN-S-0155-Could not resolve generic procedure cublasdgemm (nondev.f90: 66)”. I tried two different methods; one of them with the C pointers, C_DEVLOC( ), and the other one without it. Both gave the same error. The codes are as follows:

Program main

! CUDA MODULES
USE openacc
USE cublas
USE cudafor

IMPLICIT NONE

! COUNTERS
INTEGER :: A2,L,L0,M,P1,P2,NUCY

! VARIABLES
INTEGER, parameter :: N = 2000
INTEGER, parameter :: P = 1000

DOUBLE PRECISION, allocatable, dimension(:,:) :: X
DOUBLE PRECISION, allocatable, dimension(:,:) :: Y
DOUBLE PRECISION, allocatable, dimension(:,:) :: Z

! CUDA VARIABLES
TYPE(C_DEVPTR) :: devptr_A, devptr_B, devptr_C
TYPE(cublasHandle) :: handle

INTEGER :: status

DOUBLE PRECISION :: alpha, beta

alpha = 1.0D0
beta = 0.0D0

ALLOCATE(X(2,P))
ALLOCATE(Y(N,P))
ALLOCATE(Z(N,1))

Y = 1.0D0
Z = 0.180D0
X = 0.0D0

P1 = 0
  DO L0 = 2,NUCY
    L = L0-2
    DO M = -L,L
      P2 = P1
      DO A2 = 2,3
        P2 = P2+1
	  
	  ! CUDA BLAS ROUTINE -----------------------------------------------------
	  
	  status =cublasCreate(handle)    
	        
	  !$ACC data copy(Y,Z(:,P2),X(A2-1,:)) create(devptr_A, devptr_B, devptr_C) 
	      
	  !$ACC host_data use_device(Y,Z(:,P2),X(A2-1,:))
	  devptr_A = C_DEVLOC(Y)
	  devptr_B = C_DEVLOC(Z(:,P2))
	  devptr_C = C_DEVLOC(X(A2-1,:))
	  !$ACC end host_data
	      
	  !$ACC update device(devptr_A, devptr_B, devptr_C)
	      
	  status = cudaDeviceSynchronize()
	      
	  !$ACC host_data use_device(devptr_A, devptr_B, devptr_C)
	      
	  status = cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, P, 1, N, alpha, devptr_A, N, devptr_B, N, beta, devptr_C, P)
	      
	  !$ACC end host_data
	      
	  status = cudaDeviceSynchronize()
	      
	  !$ACC end data
	      
	  status = cublasDestroy(handle)
	      
	  ! -----------------------------------------------------------------------

      END DO
    END DO ! M
  END DO ! L0

END program main

The one without C_DEVLOC( )

Program main

! CUDA MODULES
USE openacc
USE cublas

IMPLICIT NONE

! COUNTERS
INTEGER :: A2,L,L0,M,P1,P2,NUCY

! VARIABLES
INTEGER, parameter :: N = 2000
INTEGER, parameter :: P = 1000

DOUBLE PRECISION, allocatable, dimension(:,:) :: X
DOUBLE PRECISION, allocatable, dimension(:,:) :: Y
DOUBLE PRECISION, allocatable, dimension(:,:) :: Z

! CUDA VARIABLES
TYPE(cublasHandle) :: handle

INTEGER :: status

DOUBLE PRECISION :: alpha, beta

alpha = 1.0D0
beta = 0.0D0

ALLOCATE(X(2,P))
ALLOCATE(Y(N,P))
ALLOCATE(Z(N,1))

Y = 1.0D0
Z = 0.180D0

NUCY = 4
P1 = 0

!$acc data copyin(Y,Z) create(X)

  DO L0 = 2,NUCY
    L = L0-2
    DO M = -L,L
      P2 = P1
      DO A2 = 2,3
        P2 = P2+1      
				  
	! CUDA BLAS ROUTINE -----------------------------------------------------
	      
	  status =cublasCreate(handle)    
	  
	  IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"
	      
	  !$ACC host_data use_device(Y,Z(:,P2),X(A2-1,:))
	      
	  status = cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, P, 1, N, alpha, Y, N, Z(:,P2), N, beta, X(A2-1,:), P)
	  
	  !$ACC update host(X(A2-1,:))    
	      
	  !$ACC end host_data
	      
	  IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
	      
          status = cublasDestroy(handle)
	      
        ! -----------------------------------------------------------------------
      END DO
    END DO ! M
  END DO ! L0

!$ACC end data

END program main

Both works with the Makefile as follows:

FC=nvfortran
TIMER=/usr/bin/time
OPT=
NOPT=-fast -Minfo=opt $(OPT)
FCFLAGS = -Mpreprocess -fast -acc=gpu -cuda -Mcudalib=cublas -Mcudalib=cusparse

nondev: nondev.o
$(TIMER) ./nondev.o $(STEPS)
nondev.o: nondev.f90
$(FC) $(FCFLAGS) -o $@ $< $(NOPT) -Minfo=accel -acc

clean:
rm -f *.o *.exe *.s *.mod a.out

This region of the code works fine with Lapack. I could not figure out what the problem is.

Thanks,
Yunus

MatColgrove · October 7, 2021, 4:39pm

Hi Yunus,

There’s a couple of issues here. First, there are three different interfaces for cublasDgemm and you’re mixing two of them. See: NVIDIA Fortran CUDA Library Interfaces Version 22.7 for ARM, OpenPower, x86

Plain “cublasDgemm” does not take a handle while the “v2” version does. So either change the call to “cublasDgemm_v2” or “use cublas_v2” which will switch “cublasDgemm” to use the v2 version.

Next, you can’t pass in non-contiguous array slices, i.e. “X(A2-1,:)”, since this would require the compiler to create a temp array of the device array on the host. You’ll need to pack it in you’re own temp array, or switch the dimensions, i.e. “X(P,2)” and pass “X(:,A2-1)”.

Move the “acc update” of “X” after the host data region since within the host data region it’s using the device pointer.

Finally, the interfaces are expecting device arrays, not raw pointers, so you wouldn’t want to use the “C_DEVLOC” version. The raw pointers are only used with the batched version, cublasDgemmBatched, since it’s expecting an array of pointers.

Something like the following (I haven’t checked for correctness):

Program main

! CUDA MODULES
USE openacc
USE cublas_v2

IMPLICIT NONE

! COUNTERS
INTEGER :: A2,L,L0,M,P1,P2,NUCY

! VARIABLES
INTEGER, parameter :: N = 2000
INTEGER, parameter :: P = 1000

DOUBLE PRECISION, allocatable, dimension(:,:) :: X
DOUBLE PRECISION, allocatable, dimension(:,:) :: Y
DOUBLE PRECISION, allocatable, dimension(:,:) :: Z

! CUDA VARIABLES
TYPE(cublasHandle) :: handle

INTEGER :: status

DOUBLE PRECISION :: alpha, beta

alpha = 1.0D0
beta = 0.0D0

ALLOCATE(X(P,2))
ALLOCATE(Y(N,P))
ALLOCATE(Z(N,1))

Y = 1.0D0
Z = 0.180D0

NUCY = 4
P1 = 0

!$acc data copyin(Y,Z) create(X)

  DO L0 = 2,NUCY
    L = L0-2
    DO M = -L,L
      P2 = P1
      DO A2 = 2,3
        P2 = P2+1

        ! CUDA BLAS ROUTINE -----------------------------------------------------

          status =cublasCreate(handle)

          IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed"

          !$ACC host_data use_device(Y,Z(:,P2),X(:,A2-1))
          status = cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, P, 1, N, alpha, Y, N, Z(:,P2), N, beta, X(:,A2-1), P)
          !$ACC end host_data
          !$ACC update host(X(:,A2-1))

          IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status

          status = cublasDestroy(handle)

        ! -----------------------------------------------------------------------
      END DO
    END DO ! M
  END DO ! L0

!$ACC end data

END program main

Hope this helps,
Mat

yunus.altintop.2 · October 7, 2021, 4:57pm

Thank you. These fixed the problems.

-Yunus

yunus.altintop.2 · October 17, 2021, 7:51pm

Hi again,

When I tried to use cublasDgemm to get a matrix from multiplication of two vectors, it does not give the correct result.

Program main

! CUDA MODULES
USE openacc
USE cublas_v2

IMPLICIT NONE

! COUNTERS
INTEGER :: A2,L,L0,M,P1,P2,NUCY

DOUBLE PRECISION :: T

! VARIABLES
INTEGER, parameter :: N = 2000
INTEGER, parameter :: P = 1000

DOUBLE PRECISION, allocatable, dimension(:,:) :: X
DOUBLE PRECISION, allocatable, dimension(:,:) :: V
DOUBLE PRECISION, allocatable, dimension(:,:) :: Z

! CUDA VARIABLES
TYPE(cublasHandle) :: handle

INTEGER :: status

DOUBLE PRECISION :: alpha, beta

beta = 1.0D0

ALLOCATE(X(P,2))
ALLOCATE(V(N,P))
ALLOCATE(Z(N,1))

X = 1.0D0
V = 1.0D0
Z = 0.180D0

NUCY = 4
P1 = 0

T = 1.0D0

!$acc data copyin(Z,X,V) copyout(V)

  DO L0 = 2,NUCY
    L = L0-2
    DO M = -L,L
      P2 = P1
      T = T + 1.0D0
      DO A2 = 2,3
        P2 = P2+1      
	
	alpha = T*T
		  
	! CUDA BLAS ROUTINE -----------------------------------------------------
	      
	  status =cublasCreate(handle)    
	  
	  IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed", status
	      
	  !$ACC host_data use_device(Z(:,P2),X(:,A2-1),V)
	      
	  status = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, N, P, 1, alpha, Z(:,P2), N, X(:,A2-1), N, beta, V, N)  
	      
	  !$ACC end host_data
	      
	  !$ACC update host(V)  
	      
	  IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
	      
          status = cublasDestroy(handle)
	      
        ! -----------------------------------------------------------------------
      END DO
    END DO ! M
  END DO ! L0

!$ACC end data

END program main

It works but does not give the expected results. It gives extremely different results. It was very similar to the earlier question but I could not solve the problem. If you can help, it would be amazing.

Thanks,
-Yunus

MatColgrove · October 18, 2021, 3:24pm

Hi Yunus,

Shouldn’t you allocating Z’s second dimension to 2? The “P2” index has the values 1 and 2, so if Z’s dimension is only 1, the code is reading out of bounds.

Though, I didn’t see any verification in the example, so I’m not sure how you’re checking for correct results.

Here’s what I did. Note that I updated the code a bit but these are just improvements and shouldn’t impact verification.

-Mat

% cat test.f90
Program main

! CUDA MODULES
USE openacc
USE cublas_v2

IMPLICIT NONE

! COUNTERS
INTEGER :: A2,L,L0,M,P1,P2,NUCY

DOUBLE PRECISION :: T

! VARIABLES
INTEGER, parameter :: N = 2000
INTEGER, parameter :: P = 1000

DOUBLE PRECISION, allocatable, dimension(:,:) :: X
DOUBLE PRECISION, allocatable, dimension(:,:) :: V
DOUBLE PRECISION, allocatable, dimension(:,:) :: Z

! CUDA VARIABLES
TYPE(cublasHandle) :: handle

INTEGER :: status

DOUBLE PRECISION :: alpha, beta

beta = 1.0D0

ALLOCATE(X(P,2))
ALLOCATE(V(N,P))
ALLOCATE(Z(N,2))

X = 1.0D0
V = 1.0D0
Z = 0.180D0

NUCY = 4
P1 = 0

T = 1.0D0
  status =cublasCreate(handle)
  IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasCreate failed", status

!$acc data copyin(Z,X) copy(V)

  DO L0 = 2,NUCY
    L = L0-2
    DO M = -L,L
      P2 = P1
      T = T + 1.0D0
      DO A2 = 2,3
        P2 = P2+1

        alpha = T*T

        ! CUDA BLAS ROUTINE -----------------------------------------------------
          !$ACC host_data use_device(Z(:,P2),X(:,A2-1),V)
          status = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, N, P, 1, alpha, Z(:,P2), N, X(:,A2-1), N, beta, V, N)
          !$ACC end host_data
          !$ACC update host(V)
          print *, V(1,1)
        ! -----------------------------------------------------------------------
      END DO
    END DO ! M
  END DO ! L0

!$ACC end data
          IF(status /= CUBLAS_STATUS_SUCCESS) write(*,*) "cublasDgemm failed:", status
          status = cublasDestroy(handle)
          print *, "END:", V(1,1)


END program main

yunus.altintop.2 · October 18, 2021, 5:08pm

Hi Mat,

You are right, I shared it with a type-o. Sorry. I prepared a dummy of my real code. Actually I am testing at the real code.

I solved the problem with your example. As it seems in my example, I transferred the data with

!$acc data copyin(Z,X,V) copyout(V)

I thought that “!$acc data copyin(V) copyout(V)” is the same with “!$acc data copy(V)”.
Yet when I change the data transfer directive to

!$acc data copyin(Z,X) copy(V)

, it started to give the true results that I calculated in the LAPACK or a basic code of mine.

It seems awkward, are not they supposed to be the same?

Thanks

-Yunus

MatColgrove · October 18, 2021, 8:44pm

Nope, data clauses aren’t combined but processed individually.

“present_or” semantics has the compiler runtime test whether a variable is a already present on the device. If it is present, then no action is taken. So when processed in order (either left-to-right or right-to-left is implementation dependent), the first data clause encountered is performed but the second is ignored.

system · December 17, 2021, 8:45pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
PGF90-S-0155-Could not resolve generic procedure cublasdgemm Legacy PGI Compilers	4	3888	November 10, 2018
cublasZgemm() gives false result for large data and potential bug GPU-Accelerated Libraries	6	1146	October 12, 2021
0: copyout Memcpy... FAILED: 4(unspecified launch failure) Legacy PGI Compilers	3	5244	May 13, 2013
openacc with cublas batched routine in fortran Legacy PGI Compilers	7	8219	January 27, 2017
Issue when calling cublasDdot from within kernel GPU-Accelerated Libraries	7	925	March 21, 2018
cuBLAS call from kernel in CUDA 10.0 GPU-Accelerated Libraries	9	4831	April 7, 2021
New cuBlas function "getrfBatched" Legacy PGI Compilers	5	8523	January 21, 2014
Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemsetAsync CUDA Programming and Performance	7	7459	January 11, 2020
Strange "unspecified launch error" from a call to cublas gemm CUDA Programming and Performance	23	2677	January 19, 2019
cublasDmatinvBatched typemismatch nvc, nvc++ and nvfortran	6	978	October 28, 2021

cuBLAS Dgemm "Could not Resolve Generic Procedure

Related topics