Using cuBLAS in Device Kernels

I want to use cublasDgetrfBatched and cublasDgetriBatched to batch-invert small (6x6) matrices. These functions are called after a syncthreads call halfway through a device kernel. The idea is to invert a thread block's worth of these small matrices as one batch.
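To make the per-matrix operation concrete, here is a minimal sketch in standard Fortran of inverting one small dense matrix. It is not what the cuBLAS batched routines do internally: they use an LU factorization followed by an inversion step, whereas this sketch uses Gauss-Jordan elimination with partial pivoting because it is shorter. The module name small_inverse and the routine name invert_small are made up for illustration.

```fortran
module small_inverse
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical helper (not a cuBLAS or PGI routine).
  ! Inverts the n x n matrix a by Gauss-Jordan elimination with partial
  ! pivoting. On return, ainv holds the inverse and info = 0, or info = k
  ! if column k has no nonzero pivot (the matrix is singular).
  subroutine invert_small(a, n, ainv, info)
    integer,  intent(in)  :: n
    real(dp), intent(in)  :: a(n,n)
    real(dp), intent(out) :: ainv(n,n)
    integer,  intent(out) :: info
    real(dp) :: w(n,n), tmp, piv, f
    integer  :: i, j, k, p

    ! Work on a copy of a; start ainv as the identity matrix.
    w = a
    ainv = 0.0_dp
    do i = 1, n
      ainv(i,i) = 1.0_dp
    end do
    info = 0

    do k = 1, n
      ! Pick the row with the largest magnitude entry in column k.
      p = k
      do i = k + 1, n
        if (abs(w(i,k)) > abs(w(p,k))) p = i
      end do
      if (w(p,k) == 0.0_dp) then
        info = k
        return
      end if

      ! Swap rows k and p in both matrices.
      if (p /= k) then
        do j = 1, n
          tmp = w(k,j);    w(k,j) = w(p,j);       w(p,j) = tmp
          tmp = ainv(k,j); ainv(k,j) = ainv(p,j); ainv(p,j) = tmp
        end do
      end if

      ! Scale the pivot row so that w(k,k) becomes 1.
      piv = w(k,k)
      do j = 1, n
        w(k,j)    = w(k,j) / piv
        ainv(k,j) = ainv(k,j) / piv
      end do

      ! Eliminate column k from every other row.
      do i = 1, n
        if (i /= k) then
          f = w(i,k)
          do j = 1, n
            w(i,j)    = w(i,j)    - f * w(k,j)
            ainv(i,j) = ainv(i,j) - f * ainv(k,j)
          end do
        end if
      end do
    end do
  end subroutine invert_small
end module small_inverse
```

A caller would invoke it as call invert_small(a, 6, ainv, info) for each 6x6 matrix and check that info is 0. Whether a routine like this could be marked attributes(device) and called once per thread under CUDA Fortran is an assumption I have not verified; the automatic work array w in particular may need to become a fixed-size array.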

When I compile, I get the message "calls from device code to a host function are allowed only in emulation mode", along with a message that the function has not been declared. I think this simply means that the library functions aren't being found in the module brought in by USE cublas_device. I can't find good, up-to-date documentation on what the predefined modules contain.

Do I need to define my own interface for these library functions on a K20 card with CUDA 5.0 and compiler version 13.4?

I modeled my cuBLAS call after the cuBLAS example in
http://www.pgroup.com/lit/articles/insider/v5n1a2.htm

where a cublas call looked like the following:

CONTAINS
  attributes(global) subroutine dgemm16(a, b, c, m, n, k)
    use cublas_device
    integer, value :: m, n, k
    double precision, device :: a(m,*), b(k,*), c(m,*)
    double precision, device :: alpha, beta
    type(cublasHandle) :: ch1
    integer transa, transb
    i = threadIdx%x
    if (i.eq.1) then
        istat = cublasCreate_v2(ch1)
        alpha = 1.0d0
        beta  = 0.0d0
        transa = cublas_op_n
        transb = cublas_op_n
        istat = cublasDgemm_v2(ch1, transa, transb, m, n, k, alpha, &
                                   a, m, b, k, beta, c, m)
        istat = cublasDestroy_v2(ch1)
    end if
    return
    end subroutine

This snippet is compiled with

  pgf90 -Mcuda=cuda5.0,cc35,rdc -fast dgemmdevcublas.cuf -o dgemmdevcublas.exe -lcublas_device

My make file is:

FLAGS = -V13.4 -fast -Mconcur=innermost 
FLAGS_CUDA =-Mcuda=cuda5.0,cc35,rdc -tp:x64 -lcublas_device
F90=pgf90

# Variables 
SOURCES = Variables.f90 CUDA_Kernels.f90 cpty.f90
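OBJECTS = $(SOURCES:.f90=.o)
EXECUTABLE = CUDA_Parent

all: $(EXECUTABLE)

# Link the object files; -lcublas_device comes from FLAGS_CUDA
$(EXECUTABLE): $(OBJECTS)
	$(F90) $(FLAGS) $(FLAGS_CUDA) $(OBJECTS) -o $@

# Compile each source file to an object file
%.o: %.f90
	$(F90) $(FLAGS) $(FLAGS_CUDA) -c $< -o $@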

# Cleans 
.PHONY: clean
clean:
	rm -f *.mod *.o $(EXECUTABLE)