cublasGemmEx() with __dp4a() can't be called within a kernel, whereas cublasSgemm() can be called

Hello,

cublasGemmEx() with __dp4a() can’t be called from a kernel declared with the __global__ keyword; it gives an undefined symbol error, whereas in the same file cublasSgemm() can be called successfully.

Is dynamic parallelism supported for cublasGemmEx() or not?

Thanks

Makefile:

/usr/local/cuda-8.0/bin/nvcc -I/usr/local/cuda-8.0/samples/common/inc/ -O3 -std=c++11 --cubin --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 kernel.cu -link -L/usr/local/cuda-8.0/lib64 -lcudadevrt -lcublas -lcublas_device -dlink -o kernel.cubin

Linking error:

/usr/local/cuda-8.0/bin/nvcc -I/usr/local/cuda-8.0/samples/common/inc/ -O3 -std=c++11 --cubin --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 binary_kernels.cu -link -L/usr/local/cuda-8.0/lib64 -lcudadevrt -lcublas -lcublas_device -dlink -o kernel.cubin
ptxas info : ‘device-function-maxrregcount’ is a BETA feature
(the line above is repeated 31 times in the build output)
nvlink error : Undefined reference to ‘cublasGemmEx’ in ‘/tmp/tmpxft_000027aa_00000000-15_binary_kernels.o’
makefile:2: recipe for target ‘all’ failed
make: *** [all] Error 255

File: kernel.cu

#include <cublas_v2.h>

extern "C" {
__global__ void invokeDeviceCublasSgemm(float* A, float* B, float* C, int m, int n, int k)
{
    cublasHandle_t cnpHandle;
    cublasStatus_t status = cublasCreate(&cnpHandle);

    float alpha = 1.0;
    float beta = 0.0;

    /* Perform operation using cublas */
    status = cublasSgemm(cnpHandle,
                         CUBLAS_OP_N, CUBLAS_OP_N,
                         k, m, n,
                         &alpha,
                         B, k,
                         A, n,
                         &beta,
                         C, k);

    cublasDestroy(cnpHandle);
}
}
extern "C" {
__global__ void invokeDeviceCublasGemmEx(char* A, char* B, int* C, int m, int n, int k)
{
    cublasHandle_t cnpHandle;
    cublasStatus_t status = cublasCreate(&cnpHandle);

    int alpha = 1;
    int beta = 0;

    /* Perform operation using cublas */
    status = cublasGemmEx(cnpHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                          k, m, n,
                          &alpha, B, CUDA_R_8I, k,
                                  A, CUDA_R_8I, n,
                          &beta,  C, CUDA_R_32I, k,
                          CUDA_R_32I, CUBLAS_GEMM_DFALT); // colE x rowD

    cublasDestroy(cnpHandle);
}
}

Update:

The undefined symbol above is present in the static library below:

/usr/local/cuda-8.0/lib64/libcublas_static.a
00000000000013a0 T cublasGemmEx

However, changing the Makefile as below to include this library (-lcublas_static) still gives the same error:

/usr/local/cuda-8.0/bin/nvcc -I/usr/local/cuda-8.0/samples/common/inc/ -O3 -std=c++11 --cubin --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 kernel.cu -link -L/usr/local/cuda-8.0/lib64 -lcublas_static -lcudadevrt -lcublas -lcublas_device -dlink -o kernel.cubin

Yes, it appears to be a problem. I have filed an (internal) bug with NVIDIA.
You’re welcome to file a bug too. I see you already know how.

As an immediate workaround, I suggest calling cublasGemmEx from host code rather than from device code.
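
For example, a minimal host-side sketch of the same int8 GEMM (the function name hostInvokeCublasGemmEx is just illustrative, and error checking is trimmed):

#include <cublas_v2.h>

void hostInvokeCublasGemmEx(const char* A, const char* B, int* C, int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int alpha = 1;
    int beta  = 0;

    // A and B are device pointers to int8 data, C to int32,
    // with the same dimensions as in the device-side version above
    cublasStatus_t status =
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     k, m, n,
                     &alpha, B, CUDA_R_8I,  k,
                             A, CUDA_R_8I,  n,
                     &beta,  C, CUDA_R_32I, k,
                     CUDA_R_32I, CUBLAS_GEMM_DFALT);

    cublasDestroy(handle);
}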

Hello txbob,

I can’t find a workaround that solves my use case: I am calling the above function in the .cu file from PyCUDA, but this

gemm_kernel = mod.get_function("invokeDeviceCublasGemmEx")

fails with:

pycuda._driver.LogicError: ('The following error happened while compiling the node', Gemm_Cublas(GpuContiguous.0, GpuContiguous.0), '\n', 'cuModuleGetFunction failed: named symbol not found')

Python code:

mod = drv.module_from_file('kernel.cubin')
gemm_kernel = mod.get_function("invokeDeviceCublasGemmEx")

I think get_function() expects the name of a function declared with the __global__ keyword, so I cannot call a host-code function from PyCUDA. And my use case is calling a host function that implements the cuBLAS APIs from Theano, which doesn’t work with ctypes as it takes a CudaNdarray …

I don’t know how to proceed now!

Can you please tell me the time frame for a bug fix? We need this fixed ASAP!

I don’t have a time frame for a bug fix. This is not a bug but expected behavior at this time. The cublasGemmEx function is not supported in device code (the docs should be updated to reflect that - agreed).

I’m pretty certain that the method I outlined here:

https://devtalk.nvidia.com/default/topic/1024278/cuda-programming-and-performance/can-a-cuda-kernel-call-cublas-function-or-how-to-call-a-cublas-functions-from-python-/post/5210974/#5210974

can be used to work around this. You can pass either host or device pointers to a wrapper function using ctypes. It’s understood that you can’t use a CudaNdarray with this; you will need to organize your data differently (a sketch of the wrapper follows the link below).

http://deeplearning.net/software/theano/tutorial/gpu_data_convert.html
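
For what it’s worth, here is a minimal sketch of that wrapper approach, assuming hypothetical file and function names (gemmwrap.cu, gemm_ex_wrapper) and with error handling trimmed: compile the host function into a shared library and call it from Python via ctypes, passing the device pointers as integers.

// gemmwrap.cu (hypothetical) -- build with something like:
//   nvcc -Xcompiler -fPIC -shared -o libgemmwrap.so gemmwrap.cu -lcublas
// Python side (sketch):
//   lib = ctypes.cdll.LoadLibrary("./libgemmwrap.so")
//   lib.gemm_ex_wrapper(ctypes.c_void_p(dA), ctypes.c_void_p(dB),
//                       ctypes.c_void_p(dC), m, n, k)
#include <cublas_v2.h>

extern "C" int gemm_ex_wrapper(const void* A, const void* B, void* C,
                               int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int alpha = 1;
    int beta  = 0;

    // Same int8 GEMM as in the kernel above, issued from host code
    cublasStatus_t status =
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     k, m, n,
                     &alpha, B, CUDA_R_8I,  k,
                             A, CUDA_R_8I,  n,
                     &beta,  C, CUDA_R_32I, k,
                     CUDA_R_32I, CUBLAS_GEMM_DFALT);

    cublasDestroy(handle);
    return (int)status;  // 0 == CUBLAS_STATUS_SUCCESS
}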