cublasGemmEx() with __dp4a() can't be called within a kernel, whereas cublasSgemm() can be called

Hello,

cublasGemmEx() with __dp4a() can’t be called from a kernel declared with the __global__ keyword; it gives an undefined symbol error, whereas in the same file cublasSgemm() can be called successfully.

Is dynamic parallelism supported for cublasGemmEx() or not?

Thanks

Makefile:

/usr/local/cuda-8.0/bin/nvcc -I/usr/local/cuda-8.0/samples/common/inc/ -O3 -std=c++11 --cubin --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 kernel.cu -link -L/usr/local/cuda-8.0/lib64 -lcudadevrt -lcublas -lcublas_device -dlink -o kernel.cubin

Linking error:

/usr/local/cuda-8.0/bin/nvcc -I/usr/local/cuda-8.0/samples/common/inc/ -O3 -std=c++11 --cubin --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 binary_kernels.cu -link -L/usr/local/cuda-8.0/lib64 -lcudadevrt -lcublas -lcublas_device -dlink -o kernel.cubin
ptxas info : ‘device-function-maxrregcount’ is a BETA feature
(the line above is repeated 31 times in the build output)
nvlink error : Undefined reference to ‘cublasGemmEx’ in ‘/tmp/tmpxft_000027aa_00000000-15_binary_kernels.o’
makefile:2: recipe for target ‘all’ failed
make: *** [all] Error 255

File: kernel.cu

#include <cublas_v2.h>

extern "C" {
__global__ void invokeDeviceCublasSgemm(float* A, float* B, float* C, int m, int n, int k)
{
    cublasHandle_t cnpHandle;
    cublasStatus_t status = cublasCreate(&cnpHandle);

    float alpha = 1.0;
    float beta = 0.0;

    /* Perform operation using cublas */
    status = cublasSgemm(cnpHandle,
                         CUBLAS_OP_N, CUBLAS_OP_N,
                         k, m, n,
                         &alpha,
                         B, k,
                         A, n,
                         &beta,
                         C, k);

    cublasDestroy(cnpHandle);
}
}
extern "C" {
__global__ void invokeDeviceCublasGemmEx(char* A, char* B, int* C, int m, int n, int k)
{
    cublasHandle_t cnpHandle;
    cublasStatus_t status = cublasCreate(&cnpHandle);

    int alpha = 1;
    int beta = 0;

    /* Perform operation using cublas */
    status = cublasGemmEx(cnpHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                          k, m, n,
                          &alpha, B, CUDA_R_8I, k,
                                  A, CUDA_R_8I, n,
                          &beta,  C, CUDA_R_32I, k,
                          CUDA_R_32I, CUBLAS_GEMM_DFALT); // colE x rowD

    cublasDestroy(cnpHandle);
}
}

Update:

The undefined symbol above is present in the static library below:

/usr/local/cuda-8.0/lib64/libcublas_static.a
00000000000013a0 T cublasGemmEx

However, changing the Makefile as below to include this library (-lcublas_static) still gives the same error:

/usr/local/cuda-8.0/bin/nvcc -I/usr/local/cuda-8.0/samples/common/inc/ -O3 -std=c++11 --cubin --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 kernel.cu -link -L/usr/local/cuda-8.0/lib64 -lcublas_static -lcudadevrt -lcublas -lcublas_device -dlink -o kernel.cubin

Yes, it appears to be a problem. I have filed an (internal) bug with NVIDIA.
You’re welcome to file a bug too. I see you already know how.

As an immediate workaround, I suggest calling cublasGemmEx from host code rather than from device code.
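
For example, a minimal host-side sketch of the same int8 GEMM (the function name hostInvokeCublasGemmEx is just illustrative, and error checking is trimmed):

#include <cublas_v2.h>

void hostInvokeCublasGemmEx(const char* A, const char* B, int* C, int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int alpha = 1;
    int beta  = 0;

    // A and B are device pointers to int8 data, C to int32,
    // with the same dimensions as in the device-side version above
    cublasStatus_t status =
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     k, m, n,
                     &alpha, B, CUDA_R_8I,  k,
                             A, CUDA_R_8I,  n,
                     &beta,  C, CUDA_R_32I, k,
                     CUDA_R_32I, CUBLAS_GEMM_DFALT);

    cublasDestroy(handle);
}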

Hello txbob,

I can’t find a workaround that solves my use case: I am calling the above function in the .cu file from PyCUDA, but this

gemm_kernel = mod.get_function("invokeDeviceCublasGemmEx")

fails with:

pycuda._driver.LogicError: ('The following error happened while compiling the node', Gemm_Cublas(GpuContiguous.0, GpuContiguous.0), '\n', 'cuModuleGetFunction failed: named symbol not found')

Python code:

mod = drv.module_from_file('kernel.cubin')
gemm_kernel = mod.get_function("invokeDeviceCublasGemmEx")

I think get_function() expects the name of a function declared with the __global__ keyword, so I cannot call a host-code function from PyCUDA. And my use case is calling a host function that implements the cuBLAS APIs from Theano, which doesn’t work with ctypes as it takes a CudaNdarray …

I don’t know how to proceed now!

Can you please tell me the time frame for a bug fix? We need this fixed ASAP!

I don’t have a time frame for a bug fix. This is not a bug but expected behavior at this time. The cublasGemmEx function is not supported in device code (the docs should be updated to reflect that - agreed).

I’m pretty certain that the method I outlined here:

https://devtalk.nvidia.com/default/topic/1024278/cuda-programming-and-performance/can-a-cuda-kernel-call-cublas-function-or-how-to-call-a-cublas-functions-from-python-/post/5210974/#5210974

can be used to work around this. You can pass either host or device pointers to a wrapper function using ctypes. It’s understood that you can’t use a CudaNdarray with this; you will need to organize your data differently (a sketch of the wrapper follows the link below).

http://deeplearning.net/software/theano/tutorial/gpu_data_convert.html
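
For what it’s worth, here is a minimal sketch of that wrapper approach, assuming hypothetical file and function names (gemmwrap.cu, gemm_ex_wrapper) and with error handling trimmed: compile the host function into a shared library and call it from Python via ctypes, passing the device pointers as integers.

// gemmwrap.cu (hypothetical) -- build with something like:
//   nvcc -Xcompiler -fPIC -shared -o libgemmwrap.so gemmwrap.cu -lcublas
// Python side (sketch):
//   lib = ctypes.cdll.LoadLibrary("./libgemmwrap.so")
//   lib.gemm_ex_wrapper(ctypes.c_void_p(dA), ctypes.c_void_p(dB),
//                       ctypes.c_void_p(dC), m, n, k)
#include <cublas_v2.h>

extern "C" int gemm_ex_wrapper(const void* A, const void* B, void* C,
                               int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    int alpha = 1;
    int beta  = 0;

    // Same int8 GEMM as in the kernel above, issued from host code
    cublasStatus_t status =
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     k, m, n,
                     &alpha, B, CUDA_R_8I,  k,
                             A, CUDA_R_8I,  n,
                     &beta,  C, CUDA_R_32I, k,
                     CUDA_R_32I, CUBLAS_GEMM_DFALT);

    cublasDestroy(handle);
    return (int)status;  // 0 == CUBLAS_STATUS_SUCCESS
}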