Combining cuBlas and Kernel code

We are developing an adaptive filter algorithm to run on the GPU, and we need to create a Toeplitz matrix from a subsection of a data vector on the device. With the current cuBLAS functions, we need to write kernel code to do this efficiently. We have assumed that the pointer to GPU memory that cublasAlloc() returns can be manipulated in a kernel function just like a pointer returned by cudaMalloc(). We tested this theory by:

  1. Using cublasAlloc() and cublasSetMatrix() to set a matrix on the device to a known set of values.

  2. Calling our kernel routine to manipulate the matrix on the device.

  3. Using cublasGetMatrix() to return the matrix to the host and comparing it with our “gold” result.

This test runs successfully on the graphics card, so we are assuming that the pointer cublasAlloc() returns refers to the same type of memory that cudaMalloc() would return. Our concern is that the cuBLAS documentation does not say it is legal to manipulate the cuBLAS matrix pointer in this way, and that this may stop working in a later cuBLAS version if the underlying allocation scheme changes. Does anyone have any info on this subject? Thanks!
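For readers who want something concrete, here is a minimal sketch of a kernel that fills a Toeplitz matrix from a slice of a device vector. The layout is an assumption, not the poster's actual code: the matrix is column-major (as cuBLAS expects) with T(i, j) = data[start + (cols - 1) + i - j]. The names fillToeplitz and toeplitzOnDevice are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed layout: T is rows x cols, stored column-major as cuBLAS expects,
// with T(i, j) = data[start + (cols - 1) + i - j], so every diagonal is constant.
__global__ void fillToeplitz(float *T, const float *data, int start, int rows, int cols)
{
    int stride = blockDim.x * gridDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < rows * cols; idx += stride) {
        int i = idx % rows;   // row index
        int j = idx / rows;   // column index
        T[idx] = data[start + (cols - 1) + i - j];
    }
}

// Hypothetical host helper: builds the matrix on the device and copies it back.
// The caller must ensure data.size() >= start + rows + cols - 1.
// Returns an empty vector if any CUDA call fails.
std::vector<float> toeplitzOnDevice(const std::vector<float> &data, int start, int rows, int cols)
{
    std::vector<float> result(rows * cols);
    float *dData = nullptr, *dT = nullptr;
    bool ok = cudaMalloc(&dData, data.size() * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dT, result.size() * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dData, data.data(), data.size() * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        fillToeplitz<<<32, 128>>>(dT, dData, start, rows, cols);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result.data(), dT, result.size() * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dData);
    cudaFree(dT);
    if (!ok) result.clear();
    return result;
}
```

The device pointer dT could then be passed straight to a cuBLAS routine with a leading dimension of rows.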


What you are doing is legal. cublasAlloc returns a device pointer that can be used in kernel functions.

cublasAlloc calls cudaMalloc under the hood; it exists to give access to device memory from regular C or Fortran.

We will add a note to the documentation.
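The three-step test described in the question can be sketched roughly as follows. This is an illustration under assumptions, not the original poster's code: it uses cudaMalloc directly (which, per the reply above, is what cublasAlloc wraps) together with the cublas_v2.h header, and the kernel scaleByTwo and helper roundTripMatchesGold are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical kernel: doubles every element of a column-major matrix in place.
__global__ void scaleByTwo(float *a, int rows, int cols, int lda)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < rows * cols; i += stride) {
        int r = i % rows, c = i / rows;
        a[r + c * lda] *= 2.0f;
    }
}

// Returns true when the matrix read back from the device matches the host "gold" result.
bool roundTripMatchesGold(int rows, int cols)
{
    const int n = rows * cols;
    std::vector<float> host(n), gold(n), back(n, 0.0f);
    for (int i = 0; i < n; ++i) { host[i] = float(i); gold[i] = 2.0f * float(i); }

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    float *dev = nullptr;
    bool ok = cudaMalloc(&dev, n * sizeof(float)) == cudaSuccess
           // Step 1: set the device matrix to known values through cuBLAS.
           && cublasSetMatrix(rows, cols, sizeof(float), host.data(), rows, dev, rows)
                  == CUBLAS_STATUS_SUCCESS;
    if (ok) {
        // Step 2: manipulate the same device pointer in a custom kernel.
        scaleByTwo<<<32, 128>>>(dev, rows, cols, rows);
        // Step 3: read the matrix back through cuBLAS.
        ok = cudaGetLastError() == cudaSuccess
          && cublasGetMatrix(rows, cols, sizeof(float), dev, rows, back.data(), rows)
                 == CUBLAS_STATUS_SUCCESS;
    }
    cudaFree(dev);
    cublasDestroy(handle);
    return ok && back == gold;
}
```

The values are small integers, so the comparison with the gold result can be exact here; real data would normally need a tolerance.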

I’m doing something similar for 2D convolution. After copying to the device, I use cuFFT to compute the forward FFT, run a custom kernel to do the complex multiply, then use cuFFT to compute the inverse FFT. Everything stays on the device. It works just fine.
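A rough sketch of that forward FFT, complex multiply, inverse FFT pipeline is shown below. It assumes complex single-precision data of equal size for signal and filter and computes a circular convolution; the poster's actual padding and data types are not known. The names complexMulScale and convolve2D are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Pointwise complex multiply: a[i] = a[i] * b[i] * scale.
__global__ void complexMulScale(cufftComplex *a, const cufftComplex *b, int n, float scale)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        cufftComplex x = a[i], y = b[i];
        a[i].x = (x.x * y.x - x.y * y.y) * scale;
        a[i].y = (x.x * y.y + x.y * y.x) * scale;
    }
}

// Circular 2D convolution of two nx-by-ny complex arrays already on the device.
// The result overwrites dSignal; dFilter is overwritten with its transform.
bool convolve2D(cufftComplex *dSignal, cufftComplex *dFilter, int nx, int ny)
{
    cufftHandle plan;
    if (cufftPlan2d(&plan, nx, ny, CUFFT_C2C) != CUFFT_SUCCESS) return false;

    bool ok = cufftExecC2C(plan, dSignal, dSignal, CUFFT_FORWARD) == CUFFT_SUCCESS
           && cufftExecC2C(plan, dFilter, dFilter, CUFFT_FORWARD) == CUFFT_SUCCESS;
    if (ok) {
        // cuFFT transforms are unnormalized, so scale by 1 / (nx * ny) here.
        complexMulScale<<<32, 128>>>(dSignal, dFilter, nx * ny, 1.0f / float(nx * ny));
        ok = cudaGetLastError() == cudaSuccess
          && cufftExecC2C(plan, dSignal, dSignal, CUFFT_INVERSE) == CUFFT_SUCCESS;
    }
    cufftDestroy(plan);
    return ok && cudaDeviceSynchronize() == cudaSuccess;
}
```

Everything runs on the default stream, so the transforms and the kernel execute in order without explicit synchronization between them.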

The new fluidsGL and fluidsD3D examples demonstrate this. Most of the code is straight CUDA kernels, with a couple of calls to CUFFT to transform to and from the frequency domain for the projection step of the stable fluids algorithm.



I am trying to create an iterative sparse solver that requires the cuSPARSE and cuBLAS libraries as well as kernel launches. I looked for similar cases but without success.
Inside a while loop I have the following:




All the functions work individually, but not in this sequence. I tried calling cudaThreadSynchronize() before and after the kernel launch, but that didn't help. Is there a synchronization problem? In the second cuSPARSE matrix-vector multiplication I get values of 1.#QNAN.


Simão Pereira

At the pseudocode level, there is nothing wrong with the idea and there should be no synchronization required to make it work.

Thanks. I think I solved the problem.

Maybe the error was arising from my kernel implementation. My arrays use one-based indexing. I added the condition tid > 0 to the while loop, and the problem seems to be solved.

__global__ void GetSolution(precision *A_dev, precision *x_dev, precision *Ax_dev, int *ElemDiag_dev)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;

	while (tid <= NElem_dev && tid > 0)
	{
		int posDiag = ElemDiag_dev[tid];

		tid += blockDim.x * gridDim.x;
	}
}


Simão Pereira
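One thing to watch with the tid > 0 condition: the thread whose global index is 0 leaves the loop immediately. If the number of elements can exceed the total number of launched threads, the indices that thread would have reached on later iterations (multiples of blockDim.x * gridDim.x) are never processed. A possible alternative, sketched here with hypothetical names (markVisited, coversAllOneBased), is to start every thread at its global index plus one:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: visits indices 1..n of a one-based array exactly once.
// Starting at (global thread index + 1) keeps thread 0 in the loop, so the
// indices it owns (1, 1 + stride, 1 + 2 * stride, ...) are not skipped.
__global__ void markVisited(int *visited, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x + 1; i <= n; i += stride)
        visited[i] += 1;   // slot 0 is unused padding
}

// Hypothetical host helper: returns true when every index in 1..n was visited
// exactly once and slot 0 was left untouched.
bool coversAllOneBased(int n, int blocks, int threads)
{
    std::vector<int> host(n + 1, 0);
    int *dev = nullptr;
    bool ok = cudaMalloc(&dev, host.size() * sizeof(int)) == cudaSuccess
           && cudaMemset(dev, 0, host.size() * sizeof(int)) == cudaSuccess;
    if (ok) {
        markVisited<<<blocks, threads>>>(dev, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(host.data(), dev, host.size() * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    if (!ok || host[0] != 0) return false;
    for (int i = 1; i <= n; ++i)
        if (host[i] != 1) return false;
    return true;
}
```

Calling the helper with far fewer threads than elements, for example coversAllOneBased(1000, 2, 32), exercises the case where each thread makes several passes through the loop.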


Following the pseudocode I described before, I have been trying to use the cuSPARSE library with one-based indexing. I get the correct result if I use zero-based indexing. To test cusparseScsrmv with one-based indexing I used the code attached below.
I did this simple test using similar code, and it works with zero-based indexing.
Can somebody please help me figure out what mistake I am making?

I attach the code. Thanks.


Could anybody help, please?

Has anyone used the cuSPARSE matrix-vector multiplication function with one-based indexing?
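Since the attached code is not available, the following is only a reference sketch that may help when debugging this kind of problem. It is a plain CUDA kernel, not cuSPARSE: it computes y = A * x for a CSR matrix whose row pointers and column indices are both stored in the given base, while x and y stay ordinary arrays. The names csrMatVecRef, toDevice and runExample are hypothetical. On the cuSPARSE side, one thing worth checking is that the matrix descriptor has been set with cusparseSetMatIndexBase and CUSPARSE_INDEX_BASE_ONE.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Reference CSR matrix-vector product, one row per loop iteration.
// 'base' is 0 or 1 and applies to BOTH rowPtr and colInd.
__global__ void csrMatVecRef(int rows, int base, const int *rowPtr, const int *colInd,
                             const float *val, const float *x, float *y)
{
    int stride = blockDim.x * gridDim.x;
    for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < rows; r += stride) {
        float sum = 0.0f;
        for (int k = rowPtr[r] - base; k < rowPtr[r + 1] - base; ++k)
            sum += val[k] * x[colInd[k] - base];
        y[r] = sum;
    }
}

// Copies a host vector to the device (error checking omitted for brevity).
template <typename T>
T *toDevice(const std::vector<T> &v)
{
    T *d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(T));
    cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// y = A * x for A = [1 0 2; 0 3 0; 4 0 5] and x = (1, 2, 3),
// with the CSR index arrays shifted to the requested base.
// Returns an empty vector if the kernel or the copy back fails.
std::vector<float> runExample(int base)
{
    std::vector<int> rowPtr = {0, 2, 3, 5}, colInd = {0, 2, 1, 0, 2};
    for (int &p : rowPtr) p += base;
    for (int &c : colInd) c += base;
    std::vector<float> val = {1, 2, 3, 4, 5}, x = {1, 2, 3}, y(3, 0.0f);

    int *dRow = toDevice(rowPtr), *dCol = toDevice(colInd);
    float *dVal = toDevice(val), *dX = toDevice(x), *dY = toDevice(y);

    csrMatVecRef<<<1, 32>>>(3, base, dRow, dCol, dVal, dX, dY);
    bool ok = cudaGetLastError() == cudaSuccess
           && cudaMemcpy(y.data(), dY, y.size() * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dRow); cudaFree(dCol); cudaFree(dVal); cudaFree(dX); cudaFree(dY);
    if (!ok) y.clear();
    return y;
}
```

Both runExample(0) and runExample(1) should give the same vector, which makes it a convenient baseline to compare a library result against.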


Hi everyone,

Can I call a cuBLAS function inside a kernel?

Thanks & Regards,

No, you cannot.

Thanks, Avidday.

I have also tried that in my project. Unfortunately, no, it won’t compile. It is a pity that the accelerated cuBLAS functions cannot be used inside a kernel.

This is a very old thread. After 2011, the ability to call cuBLAS functions from kernel code was added, and there are CUDA sample projects that demonstrate this, such as simpleDevLibCUBLAS, which uses CUDA dynamic parallelism.