Is mixing CUDA and CUBLAS possible? Is the CUDA source code available?


I am working on a MATLAB algorithm that is solved on a GPU using CUBLAS. Almost everything works, but I need to implement some matrix computations that CUBLAS does not provide. So my question is:

How do I do this? I suppose I must mix CUDA and CUBLAS, but I don't know how.

I think CUBLAS is implemented using CUDA.

Is there any way to get the CUDA source?

Many thanks in advance.


You can mix launches of your own kernels with CUBLAS function calls in the same program. Write a separate kernel that does the matrix computations you want and use CUBLAS for the other functions. Yes, you can pass data and pointers between your kernel and a CUBLAS function. Just be careful about where each CUBLAS function returns its values. For example, cublasSdot returns a float to the CPU. If you try to write that output to a location on the GPU, you will get errors.
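A minimal sketch of that pattern, assuming the handle-based API in cublas_v2.h rather than the legacy calls discussed in this thread. The kernel elementwiseDivide and the helper dotOfQuotient are made-up names for illustration. The kernel writes its result to device memory, and cublasSdot then reads that same device pointer and delivers the scalar to a host variable, which is the default behaviour (CUBLAS_POINTER_MODE_HOST).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical custom kernel for an operation CUBLAS does not offer:
// out[i] = a[i] / b[i]
__global__ void elementwiseDivide(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] / b[i];
}

// Hypothetical helper: divides two constant vectors element by element with
// the kernel above, then takes the dot product of the quotient with itself
// using CUBLAS. Error checking is omitted to keep the sketch short.
float dotOfQuotient(int n)
{
    std::vector<float> hA(n, 2.0f), hB(n, 4.0f);
    float *dA, *dB, *dOut;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Own kernel: the result stays in GPU memory.
    elementwiseDivide<<<(n + 255) / 256, 256>>>(dA, dB, dOut, n);

    // CUBLAS call on the same device pointer. With the default pointer mode
    // the result argument must be a host address, not a device address.
    cublasHandle_t handle;
    cublasCreate(&handle);
    float dot = 0.0f;
    cublasSdot(handle, n, dOut, 1, dOut, 1, &dot);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return dot;
}

int main()
{
    printf("dot = %f\n", dotOfQuotient(1024));
    return 0;
}
```

Something like nvcc example.cu -lcublas should build it. Both the kernel and the CUBLAS call run on the default stream here, so the dot product sees the finished kernel output without any explicit synchronisation.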

Thank you very much for your help, maringanti.

Do you know where I can see code that does this? For example, it would be good to have the CUBLAS sources. My idea is to write the kernel the way CUBLAS does, to keep the good performance of CUBLAS.


The CUBLAS source code was available for version 1.1; you can search the forums for it. I am not sure whether the latest version's source code is available. IMO you do not need to worry about coding it like CUBLAS. Just write the kernel so that it uses the memory bandwidth efficiently and launches enough threads to keep all multiprocessors busy. If you need further optimization, you can look at Volkov's papers. I would say it is better to write the code in your own way than to follow the CUBLAS source structure.
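As a rough illustration of those two points, here is a sketch of an elementwise kernel in which consecutive threads read consecutive addresses (so global memory accesses coalesce) and the grid is sized from the number of multiprocessors. The names scaleAdd and runScaleAdd are hypothetical, and the factor of 4 blocks per multiprocessor is only an assumed starting point to tune, not a recommendation from this thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop computing y[i] = a * x[i] + y[i]. Neighbouring threads
// access neighbouring elements, and any grid size covers all n elements.
__global__ void scaleAdd(const float* x, float* y, float a, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs scaleAdd on x = 1 and y = 2 and returns the last
// element of y. Error checking is omitted to keep the sketch short.
float runScaleAdd(int n, float a)
{
    std::vector<float> hX(n, 1.0f), hY(n, 2.0f);
    float *dX, *dY;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMemcpy(dX, hX.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, hY.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Size the grid from the device so every multiprocessor gets work.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int threads = 256;
    const int blocks = 4 * prop.multiProcessorCount;  // assumed heuristic

    scaleAdd<<<blocks, threads>>>(dX, dY, a, n);

    cudaMemcpy(hY.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    return hY[n - 1];
}

int main()
{
    printf("y[last] = %f\n", runScaleAdd(1 << 20, 3.0f));
    return 0;
}
```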


Thank you very much.


Expanding upon this, since your question seems to be answered already…

When using CUBLAS, the vector or matrix is copied to GPU memory and then remains there until you free the device memory and shut down CUBLAS. If my matrix is put on the GPU using CUBLAS, can I access that matrix from a CUDA kernel just through the data pointer that is used for CUBLAS calls? Or is extra formatting stored with the matrix when using CUBLAS?

For instance, if I wanted to:

  1. allocate a vector and matrix and store them to the GPU memory with CUBLAS
  2. multiply them with a kernel that I write myself
  3. read matrix from GPU memory back to host using CUBLAS

Is that possible and as easy as I would hope?


You can do that. The CUBLAS memory management functions (alloc, free, set, get) are just wrapper functions for the standard CUDA runtime API equivalents, and “CUBLAS pointers” are just regular GPU global memory pointers. There isn’t anything special about them. Almost without exception, the runtime API and your own kernels can be used interchangeably with CUBLAS functions.
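A sketch of the three steps asked about above, assuming the cublas_v2.h header: cublasSetMatrix and cublasSetVector upload the data, a hand-written kernel does the multiplication on the very same pointers, and cublasGetVector reads the result back. The kernel matVec and the helper multiplyWithOwnKernel are hypothetical names. The memory is allocated with plain cudaMalloc here, and the one layout detail to remember is that CUBLAS stores matrices in column-major order.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hand-written kernel: y = A * x for a column-major m x n matrix with
// leading dimension lda. One thread computes one row of the result.
__global__ void matVec(const float* A, const float* x, float* y, int m, int n, int lda)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row + j * lda] * x[j];
        y[row] = sum;
    }
}

// Hypothetical helper: upload with CUBLAS, multiply with our own kernel,
// download with CUBLAS. Error checking is omitted to keep the sketch short.
std::vector<float> multiplyWithOwnKernel(const std::vector<float>& A,
                                         const std::vector<float>& x,
                                         int m, int n)
{
    float *dA, *dX, *dY;
    cudaMalloc(&dA, m * n * sizeof(float));
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, m * sizeof(float));

    // Step 1: store the matrix and vector on the GPU through CUBLAS.
    cublasSetMatrix(m, n, sizeof(float), A.data(), m, dA, m);
    cublasSetVector(n, sizeof(float), x.data(), 1, dX, 1);

    // Step 2: the pointers are ordinary device pointers, so a kernel can use them.
    matVec<<<(m + 127) / 128, 128>>>(dA, dX, dY, m, n, m);

    // Step 3: read the result back to the host through CUBLAS.
    std::vector<float> y(m);
    cublasGetVector(m, sizeof(float), dY, 1, y.data(), 1);

    cudaFree(dA);
    cudaFree(dX);
    cudaFree(dY);
    return y;
}

int main()
{
    // 3 x 2 matrix in column-major order: columns (1,2,3) and (4,5,6).
    std::vector<float> A = {1, 2, 3, 4, 5, 6};
    std::vector<float> x = {1, 10};
    std::vector<float> y = multiplyWithOwnKernel(A, x, 3, 2);
    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}
```

The one-thread-per-row kernel is chosen for readability, not speed; for large matrices cublasSgemv would normally be the better choice for this particular operation.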

Thank you very much. For this reason I would like to see (if possible) a kernel that implements a CUBLAS call. Is this possible?

Thank you


The CUBLAS source is restricted to registered NVIDIA developers (and the current source release is rather out of date anyway). The current CUBLAS sgemm() implementation is (as I understand it) a wrapper around this kernel code written by Vasily Volkov from Berkeley. That might be enough to get going with.

Yes, this is what I was looking for…




I have a similar problem to that of the original poster. My code is supposed to solve an equation iteratively by calling an update function until convergence is achieved. The update function does the following:

  1. an element-wise vector-vector division dummy[i]=cr[i]/m1[i]

  2. a matrix-vector multiplication ck[i]=sum_along_j(H[i][j]*dummy[j])

  3. an element-wise vector-vector multiplication ck[i]=k[i]*m2[i]

  4. a new vector calculated as dk[i]=k[i]*k[i]/(1-k[i])

  5. an element-wise vector-vector division kdummy[i]=dk[i]/21[i]

  6. a matrix-vector multiplication dr[i]=sum_along_j(H[i][j]*kdummy[j])

  7. an element-wise vector-vector multiplication dr[i]=dr[i]*m1[i]

  8. newc[i]=some function of dr[i]

Does this imply that I have to do something like this? Define several kernels and call them in combination with the CUBLAS functions?

  1. kernel 1

  2. cublas

  3. kernel 2

  4. kernel 3

  5. kernel 4

  6. cublas

  7. kernel 2

  8. kernel 5

Steps 1-8 have to be executed about 10k-20k times to achieve convergence. Is it possible to have 5 kernels and call them together with the CUBLAS matrix-vector multiplication commands over and over without having to transfer any data to the host?
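Going by the earlier replies in this thread, that should work, since device pointers can be handed back and forth between kernels and CUBLAS. Below is a cut-down sketch, assuming the cublas_v2.h API, that repeats only three of the steps (divide, matrix-vector product, multiply) and not the full eight-step update. The kernel names, the helper runIterations, and the test data (identity H, m1 = 2, m2 = 1) are all made up for illustration. All vectors stay on the GPU for the whole loop and only the final result is copied back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// out[i] = a[i] / b[i]
__global__ void divideElements(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] / b[i];
}

// out[i] = a[i] * b[i]
__global__ void multiplyElements(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Hypothetical helper: repeats dummy = cr / m1, ck = H * dummy, cr = ck * m2
// for the given number of iterations and returns cr[0] afterwards.
// Error checking is omitted to keep the sketch short.
float runIterations(int n, int iterations)
{
    // Illustrative data only: H is the identity, so each pass halves cr.
    std::vector<float> hH(n * n, 0.0f), hCr(n, 1024.0f), hM1(n, 2.0f), hM2(n, 1.0f);
    for (int i = 0; i < n; ++i) hH[i * n + i] = 1.0f;

    float *dH, *dCr, *dM1, *dM2, *dDummy, *dCk;
    cudaMalloc(&dH, n * n * sizeof(float));
    cudaMalloc(&dCr, n * sizeof(float));
    cudaMalloc(&dM1, n * sizeof(float));
    cudaMalloc(&dM2, n * sizeof(float));
    cudaMalloc(&dDummy, n * sizeof(float));
    cudaMalloc(&dCk, n * sizeof(float));
    cudaMemcpy(dH, hH.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dCr, hCr.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dM1, hM1.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dM2, hM2.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // No host transfers inside the loop: every step reads and writes device memory.
    for (int it = 0; it < iterations; ++it) {
        divideElements<<<blocks, threads>>>(dCr, dM1, dDummy, n);
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dH, n, dDummy, 1, &beta, dCk, 1);
        multiplyElements<<<blocks, threads>>>(dCk, dM2, dCr, n);
    }

    // Copy the result back once, after the loop.
    cudaMemcpy(hCr.data(), dCr, n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dH); cudaFree(dCr); cudaFree(dM1);
    cudaFree(dM2); cudaFree(dDummy); cudaFree(dCk);
    return hCr[0];
}

int main()
{
    printf("cr[0] = %f\n", runIterations(4, 10));
    return 0;
}
```

A real convergence test would still need one small value on the host now and then, for example a norm copied back every few hundred iterations, but the vectors and the matrix themselves never have to leave the GPU.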