I am working on a MATLAB algorithm that is solved on a GPU using CUBLAS. Almost everything works, but I need to implement some matrix computations that are not implemented in CUBLAS. So my question is:

How do I do this? I suppose that I must mix CUDA and CUBLAS, but I don't know how.

You can mix launches of your own kernels with CUBLAS calls in the same program. Write a separate kernel for the matrix computations you need and use CUBLAS for everything else. Yes, you can pass data and pointers between your kernel and a CUBLAS function. Just be careful about where the CUBLAS function returns its values. For example, cublasSdot returns a float to the CPU. If you try to write that output to a location on the GPU you will get errors.
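As a rough illustration of mixing the two, here is a minimal sketch that runs a hand-written kernel and then calls cublasSdot on the same device buffers. It assumes the newer handle-based API from cublas_v2.h rather than the API current when this thread was written; there, cublasSdot writes its result through a pointer, which refers to host memory by default. The kernel `square_elements` and the helper `cube_sum` are hypothetical names, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical custom kernel: y[i] = x[i] * x[i].
__global__ void square_elements(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] * x[i];
}

// Runs the custom kernel, then a cuBLAS dot product on the same device
// pointers. Returns sum_i x[i] * (x[i] * x[i]). Assumes a non-empty input.
float cube_sum(const std::vector<float> &h_x)
{
    int n = static_cast<int>(h_x.size());
    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    square_elements<<<blocks, threads>>>(d_x, d_y, n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // 'result' lives on the host: with the default pointer mode, cuBLAS
    // expects a host address here, not a device address.
    float result = 0.0f;
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return result;
}

int main()
{
    std::vector<float> x = {1.0f, 2.0f, 3.0f};
    printf("%g\n", cube_sum(x));  // 1*1 + 2*4 + 3*9 = 36
    return 0;
}
```

In the handle-based API the result can also be left on the GPU by switching the handle to CUBLAS_POINTER_MODE_DEVICE with cublasSetPointerMode, in which case the last argument must be a device pointer instead.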

Do you know where I can see code that does this? For example, it would be good to have the CUBLAS sources… My idea is to write the kernel in a similar way to CUBLAS, to keep the good performance of CUBLAS.

The CUBLAS source code was available for version 1.1; you can search the forums for it. I am not sure if the latest version's source code is available. IMO you do not need to worry about coding it like CUBLAS. Just write the kernel so that it uses the memory bandwidth efficiently and launches enough threads to keep all multiprocessors busy. If you need further optimization you can look at Volkov's papers. I would say it is better to write the code in your own way instead of following the CUBLAS source structure.
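To make that advice a little more concrete, here is a minimal sketch of an element-wise kernel written with those two points in mind: neighbouring threads touch neighbouring addresses, so loads and stores coalesce, and the grid is sized from the number of multiprocessors. The names `divide_elements` and `divide_on_gpu` are made up for this example, and the factor of 4 blocks per multiprocessor is only a starting guess, not a tuned value.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// out[i] = a[i] / b[i], using a grid-stride loop so a fixed-size grid
// can cover any n. Thread k of a warp reads element base + k, so each
// warp accesses one contiguous chunk of memory.
__global__ void divide_elements(const float *a, const float *b, float *out, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = a[i] / b[i];
}

// Host wrapper (hypothetical): copies inputs over, runs the kernel and
// copies the quotient back. Assumes a and b have the same length.
std::vector<float> divide_on_gpu(const std::vector<float> &a, const std::vector<float> &b)
{
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Size the grid from the hardware so every multiprocessor gets work.
    int device = 0, sm_count = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
    if (sm_count < 1) sm_count = 1;  // fallback if the query failed
    int threads = 256;
    int blocks = 4 * sm_count;       // heuristic, adjust after profiling

    divide_elements<<<blocks, threads>>>(d_a, d_b, d_out, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}

int main()
{
    std::vector<float> q = divide_on_gpu({6.0f, 9.0f, 1.0f}, {2.0f, 3.0f, 4.0f});
    printf("%g %g %g\n", q[0], q[1], q[2]);  // 3 3 0.25
    return 0;
}
```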

Expanding upon this, since your question seems to be answered already…

When using CUBLAS, the vector or matrix is copied to GPU memory and then remains there until you free the device memory and shut down CUBLAS. If my matrix is put on the GPU using CUBLAS, can I access that matrix from a CUDA kernel just through the data pointer that is used for the CUBLAS calls? Or is there extra formatting stored with the matrix when using CUBLAS?

For instance, if I wanted to:

allocate a vector and matrix and store them to the GPU memory with CUBLAS

multiply them with a kernel that I write myself

read matrix from GPU memory back to host using CUBLAS

You can do that. The CUBLAS memory management functions (alloc, free, set, get) are just wrapper functions for the standard CUDA runtime API equivalents, and “CUBLAS pointers” are just regular GPU global memory pointers. There isn’t anything special about them. Almost without exception, the runtime API and your own kernels can be used interchangeably with CUBLAS functions.
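A minimal sketch of that three-step workflow, under a few assumptions: it uses the handle-based cublas_v2.h API, where device memory comes from plain cudaMalloc and the transfers go through cublasSetMatrix, cublasSetVector and cublasGetVector; the matrix is stored column-major, as CUBLAS expects; and `matvec_colmajor` and `matvec_mixed` are hypothetical names. Error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hand-written y = A * x for a column-major A (rows x cols), one thread
// per row. Element (r, j) is stored at A[r + j * rows].
__global__ void matvec_colmajor(const float *A, const float *x, float *y,
                                int rows, int cols)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < cols; ++j)
        sum += A[r + j * rows] * x[j];
    y[r] = sum;
}

// Upload with cuBLAS helpers, compute with our own kernel, download
// with a cuBLAS helper. The device pointers are ordinary CUDA pointers.
std::vector<float> matvec_mixed(const std::vector<float> &A,
                                const std::vector<float> &x,
                                int rows, int cols)
{
    float *d_A = nullptr, *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_A, rows * cols * sizeof(float));
    cudaMalloc(&d_x, cols * sizeof(float));
    cudaMalloc(&d_y, rows * sizeof(float));

    // Step 1: store the matrix and vector on the GPU through cuBLAS.
    cublasSetMatrix(rows, cols, sizeof(float), A.data(), rows, d_A, rows);
    cublasSetVector(cols, sizeof(float), x.data(), 1, d_x, 1);

    // Step 2: multiply them with a kernel of our own.
    int threads = 128;
    int blocks = (rows + threads - 1) / threads;
    matvec_colmajor<<<blocks, threads>>>(d_A, d_x, d_y, rows, cols);

    // Step 3: read the result back to the host through cuBLAS.
    std::vector<float> y(rows);
    cublasGetVector(rows, sizeof(float), d_y, 1, y.data(), 1);

    cudaFree(d_A);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}

int main()
{
    // A = [[1, 2], [3, 4]] in column-major order, x = [1, 1].
    std::vector<float> y = matvec_mixed({1.0f, 3.0f, 2.0f, 4.0f}, {1.0f, 1.0f}, 2, 2);
    printf("%g %g\n", y[0], y[1]);  // 3 7
    return 0;
}
```

The one thing to keep in mind when writing your own kernel against data that CUBLAS also uses is the column-major layout; the pointer itself carries no extra formatting.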

The CUBLAS source is restricted to registered NVIDIA developers (and the current source release is rather out of date anyway). The current CUBLAS sgemm() implementation is, as I understand it, a wrapper around kernel code written by Vasily Volkov from Berkeley. That might be enough to get going with.

I have a similar problem to the original poster's. My code is supposed to solve an equation iteratively by calling an update function until convergence is achieved. The update function involves the following:

1. an element-wise vector-vector division: dummy[i]=cr[i]/m1[i]

2. a matrix-vector multiplication: ck[i]=sum_along_j(H[i][j]*dummy[j])

3. an element-wise vector-vector multiplication: ck[i]=k[i]*m2[i]

4. a new vector is calculated: dk[i]=k[i]*k[i]/(1-k[i])

5. an element-wise vector-vector division: kdummy[i]=dk[i]/21[i]

6. a matrix-vector multiplication: dr[i]=sum_along_j(H[i][j]*kdummy[j])

7. an element-wise vector-vector multiplication: dr[i]=dr[i]*m1[i]

8. newc[i]=some function of dr[i]

Does this imply that I have to do something like this? Define several kernels and call them in combination with the CUBLAS functions?

kernel 1

cublas

kernel 2

kernel 3

kernel 4

cublas

kernel 2

kernel 5

Steps 1-8 have to be executed about 10k-20k times to achieve convergence. Is it possible to have 5 kernels and call them together with the CUBLAS matrix-vector multiplication commands over and over without having to transfer any data to the host?
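For what it is worth, the loop structure being asked about can be sketched as below. This is only an illustration under stated assumptions, not the poster's actual update: it covers just the first three steps (element-wise division, matrix-vector product, element-wise multiplication), and writing the step-3 result back into cr so that the loop has something to iterate on is a made-up stand-in for the remaining steps. It uses the handle-based cublas_v2.h API; `iterate`, `to_device` and the kernel names are hypothetical, and error checking is omitted. All buffers stay on the GPU for the whole loop, and only the final vector is copied back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

__global__ void divide_elements(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] / b[i];
}

__global__ void multiply_elements(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] * b[i];
}

// Helper: allocate a device buffer and fill it from a host vector.
static float *to_device(const std::vector<float> &h)
{
    float *d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// H is n x n, stored row-major as H[i][j] -> H[i * n + j].
std::vector<float> iterate(const std::vector<float> &H, std::vector<float> cr,
                           const std::vector<float> &m1,
                           const std::vector<float> &m2, int iterations)
{
    int n = static_cast<int>(cr.size());
    float *d_H = to_device(H), *d_cr = to_device(cr);
    float *d_m1 = to_device(m1), *d_m2 = to_device(m2);
    float *d_dummy = nullptr, *d_ck = nullptr;
    cudaMalloc(&d_dummy, n * sizeof(float));
    cudaMalloc(&d_ck, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    int threads = 256, blocks = (n + threads - 1) / threads;

    for (int it = 0; it < iterations; ++it) {
        // Step 1: dummy[i] = cr[i] / m1[i]
        divide_elements<<<blocks, threads>>>(d_cr, d_m1, d_dummy, n);
        // Step 2: ck = H * dummy. cuBLAS is column-major, so it sees the
        // row-major H as its transpose; CUBLAS_OP_T undoes that.
        cublasSgemv(handle, CUBLAS_OP_T, n, n, &alpha, d_H, n,
                    d_dummy, 1, &beta, d_ck, 1);
        // Step 3 plus the hypothetical feedback: cr[i] = ck[i] * m2[i]
        multiply_elements<<<blocks, threads>>>(d_ck, d_m2, d_cr, n);
    }

    // The only device-to-host transfer happens after the loop.
    cudaMemcpy(cr.data(), d_cr, n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_H); cudaFree(d_cr); cudaFree(d_m1); cudaFree(d_m2);
    cudaFree(d_dummy); cudaFree(d_ck);
    return cr;
}

int main()
{
    std::vector<float> H = {1.0f, 2.0f, 3.0f, 4.0f};  // [[1, 2], [3, 4]]
    std::vector<float> r = iterate(H, {2.0f, 4.0f}, {2.0f, 2.0f}, {0.5f, 0.5f}, 2);
    printf("%g %g\n", r[0], r[1]);  // 3.375 7.375
    return 0;
}
```

A convergence test would need some value on the host eventually, but that can be a single scalar (for example a norm computed with cublasSnrm2) fetched every so many iterations rather than the full vectors on every pass.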