cublasSdot in CUBLAS 3.1 with streams

Hi all,

CUBLAS 3.1 supports streams for overlapping computation with communication.
However, some BLAS functions store their result in CPU (host) memory rather than in GPU device memory.
cublasSdot is one example: it computes a vector dot product and returns the resulting value in CPU memory.
In this case, I guess the cublasSdot call is effectively “synchronous”, blocking the caller until the kernel execution is done.

In the link below, an NVIDIA person mentioned that they might add another version of this operation that stores the result in device memory.
…t=#entry1042861

Does anyone know the current status of this?
I couldn’t find anything in the CUBLAS 3.1 API that resolves this problem.
Does CUBLAS 3.1 still lack full support for asynchronous CUDA kernel launches?



CUBLAS 3.1 supports streaming. But you are right: for some BLAS1 routines such as DOT or NRM2, you cannot really take advantage of it, because those routines return their results on the host.

Adding a device-return version is still in our plan: not in the next release, but definitely in the one after that.