CUBLAS 3.1 supports using streams to overlap computation with communication.
However, some BLAS functions store their result in CPU memory rather than in GPU device memory.
cublasSdot is an example: it computes the dot product of two vectors and stores the resulting scalar in CPU memory.
In this case, I guess the cublasSdot call is effectively "synchronous": since the scalar result must be copied back to the host before the function can return, the caller is blocked until the kernel execution is done.
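To make the pattern concrete, here is a sketch against the CUBLAS 3.1 legacy API (error checking omitted; the wrapper function name is mine, not part of CUBLAS):

```c
// Sketch of the blocking pattern (CUBLAS 3.1 legacy API).
// cublasSdot returns its scalar result by value, so the host must wait
// for the device-to-host copy of that scalar -- the call cannot return
// before the reduction kernel has finished, even when a stream has been
// set with cublasSetKernelStream.
#include <cublas.h>

float blocking_dot(const float *d_x, const float *d_y, int n)
{
    // Launches the dot-product kernel and copies the scalar back to
    // the host; control does not return until the result is ready.
    return cublasSdot(n, d_x, 1, d_y, 1);
}
```

Because the result lands in host memory, there is no way to keep this call in flight on a stream and consume the result from a later kernel without a round trip through the CPU.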
In the link below, an NVIDIA employee mentioned that they might add another version of this operation that stores the result in device memory.
Does anyone know the current status of this?
I couldn't find anything in the CUBLAS 3.1 API that resolves the problem.
Does CUBLAS 3.1 still suffer from not fully supporting asynchronous CUDA kernel launches?