cuBLAS stream behaviour

Hi,

I have some questions about stream handling in cuBLAS, mostly related to cublasPointerMode_t. I wasn’t able to find the answers in the docs.

The most important question is: can I use multiple streams to queue asynchronous calls using host pointers?
For example, can I do this safely:

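// single-precision variants used for illustration
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST);
cublasSetStream(h, stream2);
cublasSrotg(h, &a, &b, &c, &s);        // computes c and s from a and b
cublasSrot(h, n, x, 1, y, 1, &c, &s);  // applies the rotation to x and y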

If I do this, will cublasrot use the values of c and s computed by cublasrotg, or is that not guaranteed?
If it’s possible, then how? It would mean kernels are somehow able to read host memory, so cublasrot would need either to issue a host-to-device copy or to allocate pinned memory. Both operations synchronize implicitly, which would defeat the whole idea of cuBLAS being asynchronous in host pointer mode.

Another, simpler question: can I use two library handles to simplify working with separate streams?

Thanks in advance

Hi again,

I see no one has answered this question since yesterday. Since this is a question about unclear documentation, is there some way I can ask NVIDIA directly?

It’s quite an important question - I’m porting a piece of financial software (which has odd code in some places) and moving to CUBLAS_POINTER_MODE_DEVICE would be troublesome to implement, so I need this resolved quickly.

Thanks

cublasrotg is a bit particular.
If you set the pointer mode to CUBLAS_POINTER_MODE_HOST, then cublasrotg is a pure CPU routine and is therefore synchronous. So in that case the stream does not really matter.

To answer your questions:

When I do it, then will cublasrot use values of c and s calculated in cublasrotg or is it not guaranteed?
[Phil] Yes, it is guaranteed, and the stream does not matter here.

If the pointer mode is CUBLAS_POINTER_MODE_DEVICE, then a, b, c and s MUST be in device memory, and in that case cublasrotg and cublasrot must be run within the same CUDA stream. From a performance point of view, this mode is preferable.

can I use 2 library handles to simplify using separate streams?
[Phil] Yes.

Thanks for answering, Phil.

I’m still not 100% convinced - I initially thought the same way as you - but then I came across a section of the docs stating:

Also, the few functions that return a scalar result, such as amax(), amin(), asum(), rotg(), rotmg(), dot() and nrm2(), return the resulting value by reference on the host or the device. Notice that even though these functions return immediately, similarly to matrix and vector results, the scalar result is ready only when execution of the routine on the GPU completes. This requires proper synchronization in order to read the result from the host.

This implies that cublasrotg is always executed on the GPU and therefore not completely synchronous. My question could be rephrased as “is argument fetching synchronous in those functions?”.

The section in the docs can be found here: http://docs.nvidia.com/cuda/cublas/index.html#topic_3_4