Asynchronous Parameter Passing how cuSPARSE/cuBLAS can do this?

Dear CUDA experts,
I was wondering if there’s a possiblity to know
how the new cuSparse and cuBlas libraries manage
the passing of scalar parameters by reference
remaining reentrant across threads/processes.

I would like to know just its mechanism, because
I’m working to provide the same asynchronous and reentrant support
(output from a function -> input in another function in the same stream)
to my library.

My first ideas involve critical sections, stacks of pointers and memory mapping…
but maybe there’s a simpler/smarter solution…

Thank you,

Kernels within the same stream execute in-order, so data dependencies between kernels in the same stream are not a problem. Before a kernel starts, all device data updates by a previous kernel in the same stream will have been completed. To correctly handle data dependencies between kernels in different streams, use explicit stream synchronization, for example via cudaEventRecord() and cudaEventSynchronize().

I hope this answers your question, because it seems like you may be talking about two separate issues (device data dependencies between CUDA kernels in streams, vs kernels issued from different host threads or processes).

Thank you for your response.

Sorry, I didn’t explain well my problem.

Kernel calls of the same stream are executed in a sequential fashion, it’s true,

the problems arise when you want to “schedule” (on the same stream) multiple kernel calls

that have intermediate results on the host side.

Quoting the 2.4 paragraph from the cublas guide (4.1):

Indeed, Cublas functions (in cublas_v2.h) like:

cublasSaxpy(cublasHandle_t handle, int n,const float *alpha,const float *x, int incx,float *y, int incy)

in the legacy version used to get scalar parameters like alpha as value and not by reference,

but in the new version they permit to schedule a kernel call on the same stream queue without explicit synchronization

(needed otherwise if you have to pass the value).

I would like to implement this feature also in my library for some functions,

but when “alpha” is a pointer to a host value this is not trivial as its seems.

When you pass a device pointer, the kernel will automatically get the device value on execution,

if you pass a host pointer you must provide a mechanism to let the kernel read the referenced value

only immediately before its execution (so that it will be filled by previous kernels).

Hope I clarified my problem.


I haven’t looked at the details of the new CUBLAS interface, but in general each CUBLAS function consists of a host-side function which invokes one or several kernels on the device. The host-side wrapper function executes synchronously, only the kernels on the device launch asynchronously.

Yeah, that’s what I know (I know some cuBLAS implementation basis since I’ve participated to the

implementation of the cuBLAS GEMM routine (CUBLAS guide, page 2) :) )

but I would like to know the details that make possible to do (quoting CUBLAS guide for new routines):

The question is: how cublas manages the host pointers of “alpha” and “beta” to read their value

asynchronously (they state that cuBLAS can do that…)?

A possible solution could be: since I’ve a host pointer I can add to the same stream a cudaMemcpyAsync from that pointer

to a constant device address, that will be filled immediately before the kernel call and so the kernel will read

from the constant memory its updated value…

BUT, problems arise if the application call this smart “wrapper” function for multiple streams

because a memcpy from other streams can overwrite the value in constant memory needed by the kernel

(I would need a different constant address for every different stream using my function).

Moreover, memcpyAsync needs pinned memory, and cublas doesn’t have such limitation on its parameters

We actually do not do it. In “Host pointer Mode”, we take the values of those pointers( alpha, beta) at the time of the launch.

And if the result has to be given to the Host ( like in cublasXnorm or CublasXdot ), the Cublas function is actually blocking ( it does a cudaMemcpy from Device to Host to get the result back )

Our asynchronous scheme will only work if you use the “Device Pointer Mode”

Typical sequence : alpha is computed by a kernel and then used as an input for a subsequent kernel



cublasDnorm (handle, n, v, incx, &alpha ); //alpha on the device

cublasDscal( handle, m, &alpha, w, incx);

In this case, with the Device Pointer Mode, the 2 kernels will be launched asynchronously in one shot and schedule sequentially on the GPU because they use the same stream. If Host pointer Mode had been used, cublasDnorm would have blocked the CPU (even if you use a non-NULL stream)

Thank you so much.

I didn’t realize this difference.

On second thoughts, in such scheme the programmer will always store ‘alpha’ or ‘beta’

in the device memory if he want an asynchronous behavior…