Dear CUDA experts,
I was wondering if there is a way to find out how the new cuSPARSE and cuBLAS libraries manage the passing of scalar parameters by reference while remaining reentrant across threads/processes.
I would just like to know the mechanism, because I'm working on providing the same asynchronous and reentrant support (output of one function → input of another function in the same stream) in my library.
My first ideas involve critical sections, stacks of pointers, and memory mapping… but maybe there's a simpler/smarter solution…
Kernels within the same stream execute in-order, so data dependencies between kernels in the same stream are not a problem. Before a kernel starts, all device data updates by a previous kernel in the same stream will have been completed. To correctly handle data dependencies between kernels in different streams, use explicit stream synchronization, for example via cudaEventRecord() and cudaEventSynchronize().
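Since cudaEventRecord() and cudaEventSynchronize() were mentioned, a minimal sketch of a cross-stream dependency handled that way might look as follows (producerKernel, consumerKernel, and d_data are hypothetical names, not part of any library):

dim3 grid(64), block(256);
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaEvent_t done;
cudaEventCreate(&done);

producerKernel<<<grid, block, 0, stream1>>>(d_data);  // writes d_data in stream1
cudaEventRecord(done, stream1);                       // mark the producer's completion

cudaEventSynchronize(done);                           // host blocks until the producer is done
// alternatively, without blocking the host:
// cudaStreamWaitEvent(stream2, done, 0);

consumerKernel<<<grid, block, 0, stream2>>>(d_data);  // safely reads d_data in stream2

cudaEventDestroy(done);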
I hope this answers your question, because it seems like you may be talking about two separate issues (device data dependencies between CUDA kernels in streams, vs kernels issued from different host threads or processes).
I haven’t looked at the details of the new CUBLAS interface, but in general each CUBLAS function consists of a host-side function which invokes one or more kernels on the device. The host-side wrapper executes synchronously; only the kernels on the device launch asynchronously.
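As an illustration of that general pattern (just a sketch of the idea, not the actual CUBLAS source; myScale and myScaleKernel are made-up names):

__global__ void myScaleKernel(int n, double alpha, double *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;   // the actual device work
}

void myScale(int n, double alpha, double *d_x, cudaStream_t stream)
{
    // The host wrapper runs synchronously: it queues the launch and returns.
    // The kernel itself executes asynchronously on the device in 'stream'.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    myScaleKernel<<<blocks, threads, 0, stream>>>(n, alpha, d_x);
}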
We actually do not do that. In host pointer mode, we take the values of those pointers (alpha, beta) at the time of the launch.
And if the result has to be returned to the host (as in cublasXnrm2 or cublasXdot), the CUBLAS function is actually blocking (it does a cudaMemcpy from device to host to get the result back).
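For instance, something along these lines (assuming handle is an existing CUBLAS handle and d_v a device array of n doubles):

double alpha;                               // host-side scalar
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
cublasDnrm2(handle, n, d_v, 1, &alpha);     // blocks: the result must be copied device-to-host
// alpha is valid here, but the CPU has stalled waiting for the GPU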
Our asynchronous scheme only works if you use the device pointer mode.
Typical sequence: alpha is computed by a kernel and then used as an input to a subsequent kernel.
Example:
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
cublasDnrm2(handle, n, v, incx, alpha);  // alpha points to device memory
cublasDscal(handle, m, alpha, w, incx);  // reads alpha directly on the device
In this case, with the device pointer mode, the two kernels are launched asynchronously in one shot and scheduled sequentially on the GPU because they use the same stream. If the host pointer mode had been used, cublasDnrm2 would have blocked the CPU (even if you use a non-NULL stream).
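For completeness, a fuller sketch of the same sequence with the setup spelled out (d_v and d_w are assumed to already hold n and m doubles on the device; error checking omitted):

cublasHandle_t handle;
cublasCreate(&handle);

cudaStream_t stream;
cudaStreamCreate(&stream);
cublasSetStream(handle, stream);                 // both calls share one stream

double *d_alpha;
cudaMalloc((void **)&d_alpha, sizeof(double));   // the scalar lives on the device

cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
cublasDnrm2(handle, n, d_v, 1, d_alpha);         // writes alpha on the device
cublasDscal(handle, m, d_alpha, d_w, 1);         // reads alpha on the device

// Neither call blocked the host; the stream ordering guarantees the dependency.
cudaStreamSynchronize(stream);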