scalars passed by reference to async procedures

In CUBLAS, some scalars are passed by reference to async procedures. I wonder if this means that code like

void f(...) {
    float beta = 0;
    cublasSgemm(..., &beta);
}

is invalid, since by the time cublasSgemm reads from &beta, f will likely have returned, making &beta a dangling pointer.

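I am not completely sure I understand your question. The cublasSgemm() call occurs on the host, synchronous with the rest of the host code. So with the default host pointer mode (CUBLAS_POINTER_MODE_HOST), the input argument “beta” is read synchronously at the time of the API call, and f() can safely return afterwards. If you switch to CUBLAS_POINTER_MODE_DEVICE, the scalar lives in device memory and is read by the GPU later, so in that case it has to stay valid until the kernels have finished.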

Inside the CUBLAS API call, one or more CUDA kernels are launched. What does that mean? A kernel launch command, along with the kernel configuration data (grid, block, shared memory size, stream) and the kernel arguments, is packaged up and placed into a work queue. At some unspecified later time, the GPU retrieves the whole package from the work queue and kernel execution commences. So the kernels themselves execute asynchronously relative to the host code, but their arguments were already copied by value when the launch was queued.
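To make the argument-copying point concrete, here is a minimal host-only sketch in plain C++. It is a toy model, not how the CUDA driver or CUBLAS is actually implemented: ToyWorkQueue, enqueue(), drain() and submit_with_local_scalar() are all hypothetical names made up for illustration. The only thing it is meant to show is that a scalar passed by pointer can be read once at call time and stored by value in the queue, so the caller's stack variable does not need to outlive the call.

```cpp
#include <functional>
#include <queue>

// Toy host-side stand-in for a launch queue (hypothetical, not the CUDA driver).
class ToyWorkQueue {
public:
    // Queue a "kernel" together with a copy of the scalar it needs.
    void enqueue(std::function<void(float)> kernel, const float* scalar) {
        float value = *scalar;  // the pointer is dereferenced here, at call time
        pending_.push([kernel, value] { kernel(value); });
    }

    // Run everything queued so far (stands in for the GPU picking up the work).
    void drain() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Mirrors f() from the question: beta lives on this function's stack.
void submit_with_local_scalar(ToyWorkQueue& q, float& result) {
    float beta = 0.5f;
    q.enqueue([&result](float b) { result = 2.0f * b; }, &beta);
}  // beta is gone here, but the queue already holds its value
```

Calling submit_with_local_scalar() and only afterwards drain() is fine in this model, because nothing in the queue refers back to &beta.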

Now, if a kernel produces output data that the host code wants to pick up, synchronization is required to make sure the data is ready to be sent to the host. In the simplest case, this is done with a cudaMemcpy() call which is defined to include an initial synchronization step.
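The same kind of toy model can illustrate why the output side needs a synchronization step. Again this is only a host-side analogy under my own assumptions: ToyStream, launch(), synchronize() and blocking_copy() are hypothetical names, and the "device buffer" is just a std::vector. The point is that a copy which synchronizes first will always see the finished result, while reading the buffer before that would see stale data.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a stream: queued work runs only when synchronize() is called.
struct ToyStream {
    std::vector<std::function<void()>> pending;

    // Queue the work and return immediately, like an asynchronous launch.
    void launch(std::function<void()> kernel) { pending.push_back(std::move(kernel)); }

    // Run all queued work to completion.
    void synchronize() {
        for (auto& kernel : pending) kernel();
        pending.clear();
    }
};

// Modeled on a blocking copy: synchronize first, then copy the output.
std::vector<float> blocking_copy(ToyStream& s, const std::vector<float>& device_buf) {
    s.synchronize();
    return device_buf;
}
```

In this model, a buffer modified by a queued "kernel" still holds its old contents right after launch() and only shows the new values once blocking_copy() has run.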

This question was brought up on Stack Overflow (several times, actually) and was recently addressed there by a member of the NVIDIA CUBLAS team (Phillipe V.):

"Asynchrony and memory ownership in CUBLAS" (Stack Overflow, cuda tag)

Thanks!