Do we need to call cudaDeviceSynchronize after calling cublasSgemm?

Do the cublasXgemm routines take care of the cudaDeviceSynchronize, or should I call it after I call one of them?

Thank you

In general, the cuBLAS API is intended to be asynchronous. This isn’t always the case, and the behavior is generally not well specified, so the safe assumption is that a call is asynchronous.

Therefore, when a call like cublasXgemm returns control to the host thread, it’s not guaranteed that the underlying work is complete.

However, if your next step is additional CUDA work on those results, proper use of CUDA streams will prevent anything from running out of order: work issued into the same stream executes in issue order. It should not be necessary to use cudaDeviceSynchronize in most cases.
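As a minimal sketch of that stream-ordering point (not taken from the original post): the GEMM and a follow-up kernel are issued into the same stream, so the kernel only runs after the GEMM has finished, with no cudaDeviceSynchronize in between. The names add_one and gemm_then_kernel_same_stream are hypothetical, error checking is left out for brevity, and the matrices are filled with ones so the result is easy to predict.

```cuda
#include <cassert>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical follow-up kernel: adds 1 to every element of x.
__global__ void add_one(float* x, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) x[i] += 1.0f;
}

// Hypothetical helper: computes C = A * B for n x n matrices of ones,
// adds 1 to C on the GPU, and returns C[0] (n + 1 if everything ran in order).
float gemm_then_kernel_same_stream(int n) {
    assert(n > 0);
    const int count = n * n;
    const size_t bytes = count * sizeof(float);
    std::vector<float> ones(count, 1.0f), hC(count, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, ones.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, ones.data(), bytes, cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, stream);  // cuBLAS work now goes into `stream`

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // Same stream as the GEMM: ordered after it, no device-wide sync needed.
    add_one<<<(count + 255) / 256, 256, 0, stream>>>(dC, count);

    // The copy is also queued in the same stream; the host waits on that
    // one stream only because it is about to read hC.
    cudaMemcpyAsync(hC.data(), dC, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cublasDestroy(handle);
    cudaStreamDestroy(stream);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return hC[0];
}
```

Calling gemm_then_kernel_same_stream(4) should return 5.0f on a working CUDA device: each element of C is 4 after the GEMM, and the kernel adds 1.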

And if your next step is to gather the results (back to the host) using e.g. cudaMemcpy, then that call also has synchronization built in: it does not return until the copy is done, and the copy waits for the preceding work. So no extra calls to cudaDeviceSynchronize should be necessary.
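A minimal sketch of that case, assuming the cuBLAS handle is left on the default stream (nothing here comes from the original post; gemm_then_blocking_copy is a hypothetical name and error checking is omitted). The plain cudaMemcpy after cublasSgemm is the only synchronization point.

```cuda
#include <cassert>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: computes C = A * B for n x n matrices of ones and
// returns C[0] after a blocking copy back to the host (expected value: n).
float gemm_then_blocking_copy(int n) {
    assert(n > 0);
    const int count = n * n;
    const size_t bytes = count * sizeof(float);
    std::vector<float> ones(count, 1.0f), hC(count, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, ones.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, ones.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);  // assumption: handle stays on the default stream

    const float alpha = 1.0f, beta = 0.0f;
    // May return to the host before the multiplication has finished.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // No cudaDeviceSynchronize here: the blocking copy is ordered after the
    // GEMM and does not return until the data is in hC.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return hC[0];
}
```

With n = 4 the function should return 4.0f, since every element of C is a sum of four products of ones.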

In many CUDA programs, the use of cudaDeviceSynchronize will be rare.

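Exceptions to the above might occur in some cases if you are using managed memory (and thus there is no cudaMemcpy operation before inspecting results), if you use cudaMemcpyAsync, or if you pass pointers returned by e.g. cudaHostAlloc (also indicating you’re not actually using cudaMemcpy). In those cases nothing blocks the host implicitly, so you need an explicit synchronization, such as cudaDeviceSynchronize or cudaStreamSynchronize, before the host reads the results.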
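A minimal sketch of the managed-memory exception, under the assumption that all three matrices are allocated with cudaMallocManaged (gemm_managed is a hypothetical name, and error checking is omitted). Since there is no cudaMemcpy to wait on, the host synchronizes explicitly before touching C.

```cuda
#include <cassert>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: computes C = A * B for n x n matrices of ones held in
// managed memory and returns C[0] as read directly by the host (expected: n).
float gemm_managed(int n) {
    assert(n > 0);
    const int count = n * n;
    const size_t bytes = count * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < count; ++i) {
        A[i] = 1.0f;
        B[i] = 1.0f;
        C[i] = 0.0f;
    }

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    // No copy stands between the GEMM and the host read, so wait explicitly.
    // Without this, the host could read C before the GEMM has finished.
    cudaDeviceSynchronize();
    const float first = C[0];

    cublasDestroy(handle);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return first;
}
```

If the handle were bound to a specific stream with cublasSetStream, calling cudaStreamSynchronize on that stream would be a narrower alternative to cudaDeviceSynchronize.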

Thank you