Are CUBLAS calls (such as DGEMM) blocking or non-blocking? It seems from the forums that they are non-blocking, yet the header file says that CUBLAS_STATUS_SUCCESS is returned if the operation completed successfully.
If it is non-blocking, is the correct way to wait for the operation to complete to call cudaThreadSynchronize()?
Can cudaThreadSynchronize() be used to determine whether a CUBLAS error has occurred, instead of calling cublasGetError()?
Is cublasAlloc() really just a wrapper for cudaMalloc()? That is, if I do a matrix-matrix multiplication, can I expect the result to be laid out sequentially in memory in column-major format, i.e. with each column spaced out by ldc elements?
I know these questions are trivial and that the answers can be pieced together from the documentation, but I have been working for several months now on a library that combines CUBLAS calls with simple kernels of my own, and I am finding an unacceptable number of “unknown” and “launch” errors which I can’t correct.
And this may be an a-hole question, but does anyone actually use CUBLAS? There seem to be a significant number of other matrix-matrix multiplication implementations out there for CUDA. Is CUBLAS a lemon I should be avoiding?
As best I can tell, the header file says no such thing. Neither does the CUBLAS documentation. None of the core functions return a status to the caller. They set an internal static status which indicates whether the last operation was successfully launched on the device, and that status can be accessed by calling cublasGetError() after the BLAS call. It doesn’t indicate whether the operation completed successfully.
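For what it's worth, here is a minimal sketch of that pattern with the legacy cublas.h API (the function name and the assumption that the matrices are already on the device are mine, purely for illustration):

```
#include <stdio.h>
#include "cublas.h"

/* d_A, d_B and d_C are assumed to already hold device data;
   n is the (square) matrix dimension. Assumes cublasInit() was called. */
void sgemm_launch_check(const float *d_A, const float *d_B, float *d_C, int n)
{
    /* cublasSgemm returns void; the call only queues the launch. */
    cublasSgemm('n', 'n', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);

    /* cublasGetError says whether that launch succeeded, not whether
       the multiplication has finished executing. */
    cublasStatus status = cublasGetError();
    if (status != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "SGEMM launch failed (status %d)\n", status);
}
```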
You can do that, but you don’t have to. Copy operations will also wait until the kernel has finished executing.
cudaThreadSynchronize() provides the status of the driver-level operation. It cannot report CUBLAS library-level errors (like calling a BLAS function with incorrect arguments).
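So in practice you end up checking both levels. A sketch, under the same assumptions as above:

```
#include <stdio.h>
#include <cuda_runtime.h>
#include "cublas.h"

/* Call this after queueing a CUBLAS operation. */
void check_both_error_levels(void)
{
    /* Library level: was the last CUBLAS call well formed and
       successfully launched? */
    cublasStatus blas_status = cublasGetError();
    if (blas_status != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "CUBLAS error %d\n", blas_status);

    /* Driver level: wait for the device to finish, then report any
       error that occurred during execution. */
    cudaError_t cuda_status = cudaThreadSynchronize();
    if (cuda_status != cudaSuccess)
        fprintf(stderr, "runtime error: %s\n", cudaGetErrorString(cuda_status));
}
```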
As far as I can tell (and I have not seen the CUBLAS source), it is, and you can use CUDA memory management functions interchangeably with CUBLAS ones in code that uses CUBLAS.
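For example, this kind of mixing appears to work fine (a sketch only; the scale_on_device helper is hypothetical):

```
#include <cuda_runtime.h>
#include "cublas.h"

/* Allocate with cudaMalloc, then hand the pointer to CUBLAS.
   Assumes cublasInit() has already been called. */
int scale_on_device(float *h_x, int n)
{
    float *d_x = 0;
    if (cudaMalloc((void **)&d_x, n * sizeof(float)) != cudaSuccess)
        return -1;

    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  /* host -> device */
    cublasSscal(n, 2.0f, d_x, 1);                       /* x = 2 * x */
    cublasGetVector(n, sizeof(float), d_x, 1, h_x, 1);  /* device -> host */

    cudaFree(d_x);  /* cublasFree(d_x) should be equally valid */
    return (cublasGetError() == CUBLAS_STATUS_SUCCESS) ? 0 : -1;
}
```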
The library uses FORTRAN conventions (column-major storage with 1-based indexing). The first few pages of the first chapter of the CUBLAS documentation describe how to access this storage from C or C++.
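The usual trick is a small indexing macro like the IDX2C example in those pages, which also answers the ldc part of your question: consecutive columns really are spaced by the leading dimension. A sketch (the column_sum function is just an illustration):

```
/* 0-based column-major indexing: element (i, j) of a matrix with
   leading dimension ld lives at offset j * ld + i. */
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* Example: sum column j of an m x n matrix stored with lda >= m.
   Successive columns are lda elements apart. */
float column_sum(const float *A, int m, int lda, int j)
{
    float s = 0.0f;
    int i;
    for (i = 0; i < m; ++i)
        s += A[IDX2C(i, j, lda)];
    return s;
}
```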
Post some examples. If you have found real bugs or problems, I am sure NVIDIA would love to hear about them (and from what I understand, one of the NVIDIA people largely responsible for CUBLAS is a frequent poster in this forum).
I use it. It works for me as advertised. Other implementations of selected BLAS functions can be faster than CUBLAS (Vasily Volkov from U.C. Berkeley has written a couple of papers demonstrating this for SGEMM, for example). And there are aspects of working with CUBLAS which are a little empirical, mostly estimating how large a matrix you can allocate on a device before you run out of memory and (where it applies) keeping the solution times of individual calls under the watchdog timer limit. But it certainly works.