Are CUBLAS calls (such as DGEMM) blocking or non-blocking? It seems from the forums that they are non-blocking, yet the header file says that CUBLAS_STATUS_SUCCESS is returned if the operation completed successfully.
If it is non-blocking, is the correct way to wait for the operation to complete to call cudaThreadSynchronize()?
Can cudaThreadSynchronize() be used instead of cublasGetError() to determine whether a CUBLAS error has occurred?
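To make the first three questions concrete, this is roughly the checking pattern I am asking about (a sketch only, legacy CUBLAS API, matrix setup omitted; I am assuming the launch status and the completion status have to be checked separately):

```c
/* d_A, d_B, d_C are device pointers allocated elsewhere. */
cublasDgemm('N', 'N', m, n, k, 1.0, d_A, lda, d_B, ldb, 0.0, d_C, ldc);

/* Did the call itself fail to launch? */
cublasStatus status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    /* handle launch error */
}

/* Wait for the GEMM to actually finish; execution errors
   (e.g. "unknown error") should surface here, if anywhere. */
cudaError_t err = cudaThreadSynchronize();
if (err != cudaSuccess) {
    /* handle execution error */
}
```

Is this the right division of labor between the two error paths, or is one of them redundant?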
Is cublasAlloc() really just a wrapper for cudaMalloc()? That is, if I do a matrix-matrix multiplication, can I expect the result to be laid out sequentially in memory in column-major format, i.e. with each column spaced ldc elements apart?
I know these questions are basic and that the answers can be pieced together from the documentation, but I have been working for several months now on a library that mixes CUBLAS calls with simple kernels of my own, and I keep hitting an unacceptable number of "unknown error" and launch errors that I can't correct.
And this may be an a-hole question, but does anyone actually use CUBLAS? There seem to be quite a few alternative matrix-matrix multiplication implementations out there for CUDA. Is CUBLAS a lemon I should be avoiding?