NVBLAS cublasXtZgemm failed with error 3/8


I have a block LU factorization code in Fortran and am experimenting with different GPU accelerators. In the actual block LU subroutine I have cublas_ztrsm calls and a cublas_zgemm call. In a verifying subroutine I have several implementations of the LU matrix multiplication: a lapack version, a cublas version, and MAGMA and cublasXt versions implemented through C wrappers. Compiling the code against the cublas library works with any of the LU methods and produces the correct answer. If I compile with nvblas instead, the matrix multiplication (cublas_zgemm) in the LU factorization seems to fail. The nvblas.log file spits out 7 instances of the following:

cublasXtZgemm failed with error=3

followed by two instances of the same error but with error=8 instead of 3. These 9 error lines repeat 5 times, for a total of 45 errors. The weird thing is that cublas_zgemm is only called 30 times, so something else is going on as well.

I also suspect a data dependency/sharing error, because depending on how I order the runs and which verifying multiplication method I use (e.g. cublas vs. cublasXt), I can have normal cublas runs fail (right after a failed nvblas run) and nvblas runs succeed (after certain successful runs).

Either way, I have a problem with nvblas failing to execute cublasXtZgemm calls. I can create an example case if need be, but if someone can tell me what the errors mean I should be able to resolve it myself.

Thank you!

Have you looked up the error codes in any of the relevant header files?

The cublas docs indicate that cublasXtZgemm returns a cublasStatus_t enum error type. That enum is defined in cublas_api.h:

/* CUBLAS status type returns */
typedef enum{
    CUBLAS_STATUS_SUCCESS         =0,
    CUBLAS_STATUS_NOT_INITIALIZED =1,
    CUBLAS_STATUS_ALLOC_FAILED    =3,
    CUBLAS_STATUS_INVALID_VALUE   =7,
    CUBLAS_STATUS_ARCH_MISMATCH   =8,
    CUBLAS_STATUS_MAPPING_ERROR   =11,
    CUBLAS_STATUS_EXECUTION_FAILED=13,
    CUBLAS_STATUS_INTERNAL_ERROR  =14,
    CUBLAS_STATUS_NOT_SUPPORTED   =15
} cublasStatus_t;

I would conclude that error code 3 is:

CUBLAS_STATUS_ALLOC_FAILED
and error code 8 is:

CUBLAS_STATUS_ARCH_MISMATCH
additional text description is provided in the docs:

CUBLAS_STATUS_ALLOC_FAILED: resource allocation failed inside the cuBLAS library; this is usually caused by a cudaMalloc() failure. To correct: prior to the function call, deallocate previously allocated memory as much as possible.

CUBLAS_STATUS_ARCH_MISMATCH: the function requires a feature absent from the device architecture; usually caused by a lack of double-precision support.
Having said all that, I find the usage of nvblas to intercept cublas API calls to be a bit unusual.

Thank you so much for finding those! Unfortunately that makes this even more confusing. Nvblas also intercepts normal level 3 blas calls (not just cublas calls). If I change the cublas calls in the LU factoring subroutine to plain lapack calls, nvblas intercepts more of them (ztrsm this time as well; don't ask me why it didn't intercept cublas_ztrsm). CublasXtZtrsm also returns error=3 in the log file.

According to the documentation, I should deallocate as much as possible to make room for the function calls. The weird thing, though, is that it's a lapack call and nvblas handles everything; there is no allocating on my side. The same applies to the cublas_ztrsm and cublas_zgemm calls, because I was using the fortran_thunking interface, which means CUDA handles all of the allocation and deallocation. I also inspected my GPU with nvidia-smi: the memory is pretty much completely clear, and my code is the only process running on it when launched.

The error=8 only shows up for certain zgemm calls, and it doesn't make sense either: I have a Tesla K40c card, so the architecture shouldn't be a problem. It even shows up when there are no zgemm calls at all (if I comment out every zgemm-related call).

I'm really at a loss as to why all of this is occurring; could it be a bug in the nvblas library? I hesitate to jump to that conclusion, but the errors thrown don't really fit the situation. I'm also still on CUDA 6, so maybe this was a problem fixed in the 6.5 or 7.0 release of CUDA/nvblas?

Why are you trying to intercept cublas calls with nvblas?

That is not the intent of nvblas.

nvblas is intended to intercept ordinary host blas library calls.

If you have cublas calls in your code already, you should just link those to cublas library.
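For reference, the intended usage is to link (or LD_PRELOAD) nvblas ahead of the host BLAS and point it at a configuration file via the NVBLAS_CONFIG_FILE environment variable. A minimal nvblas.conf sketch using documented keywords (the library path shown is a placeholder for whatever host BLAS is installed):

```text
# Host BLAS that nvblas falls back to for small problems (placeholder path)
NVBLAS_CPU_BLAS_LIB  /usr/lib/libopenblas.so

# Which GPU device(s) nvblas may use
NVBLAS_GPU_LIST      0

# Where messages like "cublasXtZgemm failed with error=3" get written
NVBLAS_LOGFILE       nvblas.log

# Optional: pin host memory to speed up transfers
NVBLAS_AUTOPIN_MEM_ENABLED
```

The application itself is then built against nvblas before the host BLAS, e.g. `gfortran app.f90 -lnvblas -lopenblas`, so that the level 3 BLAS symbols resolve to nvblas first.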

Okay, sorry for the delay; I was sick for a few days. Update: thank you for pointing out that nvblas only intercepts CPU blas calls. I had not realized that, but it still should not have been producing the kinds of problems I was seeing: using nvblas correctly, or in tandem with cublas, cublasXt, Fortran, or even MAGMA calls, all resulted in the same kind of error. I then tried nvblas 6.5 on a different machine and got it working perfectly just by following the normal documentation. I think there is some unknown bug in how nvblas behaves on my first machine. It could be a CUDA 6 problem, but unfortunately I cannot confirm that because I'm unable to update CUDA on that machine.

So, for anyone following this thread: the problem appears to be machine-specific, and there is no fix because the root cause was never identified.