I have a block LU factorization code in Fortran and am experimenting with different GPU acceleration libraries. The block LU subroutine itself makes cublas_ztrsm calls and a cublas_zgemm call. A separate verification subroutine multiplies L and U back together, and I have several implementations of that multiplication: a LAPACK version, a cuBLAS version, and MAGMA and cublasXt versions called through C wrappers. Building the code with the cuBLAS library and any of these LU multiplication methods gives the correct answer. If I build with NVBLAS instead, the matrix multiplication (cublas_zgemm) in the LU factorization seems to fail. The nvblas.log file contains 7 instances of the following:
cublasXtZgemm failed with error=3
followed by two instances of the same message with error=8 instead of error=3. This group of 9 lines repeats 5 times, for a total of 45 errors. Oddly, cublas_zgemm is only called 30 times, so the counts do not match and something else must be going on as well.
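In case it helps anyone reading the log, here is a minimal sketch of how those numbers could be decoded. It assumes (I have not confirmed this) that nvblas.log prints the raw cublasStatus_t value returned by cublasXtZgemm. The helper name cublas_status_name is hypothetical, and the numeric values are as I understand them from the cuBLAS header cublas_api.h, so they should be checked against the installed CUDA version.

```c
#include <stdio.h>

/* Hypothetical helper (not part of cuBLAS): maps a numeric status code to
 * the name of the matching cublasStatus_t enumerator.
 * ASSUMPTION: the values below mirror cublas_api.h; verify them against
 * the header shipped with your CUDA toolkit. */
static const char *cublas_status_name(int code)
{
    switch (code) {
    case 0:  return "CUBLAS_STATUS_SUCCESS";
    case 1:  return "CUBLAS_STATUS_NOT_INITIALIZED";
    case 3:  return "CUBLAS_STATUS_ALLOC_FAILED";
    case 7:  return "CUBLAS_STATUS_INVALID_VALUE";
    case 8:  return "CUBLAS_STATUS_ARCH_MISMATCH";
    case 11: return "CUBLAS_STATUS_MAPPING_ERROR";
    case 13: return "CUBLAS_STATUS_EXECUTION_FAILED";
    case 14: return "CUBLAS_STATUS_INTERNAL_ERROR";
    case 15: return "CUBLAS_STATUS_NOT_SUPPORTED";
    case 16: return "CUBLAS_STATUS_LICENSE_ERROR";
    default: return "unknown cublasStatus_t value";
    }
}

/* Prints one decoded line for a code taken from nvblas.log. */
static void print_decoded(int code)
{
    printf("error=%d -> %s\n", code, cublas_status_name(code));
}
```

Under that assumption, calling print_decoded(3) and print_decoded(8) would report CUBLAS_STATUS_ALLOC_FAILED and CUBLAS_STATUS_ARCH_MISMATCH respectively, which at least narrows down what to look for.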
I also suspect a data dependency or sharing problem, because the outcome depends on the order of the runs and on which verification multiplication method is used (e.g. cuBLAS vs. cublasXt): a normal cuBLAS run can fail right after a failed NVBLAS run, and an NVBLAS run can succeed after certain successful runs.
Either way, NVBLAS is failing to execute the cublasXtZgemm calls. I can put together an example case if needed, but if someone can tell me what the errors mean I should be able to resolve it myself.