I’m interested in using

cublasSgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, long long int strideA, const float *B, int ldb, long long int strideB, const float *beta, float *C, int ldc, long long int strideC, int batchCount)

to perform C = α op(A) op(B) + β C in the case where A is a block-diagonal matrix. In this case the matrices in the C batch are the horizontal strips (row blocks) of the full matrix C, so they are interleaved in memory and I’d like to use strideC < m × n. The documentation does not warn against this, but when I try to run it I get the error:

Check failed: status == CUBLAS_STATUS_SUCCESS (7 vs. 0) CUBLAS_STATUS_INVALID_VALUE

On entry to SGEMM parameter number 15 had an illegal value

Since there are many small blocks, I would much rather use StridedBatched than Batched. Does anyone have a solution? Am I correct that StridedBatched does not accept cases where the C matrices are interleaved in memory? I would think it better to allow this and, if collisions happen, treat them as the user’s fault. Will this be changed in later versions? Is there anything users can do to work around it?
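
To make the layout concrete, here is a minimal CPU sketch of what I am trying to do. It is not cuBLAS code: gemm_strided_batched_ref is a hypothetical host-side reference that I wrote to mimic the documented semantics (column-major, no transposes, matrix i starts at base + i * stride), and block_diag_times_dense is an illustrative helper. It assumes all diagonal blocks are b × b and stored back to back. Note that the C strips interleave in memory but never overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference mimicking strided-batched GEMM semantics
// (column-major, op(A) = A, op(B) = B):
//   C_p = alpha * A_p * B_p + beta * C_p,  with X_p = X + p * strideX.
// This is an illustration only, not the cuBLAS implementation.
void gemm_strided_batched_ref(int m, int n, int k, float alpha,
                              const float* A, int lda, long long strideA,
                              const float* B, int ldb, long long strideB,
                              float beta,
                              float* C, int ldc, long long strideC,
                              int batchCount)
{
    // Leading dimensions must cover the rows of each sub-matrix.
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int p = 0; p < batchCount; ++p) {
        const float* Ap = A + p * strideA;
        const float* Bp = B + p * strideB;
        float* Cp = C + p * strideC;
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int l = 0; l < k; ++l) {
                    acc += Ap[i + static_cast<std::size_t>(l) * lda] *
                           Bp[l + static_cast<std::size_t>(j) * ldb];
                }
                float& c = Cp[i + static_cast<std::size_t>(j) * ldc];
                c = alpha * acc + beta * c;
            }
        }
    }
}

// C = blockdiag(A_0, ..., A_{nb-1}) * B, where each A_p is b x b and the
// blocks are stored contiguously (assumption), and B, C are (nb*b) x N,
// column-major. Strip p of C starts b rows below strip p-1, so
// strideC = b, which is smaller than m * n = b * N whenever N > 1.
std::vector<float> block_diag_times_dense(int nb, int b, int N,
                                          const std::vector<float>& Ablocks,
                                          const std::vector<float>& B)
{
    const int M = nb * b;  // total rows of B and C
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    gemm_strided_batched_ref(/*m=*/b, /*n=*/N, /*k=*/b, /*alpha=*/1.0f,
                             Ablocks.data(), /*lda=*/b,
                             /*strideA=*/static_cast<long long>(b) * b,
                             B.data(), /*ldb=*/M, /*strideB=*/b,
                             /*beta=*/0.0f,
                             C.data(), /*ldc=*/M, /*strideC=*/b,
                             /*batchCount=*/nb);
    return C;
}
```

With two 2 × 2 blocks and N = 2, this uses ldc = 4 and strideC = 2, while m × n = 4. That is the same combination of arguments I would like to pass to cublasSgemmStridedBatched.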