gemmStridedBatched

I’m interested in using

cublasSgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, long long int strideA, const float *B, int ldb, long long int strideB, const float *beta, float *C, int ldc, long long int strideC, int batchCount)

to perform C = α op(A) op(B) + β C in the case where A is a block-diagonal matrix. In this case it’s the horizontal strips (submatrices) of the matrix C that make up the matrices in the C-batch, which means that the matrices in the C-batch are interleaved in memory and I’d like to have strideC < m × n. The documentation does not warn against this, but when I try to run this I get the error:

Check failed: status == CUBLAS_STATUS_SUCCESS (7 vs. 0) CUBLAS_STATUS_INVALID_VALUE
On entry to SGEMM parameter number 15 had an illegal value

Since there are many small blocks, I much prefer StridedBatched over Batched. Does anyone have a solution? Am I correct that StridedBatched does not accept cases where the C matrices are interleaved in memory? I would think it would be better to allow this and, if collisions happen, treat that as the user’s fault. Will this be changed in later versions? Can users alter this?

This is because cublasSgemmStridedBatched is implemented differently in different CUDA versions.
In CUDA 9, C cannot be separated by column, which means strideC must be greater than or equal to the size of a C block.
In CUDA 10+, it works. You can try it.
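
For reference, here is a rough CPU sketch of the layout described in the question. It is not the cuBLAS implementation; it only mimics the strided-batched indexing for the no-transpose, column-major case so the leading dimensions and strides can be checked on the host. The function names (sgemm_strided_batched_ref, block_diag_times_dense, dense_reference) and the assumption that the diagonal blocks of A are stored back to back are mine, not from the library.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for the strided-batched GEMM indexing, restricted to
// op(A) = A and op(B) = B with column-major storage. Argument names
// mirror the cublasSgemmStridedBatched signature quoted above.
void sgemm_strided_batched_ref(int m, int n, int k, float alpha,
                               const float* A, int lda, long long strideA,
                               const float* B, int ldb, long long strideB,
                               float beta, float* C, int ldc, long long strideC,
                               int batchCount) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int p = 0; p < batchCount; ++p) {
        const float* Ap = A + p * strideA;  // p-th diagonal block of A
        const float* Bp = B + p * strideB;  // p-th horizontal strip of B
        float* Cp = C + p * strideC;        // p-th horizontal strip of C
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                float acc = 0.0f;
                for (int l = 0; l < k; ++l) {
                    acc += Ap[i + (long long)l * lda] * Bp[l + (long long)j * ldb];
                }
                float& c = Cp[i + (long long)j * ldc];
                c = alpha * acc + beta * c;
            }
        }
    }
}

// Hypothetical helper: A is block diagonal with `batch` blocks of size
// mb x kb stored back to back; B is dense (batch*kb) x n; the result C is
// dense (batch*mb) x n. Note strideC = mb, which is smaller than mb * n.
std::vector<float> block_diag_times_dense(const std::vector<float>& blocks,
                                          const std::vector<float>& B,
                                          int mb, int kb, int n, int batch) {
    const int M = batch * mb;  // rows of the full C, used as ldc
    const int K = batch * kb;  // rows of the full B, used as ldb
    std::vector<float> C(M * n, 0.0f);
    sgemm_strided_batched_ref(mb, n, kb, 1.0f,
                              blocks.data(), mb, (long long)mb * kb,
                              B.data(), K, kb,
                              0.0f, C.data(), M, mb, batch);
    return C;
}

// Dense reference: expand A to a full M x K matrix and multiply directly.
std::vector<float> dense_reference(const std::vector<float>& blocks,
                                   const std::vector<float>& B,
                                   int mb, int kb, int n, int batch) {
    const int M = batch * mb, K = batch * kb;
    std::vector<float> A(M * K, 0.0f);
    for (int p = 0; p < batch; ++p)
        for (int l = 0; l < kb; ++l)
            for (int i = 0; i < mb; ++i)
                A[(p * mb + i) + (p * kb + l) * M] = blocks[p * mb * kb + i + l * mb];
    std::vector<float> C(M * n, 0.0f);
    for (int j = 0; j < n; ++j)
        for (int l = 0; l < K; ++l)
            for (int i = 0; i < M; ++i)
                C[i + j * M] += A[i + l * M] * B[l + j * K];
    return C;
}
```

With this layout the strips of C interleave within each column but never write to the same element, so the batches are independent. On a cuBLAS version that accepts strideC < m × n, the same m, n, k, lda, ldb, ldc and stride values would presumably be passed to cublasSgemmStridedBatched with device pointers in place of the host vectors.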