Incorrect result when using cublasSgemm and cublasSaxpy together

Hi there, I’m trying to use cublasSgemm and cublasSaxpy to calculate the formula: C = A*C + I, so I used the code below.

Interestingly, when I test the code with a 10x10 matrix, the result is correct. But when I test it with a 15x15 or any larger matrix, the result becomes C = A*C. It seems that my program didn’t execute the cublasSaxpy function.

When I remove either cublasSgemm or cublasSaxpy, the code below works well. Can someone help?

    for (int i = 0; i < n_iterations; i++) {
        err = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        err = cublasSaxpy(handle, m, &alphaI, I, 1, C, 1);
        float* temp = B;
        B = C;
        C = temp;
    }
    return B;

I suggest providing complete, compilable code, along with the platform, the compile command, the CUDA version, the GPU you are running on, and the actual vs. expected results.
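
For the "expected results" part, a small CPU reference can help. The following is only a sketch under assumptions that are not in the original post: the matrices are square (n x n), stored column-major as cuBLAS expects, I is the identity matrix, and `ref_step` and `max_abs_diff` are hypothetical helper names. It computes one loop iteration on the host so the GPU output can be compared against it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): CPU reference for one
 * loop iteration, C = alpha * A * B + beta * C + alpha_i * Id.
 * All matrices are assumed n x n and column-major; Id is the identity. */
static void ref_step(int n, float alpha, const float *A, const float *B,
                     float beta, float *C, float alpha_i)
{
    assert(n > 0);
    for (int col = 0; col < n; col++) {
        for (int row = 0; row < n; row++) {
            /* Dot product of row `row` of A with column `col` of B. */
            float acc = 0.0f;
            for (int p = 0; p < n; p++)
                acc += A[row + (size_t)p * n] * B[p + (size_t)col * n];

            size_t idx = row + (size_t)col * n;
            /* Do not read C when beta is zero, so stale values are ignored. */
            float prev = (beta == 0.0f) ? 0.0f : beta * C[idx];
            C[idx] = alpha * acc + prev;

            /* Add the scaled identity on the diagonal only. */
            if (row == col)
                C[idx] += alpha_i;
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
static float max_abs_diff(const float *x, const float *y, size_t count)
{
    float worst = 0.0f;
    for (size_t i = 0; i < count; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

To use it, copy the GPU result back to the host after each iteration, run `ref_step` on host copies of the same inputs, and report `max_abs_diff` over all n*n elements together with the matrix size at which the two first disagree.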