Would cudaMalloc-ing more memory than what cusparseXcsrgemmNnz calculated for a cusparse matrix work?

Here is the code snippet from the sample in the documentation:

int baseC, nnzC;
// nnzTotalDevHostPtr points to host memory
int *nnzTotalDevHostPtr = &nnzC;
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
cudaMalloc((void**)&csrRowPtrC, sizeof(int)*(m+1));
cusparseXcsrgemmNnz(handle, transA, transB, m, n, k,
        descrA, nnzA, csrRowPtrA, csrColIndA,
        descrB, nnzB, csrRowPtrB, csrColIndB,
        descrC, csrRowPtrC, nnzTotalDevHostPtr );
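if (NULL != nnzTotalDevHostPtr){
    // nnz was written to host memory by cusparseXcsrgemmNnz
    nnzC = *nnzTotalDevHostPtr;
}else{
    // otherwise derive nnz from the first and last entries of csrRowPtrC
    cudaMemcpy(&nnzC, csrRowPtrC+m, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&baseC, csrRowPtrC, sizeof(int), cudaMemcpyDeviceToHost);
    nnzC -= baseC;
}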
cudaMalloc((void**)&csrColIndC, sizeof(int)*nnzC);
cudaMalloc((void**)&csrValC, sizeof(float)*nnzC);
cusparseScsrgemm(handle, transA, transB, m, n, k,
        descrA, nnzA,
        csrValA, csrRowPtrA, csrColIndA,
        descrB, nnzB,
        csrValB, csrRowPtrB, csrColIndB,
        descrC,
        csrValC, csrRowPtrC, csrColIndC);

My question is: instead of cudaMalloc-ing nnzC*size for csrColIndC and csrValC, if I cudaMalloc a predetermined constant nnz_pre*size, where I can guarantee nnz_pre is always larger than nnzC for my problem, would that cause any problems for standard cusparse operations like cusparseDcsrgeam and cusparseDcsrgemm?

The motivation is a real-time application that involves a camera. Calling cudaMalloc and cudaFree for every single frame causes a significant slowdown. That’s why I want to cudaMalloc a fixed size just once at the very start and reuse that same chunk of memory for the computation of every frame.
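To make the allocate-once idea concrete, here is a minimal sketch of the bookkeeping I have in mind. The names CsrWorkspace, rawAlloc and rawFree are hypothetical helpers of mine, not part of cusparse. When compiled with nvcc the buffers come from cudaMalloc; otherwise the sketch falls back to malloc so the capacity logic can be checked on the host. It only shows the reuse pattern and says nothing about whether cusparse accepts oversized buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical helper: device allocation under nvcc, host malloc otherwise.
static void* rawAlloc(std::size_t bytes) {
#ifdef __CUDACC__
    void* p = nullptr;
    return cudaMalloc(&p, bytes) == cudaSuccess ? p : nullptr;
#else
    return std::malloc(bytes);
#endif
}

// Hypothetical helper: matching free; passing a null pointer is a no-op.
static void rawFree(void* p) {
#ifdef __CUDACC__
    cudaFree(p);
#else
    std::free(p);
#endif
}

// Hypothetical workspace holding the CSR arrays of the output matrix C.
struct CsrWorkspace {
    int*   rowPtr = nullptr;  // m+1 entries
    int*   colInd = nullptr;  // `capacity` entries
    float* val    = nullptr;  // `capacity` entries
    int capacity   = 0;       // entries available in colInd / val
    int allocCount = 0;       // how many times colInd / val were (re)allocated

    // Called once at startup with the predetermined upper bound nnzPre.
    bool init(int m, int nnzPre) {
        assert(m >= 0 && nnzPre > 0);
        rowPtr = static_cast<int*>(rawAlloc(sizeof(int) * (m + 1)));
        return rowPtr != nullptr && grow(nnzPre);
    }

    // Called every frame after the nnz pass: reallocates only if the
    // bound turned out to be too small, otherwise touches nothing.
    bool ensure(int nnzC) {
        return nnzC <= capacity || grow(nnzC);
    }

    void release() {
        rawFree(rowPtr);
        rawFree(colInd);
        rawFree(val);
        rowPtr = nullptr;
        colInd = nullptr;
        val = nullptr;
        capacity = 0;
    }

private:
    // Replaces colInd / val with buffers of `nnz` entries (old contents are dropped).
    bool grow(int nnz) {
        rawFree(colInd);
        rawFree(val);
        colInd = static_cast<int*>(rawAlloc(sizeof(int) * nnz));
        val    = static_cast<float*>(rawAlloc(sizeof(float) * nnz));
        const bool ok = (colInd != nullptr) && (val != nullptr);
        capacity = ok ? nnz : 0;
        ++allocCount;
        return ok;
    }
};
```

Per frame, the intended use would be to run cusparseXcsrgemmNnz with ws.rowPtr as csrRowPtrC, call ws.ensure(nnzC), and then pass ws.val, ws.rowPtr and ws.colInd to cusparseScsrgemm. As long as nnz_pre really is an upper bound, ensure never allocates after init.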

The behavior I currently observe in my unit test is non-deterministic: sometimes it produces the right result and sometimes it doesn’t. Because of that, I am unable to provide a minimal example that reproduces the problem. I would love to hear from the cusparse team on whether allocating extra memory is supposed to work. If not, is there a way to avoid cudaMalloc-ing and cudaFree-ing for every single frame while doing sparse matrix computation?