Problem with block size greater than 1 in BSR format using the cuSPARSE library

I am using the cuSPARSE library for SpMV (sparse matrix-vector multiplication). With the BSR format I am running into a problem with the block size: when I use a block size greater than one, it fails in some cases. This happens even though I choose the number of matrix rows to be a multiple of the block size (the documentation says this should not matter, since zeros are padded if it is not a multiple). For example, if the matrix has 4096 rows, I use a block size of 8 or 4, but this does not always work.

Further, if the number of rows and the number of columns of the matrix differ, what procedure should I use, given that I can only specify one block size, not two? Right now I use a block size of one in that case.

The cuSPARSE BSR conversion and SpMV routines I am using are as follows:

(1). cusparseXcsr2bsrNnz(handle, dir, m, n, descr, csrRowPtr, ColIndex, blockDim, descr, bsrRowPtr, &nnzb); to compute nnzb (the number of nonzero blocks).

(2). status = cusparseDcsr2bsr(handle, dir, m, n, descr, Values, csrRowPtr, ColIndex, blockDim, descr, bsrVal, bsrRowPtr, bsrColIndex); to convert from CSR to BSR.

(3). status = cusparseDbsrmv(handle, dir, CUSPARSE_OPERATION_NON_TRANSPOSE, mb, nb, nnzb, &alpha, descr, bsrVal, bsrRowPtr, bsrColIndex, blockDim, x, &beta, y); for the BSR SpMV computation.
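
For reference, the array sizes implied by the three calls above can be sketched as follows. This is only a sketch under stated assumptions: it takes the BSR layout described in the cuSPARSE documentation (bsrRowPtr holds mb + 1 integers, bsrColIndex holds nnzb integers, bsrVal holds nnzb * blockDim * blockDim values), and it assumes x and y are allocated with the padded lengths nb * blockDim and mb * blockDim. The struct BsrSizes and the function bsr_sizes are hypothetical names, not part of cuSPARSE.

```c
#include <stddef.h>

/* Byte counts for the device arrays used by csr2bsr and bsrmv
 * (double precision). Hypothetical helper, not a cuSPARSE API. */
typedef struct {
    size_t row_ptr_bytes; /* bsrRowPtr:   mb + 1 ints                 */
    size_t col_ind_bytes; /* bsrColIndex: nnzb ints                   */
    size_t val_bytes;     /* bsrVal:      nnzb * blockDim^2 doubles   */
    size_t x_bytes;       /* x:           nb * blockDim doubles       */
    size_t y_bytes;       /* y:           mb * blockDim doubles       */
} BsrSizes;

BsrSizes bsr_sizes(int m, int n, int nnzb, int blockDim) {
    /* Block rows and block columns are rounded up independently,
     * so one blockDim serves a non-square matrix as well. */
    int mb = (m + blockDim - 1) / blockDim;
    int nb = (n + blockDim - 1) / blockDim;

    BsrSizes s;
    s.row_ptr_bytes = sizeof(int) * (size_t)(mb + 1);
    s.col_ind_bytes = sizeof(int) * (size_t)nnzb;
    s.val_bytes     = sizeof(double) * (size_t)nnzb
                      * (size_t)blockDim * (size_t)blockDim;
    s.x_bytes       = sizeof(double) * (size_t)nb * (size_t)blockDim;
    s.y_bytes       = sizeof(double) * (size_t)mb * (size_t)blockDim;
    return s;
}
```

The values in this struct would be passed to cudaMalloc and cudaMemcpy. Note that bsrVal and bsrColIndex can only be sized after step (1) has returned nnzb.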

mb and nb are computed as: mb = m + blockDim - 1 / blockDim; and nb = n + blockDim - 1 / blockDim;
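
A note on the two expressions as typed above: in C, division binds tighter than addition and subtraction, so m + blockDim - 1 / blockDim is parsed as m + blockDim - (1 / blockDim). With integer arithmetic and blockDim > 1 this evaluates to m + blockDim rather than the number of block rows. The sketch below compares that with the parenthesized ceiling division, assuming plain int operands. The helper names blocks_as_written and blocks_ceil are hypothetical and only serve the illustration.

```c
/* The expression exactly as typed in the post: 1 / blockDim is
 * evaluated first, and is 0 for any blockDim > 1. */
int blocks_as_written(int m, int blockDim) {
    return m + blockDim - 1 / blockDim;
}

/* Ceiling division: the number of blockDim-sized blocks needed
 * to cover m rows (or columns). */
int blocks_ceil(int m, int blockDim) {
    return (m + blockDim - 1) / blockDim;
}
```

For m = 4096 and blockDim = 8, blocks_as_written returns 4104 while blocks_ceil returns 512. For blockDim = 1 both return m, which would be consistent with block size one working and larger block sizes failing, if the actual code matches the expressions as typed.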

I followed the cuSPARSE library documentation and used the BSR example given there. I am seeing two failures: the copy from device to host fails, and a device malloc fails (even though there is enough memory).
Could you please help me with this, or point me to an example for reference?

Also, which SpMV routine should I use for CSC? A conversion API is provided, but no SpMV kernel.
Please help. Thanks.