Problem with block size in the SpMV API for BSR format; no SpMV routines provided for COO and CSC

I am using the cuSPARSE library for SpMV (sparse matrix-vector multiplication). With the “bsr” format I am running into a problem with the block size: when I use a block size greater than one, it fails in some cases. I even choose the matrix row count to be a multiple of the block size (although the documentation says this does not matter, since zeros are padded when it is not a multiple). For example, if my matrix has 4096 rows, I use a block size of 8 or 4, but it does not always work. Further, if the number of rows and the number of columns of the matrix differ, what procedure should I use (I can only pass one block size, not two)? Right now, in that case, I use block size one only (which I guess is equivalent to CSR rather than BSR?).
The cuSPARSE BSR conversion and SpMV routines I am using are as follows:

(1). cusparseXcsr2bsrNnz(handle, dir, m, n, descr, csrRowPtr, ColIndex, blockDim, descr, bsrRowPtr, &nnzb); to compute nnzb (the number of nonzero blocks).

(2). status = cusparseDcsr2bsr(handle, dir, m, n, descr, Values, csrRowPtr, ColIndex, blockDim, descr, bsrVal, bsrRowPtr, bsrColIndex); for the conversion from CSR to BSR.

(3). status = cusparseDbsrmv(handle, dir, CUSPARSE_OPERATION_NON_TRANSPOSE, mb, nb, nnzb, &alpha, descr, bsrVal, bsrRowPtr, bsrColIndex, blockDim, x, &beta, y); for the BSR SpMV computation.

mb and nb are computed as: mb = m + blockDim - 1 / blockDim; and nb = n + blockDim - 1 / blockDim;
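One thing worth checking, assuming the code matches the formula exactly as typed above: in C, division binds tighter than addition and subtraction, so m + blockDim - 1 / blockDim is parsed as m + blockDim - (1 / blockDim). With integer operands that is m + blockDim for any blockDim greater than one, and exactly m for blockDim equal to one. A minimal sketch of the parenthesized ceiling division follows; the helper names ceil_div and unparenthesized are hypothetical and not part of cuSPARSE.

```c
#include <assert.h>

/* Ceiling division: number of blocks of size block_dim needed to cover n. */
int ceil_div(int n, int block_dim)
{
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* The expression as typed in the post, kept only for comparison.
   1 / block_dim is integer division: 1 when block_dim == 1, otherwise 0. */
int unparenthesized(int n, int block_dim)
{
    assert(block_dim > 0);
    return n + block_dim - 1 / block_dim;
}
```

For m = 4096 and blockDim = 8, ceil_div gives 512 block rows, while the unparenthesized form gives 4104. With blockDim = 1 both give 4096, which may be why block size one appears to work.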

I followed the cuSPARSE library documentation and used the example given there for BSR. I am seeing two failures: the copy from device to host fails, and a device malloc fails (even though there is enough memory).
Could you please help me with this, or point me to an example for reference?
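For reference, the element counts that the cuSPARSE documentation describes for the arrays used by these three calls can be sketched as below. The names BsrSizes and bsr_sizes are hypothetical helpers for illustration, not cuSPARSE API. The sketch assumes that mb and nb are already the correctly rounded-up block counts and that nnzb is the value returned by cusparseXcsr2bsrNnz.

```c
#include <assert.h>
#include <stddef.h>

/* Element counts (not bytes) for the BSR arrays and the padded vectors. */
typedef struct {
    size_t row_ptr;  /* bsrRowPtr:   mb + 1 ints */
    size_t col_ind;  /* bsrColIndex: nnzb ints */
    size_t val;      /* bsrVal:      nnzb * blockDim * blockDim values */
    size_t x;        /* x: nb * blockDim values (padded column count) */
    size_t y;        /* y: mb * blockDim values (padded row count) */
} BsrSizes;

BsrSizes bsr_sizes(int mb, int nb, int nnzb, int block_dim)
{
    assert(mb >= 0 && nb >= 0 && nnzb >= 0 && block_dim > 0);
    BsrSizes s;
    s.row_ptr = (size_t)mb + 1;
    s.col_ind = (size_t)nnzb;
    s.val     = (size_t)nnzb * (size_t)block_dim * (size_t)block_dim;
    s.x       = (size_t)nb * (size_t)block_dim;
    s.y       = (size_t)mb * (size_t)block_dim;
    return s;
}
```

Because x and y are sized by the padded dimensions, a single blockDim can cover a matrix whose row and column counts differ: mb and nb are rounded up separately. If mb or nb is larger than intended, every one of these buffers is undersized relative to what the routines access, which could plausibly surface as a failed copy or a failed allocation later on.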

For CSC, which SpMV routine should I use? A conversion API is provided, but no SpMV kernel.
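As background for the CSC question: the CSC arrays of a matrix A are the CSR arrays of its transpose, so a CSC SpMV is the same computation as a transposed CSR SpMV. The host-side sketch below only illustrates that relationship; csc_spmv is a hypothetical reference function, not a cuSPARSE routine, and it assumes zero-based indices.

```c
#include <assert.h>

/* y = A * x for an m-by-n matrix A stored in CSC:
   col_ptr has n + 1 entries; row_ind and val have nnz entries.
   Reading the same arrays as CSR gives A^T, so this loop is a
   transposed CSR SpMV: each stored column scatters into y. */
void csc_spmv(int m, int n, const int *col_ptr, const int *row_ind,
              const double *val, const double *x, double *y)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int j = 0; j < n; ++j)
        for (int k = col_ptr[j]; k < col_ptr[j + 1]; ++k)
            y[row_ind[k]] += val[k] * x[j];
}
```

Under this view, one option that may be worth trying is to pass the CSC arrays to a CSR SpMV routine with the transpose operation and with the row and column counts swapped. Whether that performs acceptably is not something this sketch addresses.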

Can I use the HYB API ‘cusparseDhybmv’ for “coo” and “ell”? Since there are no dedicated SpMV APIs for these formats, my concern is whether using the HYB API as an alternative for COO and ELL is justifiable, and whether it will perform worse than dedicated APIs would be expected to.

Please help. Thanks.