CUSPARSE: multiplying two sparse matrices (one of them has rows that are entirely zero)

Hi

I am trying to incorporate CUSPARSE after successfully developing my software with CUSP. CUSP apparently takes more time to set up than CUSPARSE, and I want to reduce that setup time.
I am using the CUDA beta release that was announced at GTC 2012 (San Jose). It is installed as cuda-5.0.7, and the version command reports release 5.0, V0.2.1221.
However, I face a problem when multiplying two sparse matrices. Both of them are stored in CSR format.

One of them is a regular 7-point Laplacian and the other is a tall matrix, so to speak. If A is the Laplacian, a sparse square matrix (MxM), then I am multiplying it by B, which is MxN. Here N << M, so B is a tall matrix.

In addition, some rows of B are entirely zero. It may sound counter-intuitive to use CUSPARSE with a matrix that has zeroes filled in. I do this to save the effort of trimming A so that A and B agree on the inner dimension.
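For comparison, if the zero rows of B were left out of the COO data instead of being stored with value 0.0, then as far as I understand CSR an empty row simply repeats the previous offset in the row-pointer array. A minimal host-side sketch in plain C follows. The helper name `coo_rows_to_csr` is my own illustration and not a CUSPARSE function, and it assumes zero-based indices.

```c
#include <assert.h>

/* Hypothetical host-side helper (not part of CUSPARSE): build a zero-based
   CSR row-pointer array of length m+1 from nnz COO row indices. */
void coo_rows_to_csr(const int *coo_rows, int nnz, int m, int *csr_row_ptr)
{
    int i;

    for (i = 0; i <= m; i++)
        csr_row_ptr[i] = 0;

    /* Count the entries of each row, stored one slot to the right. */
    for (i = 0; i < nnz; i++) {
        assert(coo_rows[i] >= 0 && coo_rows[i] < m);
        csr_row_ptr[coo_rows[i] + 1]++;
    }

    /* A prefix sum turns the per-row counts into offsets. A row with no
       entries adds 0, so its offset equals that of the following row. */
    for (i = 0; i < m; i++)
        csr_row_ptr[i + 1] += csr_row_ptr[i];
}
```

With row indices {0, 1, 3} and m = 5, this gives the row pointer {0, 1, 2, 2, 3, 3}: rows 2 and 4 are empty and repeat the preceding offset.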

I used the same matrices with CUSP and they worked fine (albeit in different formats, but I used the multiply function in CUSP).

However, in this case the function that calculates the number of non-zeroes seems to return a very large (possibly garbage) value.

I am pasting a bit of the code here.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

z_vals= (double*)initVector(dim,MYDOUBLE);
for (i = 0; i < zsize; i++) z_vals[i]=1.0;
for (;i<dim;i++) z_vals[i]=0.0; /// this is the zero part of the matrix (rows and columns are non-zeroes)
//printf("\n zsize is %d\n",zsize);
checkCudaErrors( cudaMalloc((void**)&d_zcols, dim*sizeof(int)) );
checkCudaErrors( cudaMalloc((void**)&d_zrows, dim*sizeof(int)) );
checkCudaErrors( cudaMalloc((void**)&d_zrowscsr, (dim+1)*sizeof(int)) );
checkCudaErrors( cudaMalloc((void**)&d_zvals, dim*sizeof(double)) );

cudaMemcpy(d_zcols, colsZ, dim*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_zrows, rowsZ, dim*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_zvals, z_vals, dim*sizeof(double), cudaMemcpyHostToDevice);

cusparseStatus= cusparseXcoo2csr(cusparseHandle, d_zrows, dim,dim, d_zrowscsr, CUSPARSE_INDEX_BASE_ZERO);

checkCudaErrors(cudaMalloc((void**)&d_azrowscsr,sizeof(int)*(dim+1)));
cusparseXcsrgemmNnz(cusparseHandle, nontrans, nontrans,
dim, numvecs,dim,
descr,nnzA,d_arowscsr,d_acols,
descrZ, dim, d_zrowscsr, d_zcols,
descrAZ, d_azrowscsr);
cudaMemcpy(&nnzAZ, d_azrowscsr+dim, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&baseAZ, d_azrowscsr, sizeof(int), cudaMemcpyDeviceToHost);
nnzAZ-=baseAZ;
printf("\n Number of non-zeros in Z is %d AZ is %d value of baseAZ is %d \n",dim, nnzAZ, baseAZ);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The output of the last printf is a very large number. Here dim=32768, numvecs=8, nontrans=CUSPARSE_OPERATION_NON_TRANSPOSE, and MYDOUBLE is just the normal double type.
A is the Laplacian and, in the code, Z is the tall matrix. A is also stored in CSR format, as required by this CUSPARSE routine.
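To have something to compare the GPU value against, the count can be reproduced on the host. Below is a minimal sketch in plain C of a symbolic product that counts the stored entries of C = A*B from the two CSR patterns. It assumes zero-based indices and n > 0, the name `csr_gemm_nnz_host` is hypothetical, and it counts every stored entry (including entries whose value is 0.0), so it only says what a pattern-based count should be, not what CUSPARSE does internally.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical host-side reference (not a CUSPARSE call): count the stored
   entries of C = A*B, where A is m x k and B is k x n, both in zero-based CSR.
   Fills c_row_ptr (length m+1) and returns nnz(C), or -1 if malloc fails. */
int csr_gemm_nnz_host(int m, int n,
                      const int *a_row_ptr, const int *a_cols,
                      const int *b_row_ptr, const int *b_cols,
                      int *c_row_ptr)
{
    int i, p, q, count = 0;
    int *marker;

    assert(n > 0);
    marker = (int *)malloc((size_t)n * sizeof(int));
    if (marker == NULL)
        return -1;

    /* marker[j] == i means column j was already counted for row i. */
    for (i = 0; i < n; i++)
        marker[i] = -1;

    c_row_ptr[0] = 0;
    for (i = 0; i < m; i++) {
        for (p = a_row_ptr[i]; p < a_row_ptr[i + 1]; p++) {
            int brow = a_cols[p];   /* column of A selects a row of B */
            for (q = b_row_ptr[brow]; q < b_row_ptr[brow + 1]; q++) {
                int j = b_cols[q];
                assert(j >= 0 && j < n);
                if (marker[j] != i) {
                    marker[j] = i;
                    count++;
                }
            }
        }
        c_row_ptr[i + 1] = count;
    }

    free(marker);
    return count;
}
```

Copying d_arowscsr, d_acols, d_zrowscsr and d_zcols back to the host and passing them with m = dim and n = numvecs should give the value that nnzAZ is expected to match.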

Could someone tell me whether CUSPARSE does not permit the liberty I am trying to take, i.e. providing a sparse matrix with zeroes for values?
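If stored zeroes do turn out to be the problem, the alternative I can think of is to drop them from the COO triplets on the host before copying anything to the device. A minimal sketch in plain C, where `coo_drop_zeros` is a hypothetical helper of my own and not a library function; it compacts the three arrays in place and preserves the order of the remaining entries:

```c
#include <assert.h>

/* Hypothetical helper (not part of CUSPARSE or CUSP): remove entries whose
   value is exactly 0.0 from COO triplets, in place. Returns the new count. */
int coo_drop_zeros(int *rows, int *cols, double *vals, int nnz)
{
    int i, kept = 0;

    assert(nnz >= 0);
    for (i = 0; i < nnz; i++) {
        if (vals[i] != 0.0) {
            /* Shift the surviving entry down to the next free slot. */
            rows[kept] = rows[i];
            cols[kept] = cols[i];
            vals[kept] = vals[i];
            kept++;
        }
    }
    return kept;
}
```

The returned count would then replace dim as the number of non-zeros of Z in the conversion and multiplication calls, while the row count of Z stays at dim.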

Thank you for taking the time to read my query.

rohit