Using spmv with large sparse matrix

Hi, I have some problems using this kernel with a large sparse matrix (640001x640001).
__global__ void spmv_jd(
    const int num_rows,
    const int* ptr,
    const int* indices,
    const float* data,
    const float* x,
    float* y
)
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;

    if(row < num_rows)
    {
        __shared__ float dot;

        int row_start = ptr[row];
        int row_end = ptr[row+1];
        for (int j = row_start; j < row_end; j++)
            dot += data[j]*x[indices[j]];
        y[row] += dot;
    }
}
I know that the kernel isn’t very fast, but it’s only for testing. At the moment I still have problems with the solution when using a large matrix. I think the problem is that I use a 1D thread layout and then hit the 65535 limit on the grid dimension. Does anybody have an idea how to solve it? Perhaps a loop over kernel launches with num_kernel_launch = (N/MAXTHREADS)+1?
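For what it is worth, here is a rough sketch of the chunked-launch idea mentioned above. The extra row_offset parameter and the host loop are hypothetical, not part of the kernel in this post:

__global__ void spmv_jd_chunk(const int num_rows, const int row_offset,
                              const int* ptr, const int* indices,
                              const float* data, const float* x, float* y)
{
    // Same row-wise CSR product as above, but shifted by a per-launch offset.
    int row = row_offset + blockDim.x * blockIdx.x + threadIdx.x;
    if (row < num_rows)
    {
        float dot = 0.0f;
        for (int j = ptr[row]; j < ptr[row + 1]; ++j)
            dot += data[j] * x[indices[j]];
        y[row] = dot;
    }
}

void spmv_chunked(int num_rows, const int* ptr, const int* indices,
                  const float* data, const float* x, float* y)
{
    const int threadsPerBlock = 256;
    const int maxBlocks = 65535;                        // 1D grid limit per launch
    const int rowsPerLaunch = maxBlocks * threadsPerBlock;
    for (int offset = 0; offset < num_rows; offset += rowsPerLaunch)
    {
        int rowsLeft  = num_rows - offset;
        int chunkRows = rowsLeft < rowsPerLaunch ? rowsLeft : rowsPerLaunch;
        int blocks    = (chunkRows + threadsPerBlock - 1) / threadsPerBlock;
        spmv_jd_chunk<<< blocks, threadsPerBlock >>>(num_rows, offset,
                                                     ptr, indices, data, x, y);
    }
}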

You should not declare “__shared__ float dot;”:

int row = blockDim.x * blockIdx.x + threadIdx.x;

if(row < num_rows)
{
    float dot = 0.0f;

    int row_start = ptr[row];
    int row_end = ptr[row+1];
    for (int j = row_start; j < row_end; j++)
        dot += data[j]*x[indices[j]];

    y[row] += dot;
}

Also, a 2D grid can solve your problem: if you use a 2D grid and 256 threads per block, then you can handle 65535 × 65535 × 256 rows.

By the way, why not use CUSPARSE?

Just launch a 2D grid instead of a 1D grid. Grids are laid out in column-major order, so the equivalent row number using 1D blocks just becomes

rowx = blockDim.x * blockIdx.x + threadIdx.x;

row = rowx + blockIdx.y * gridDim.x * blockDim.x;

That gives you a limit of 65535 * 65535 * blockDim.x rows per sparse matrix, i.e. roughly 1e12 rows with 256 threads per block.
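For reference, a rough sketch of a matching host-side launch configuration (assuming 256 threads per block; the pointer names are placeholders, not from the original post):

const int threadsPerBlock = 256;
const int totalBlocks = (num_rows + threadsPerBlock - 1) / threadsPerBlock;
// Split the blocks over a 2D grid so neither dimension exceeds 65535.
const int gridX = totalBlocks < 65535 ? totalBlocks : 65535;
const int gridY = (totalBlocks + gridX - 1) / gridX;
dim3 dimBlock(threadsPerBlock, 1, 1);
dim3 dimGrid(gridX, gridY, 1);
spmv_jd<<< dimGrid, dimBlock >>>(num_rows, d_ptr, d_indices, d_data, d_x, d_y);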

Just as a side comment, I would be very surprised if your kernel actually computed the correct product - using dot before initialising it and declaring it as a shared variable are probably not what you mean to do in this case.

Edit: Lung Sheng beat me by 60 seconds…

Thanks for your reply.

At the moment I am using CUSPARSE, but it is too slow (I am developing a CG solver). I hope that I can write a multiplication algorithm that benefits from my matrix structure (symmetric + positive definite).

I changed it as you told me. I am launching the kernel with:

const int threadsPerBlock = 256;
const int blocksPerGrid = (N/threadsPerBlock)+1;
dim3 dimBlock(threadsPerBlock,1,1);
dim3 dimGrid(blocksPerGrid,1);

where N is 640001 (rows = num_rows). It seems that the kernel computes all elements (the solution vector has size 640001 and no zeros in it), but after 2 iterations of my CG (launching the spmv kernel 3 times), the values differ from the CUSPARSE kernel. I think it is something to do with floating-point precision, because in the first iteration step spmv_jd and cusparseDcsrmv give the same solution, and then they begin to differ with each iteration.

__global__ void spmv_jd(
    const int num_rows,
    const int* ptr,
    const int* indices,
    const double* data,
    const double* x,
    double* y
)
{
    int rowx = blockDim.x * blockIdx.x + threadIdx.x;
    int row = rowx + blockIdx.y * gridDim.x * blockDim.x;

    if(row < num_rows)
    {
        double dot = 0.0;

        int row_start = ptr[row];
        int row_end = ptr[row+1];
        for (int j = row_start; j < row_end; j++)
            dot += data[j]*x[indices[j]];
        y[row] += dot;
    }
}

I am using a Quadro 5000 with sm_20. Does anybody have experience with this issue?

You are still launching a 1D grid, but something else is wrong. 640001 rows should only require 2501 blocks of 256 threads, which will work with a 1D grid.

Are you really sure that dot product formula is correct? Is your intention really to calculate y ← y + data * x?

Hm… I think so. I take all elements of a row, multiply them with the corresponding elements of x, and then add all these results into one result in y. Do you see a mistake there? I am still confused :-)

I tested with a small 7x7 matrix and there the result is correct…

I think that you need to do verification.

In fact, if your initial vector of CG is zero, then of course, CUSPARSE and your kernel report the same vector.

If your matrix A is integer-based, for example a standard Laplacian, then you can try an integer x and compute y = y + A*x; then you will easily see which one is wrong.
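As an illustration of such a check, here is a minimal host-side comparison routine (a sketch; the names h_ref and h_test are placeholders, not from the original code):

#include <cmath>
#include <cstdio>

// Compare two result vectors element by element with a relative tolerance
// and report the first few mismatching rows.
void compare_results(const double* h_ref, const double* h_test, int n, double tol)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
    {
        double denom = fabs(h_ref[i]) > 1e-30 ? fabs(h_ref[i]) : 1.0;
        if (fabs(h_ref[i] - h_test[i]) / denom > tol)
        {
            if (mismatches < 10)
                printf("row %d: ref = %e, test = %e\n", i, h_ref[i], h_test[i]);
            ++mismatches;
        }
    }
    printf("%d of %d rows differ by more than a relative %g\n", mismatches, n, tol);
}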

Thanks for the tip, but I know that the result of my CG using CUSPARSE is fine; only when I change from the CUSPARSE spmv to my spmv is the result wrong. The crazy thing is that after the first iteration (spmv has been called twice, once with the null vector and then with a non-zero vector) the result of my spmv is equal to the CUSPARSE spmv. Here is my implementation:

// compute Ax
//cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.0, descr, d_val, d_row, d_col, d_x, 0.0, d_Ax);
spmv_jd<<< dimGrid, dimBlock >>>(N, d_row, d_col, d_val, d_x, d_Ax);

// compute -Ax+b
cublasDaxpy(N, -1.0, d_Ax, 1, d_b, 1);

// r0 = (b-Ax)
cublasDcopy(N, d_b, 1, d_r, 1);

// p0 = (-)r0
cublasDcopy(N, d_r, 1, d_p, 1);

k = 0;
res = cublasDdot(N, d_r, 1, d_r, 1);
printf("res %Lg\n", res);

while((res > tol*tol) && (k < max_iter))
{
    // compute Ap
    //cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.0, descr, d_val, d_row, d_col, d_p, 0.0, d_Ap); // here I can change between my spmv and CUSPARSE
    spmv_jd<<< dimGrid, dimBlock >>>(N, d_row, d_col, d_val, d_p, d_Ap);

    cudaMemcpy(h_test, d_Ap, N*sizeof(double), cudaMemcpyDeviceToHost); // here I copy back the result of spmv and compare it with the result of cusparseDcsrmv
    writeResult(h_test, "Solution_test.txt", N);

    // compute r dot r
    res = cublasDdot(N, d_r, 1, d_r, 1);

    // compute (r dot r) / (Ap dot p)
    a = res / cublasDdot(N, d_Ap, 1, d_p, 1);

    // compute x = x + a*p
    cublasDaxpy(N, a, d_p, 1, d_x, 1);

    // compute r = r - a*Ap
    cublasDaxpy(N, -a, d_Ap, 1, d_r, 1);

    // compute (r1 dot r1) / (r0 dot r0)
    b = cublasDdot(N, d_r, 1, d_r, 1) / res;

    // compute pb = b*p
    cublasDscal(N, b, d_p, 1);

    // compute p = r + pb
    cublasDaxpy(N, 1, d_r, 1, d_p, 1);

    printf("iteration = %3d, residual = %e\n", k, sqrt(res));
    k++;
}

My matrix has floating-point values and uses zero-based indexing for all three arrays. I have to use double precision because without it, CG does not converge.

You should modify your kernel to do y := A*x, not y := y + A*x:

if(row < num_rows)
{
    double dot = 0.0;

    int row_start = ptr[row];
    int row_end = ptr[row+1];
    for (int j = row_start; j < row_end; j++)
        dot += data[j]*x[indices[j]];

    y[row] = dot;
}

In your original kernel, the variable d_Ax is not A*p{k} but

d_Ax := A*( p{0} + p{1} + p{2} + … + p{k} )

Of course, if p{k} is still A-conjugate, then cublasDdot(N,d_Ap,1,d_p,1) gives you the same value,

but you still have different values of x{k} and r{k}:

cublasDaxpy(N,a,d_p,1,d_x,1);

cublasDaxpy(N, -a,d_Ap,1,d_r,1);

Are you sure that the results x{k}, r{k} are the same as those of CUSPARSE?
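A small sketch of an alternative that does not require changing the kernel (not from the original thread): keep y[row] += dot but clear the output vector before every SpMV call.

cudaMemset(d_Ap, 0, N * sizeof(double));   // zero the output before the accumulating kernel runs
spmv_jd<<< dimGrid, dimBlock >>>(N, d_row, d_col, d_val, d_p, d_Ap);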

Thanks a lot, you are so right. I changed it and now it gives the correct result. Now I hope that I can improve the kernel to be faster than CUSPARSE. Perhaps there is an opportunity to get faster with an approximate inverse as a preconditioner for my CG.

Parallel calculation and application of preconditioners in CUDA is still an open research topic for the most part. A Jacobi preconditioner is normally about all that can be easily done, and that tends to be a rather poor option for anything other than very diagonally dominant matrices.
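For illustration, a minimal sketch of applying a Jacobi (diagonal) preconditioner z = M^-1 * r, assuming the diagonal of A has already been extracted into a separate array (the kernel name and variable names are hypothetical):

__global__ void jacobi_apply(const int num_rows,
                             const double* diag,   // diagonal entries of A
                             const double* r,      // current residual
                             double* z)            // preconditioned residual
{
    int row = blockDim.x * blockIdx.x + threadIdx.x;
    if (row < num_rows)
        z[row] = r[row] / diag[row];   // the diagonal of an SPD matrix is non-zero
}

In preconditioned CG, the scalar products r·r are then replaced by r·z and the search direction is updated as p = z + b*p.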