Hi I have some problems using this kernel and large sparse matrix(640001x640001). global void spmv_jd(
const int num_rows,
const int* ptr,
const int* indices,
const float* data,
const float* x,
float* y
)
{
int row = blockDim.x * blockIdx.x + threadIdx.x;

if(row < num_rows)
{ shared float dot;

int row_start = ptr[row];
int row_end = ptr[row+1];
for (int j = row_start; j < row_end; j ++)
dot += data[j]*x[indices[j]];
y[row] += dot ;

}
I know that the kernel isn’t very fast, but it’s only for testing. But at the moment I have still problmes with the solution, using a large matrix. I think the problem is using 1D thread aligment and then reaching the limit of 65535 threads. Has anybody an idea to solve it. Perhaps a loop for Kernel launch with num_kernel_launch = (N/MAXTHREADS)+1?

That gives you a limit of 65535 * 65535 * blockDim.x rows per sparse matrix, ie. something of the order of 400e9 rows.

Just as a side comment, I would be very surprised if your kernel actually computed the correct product - using dot before initialising it and declaring it as a shared variable are probably not what you mean to do in this case.

At the moment I am using CUSPARSE but it is too slow( I develope a CG-Solver). I hope that I can write an multiplikation algorithm which can benefit of my matrix design(symmetric+positiv defenit)

I changed how you told me. I am launching the kernel with:

const int threadsPerBlock = 256;
const int blocksPerGrid = (N/threadsPerBlock)+1;
dim3 dimBlock(threadsPerBlock,1,1);
dim3 dimGrid(blocksPerGrid,1);

where N is 640001(rows=num_rows). It seem that the kernel computes all elements( solution vector has size 640001 and no zeros in it), but after 2 Interations of my CG (lauching 3 times spmv kernel), the vallues differs from the the CUSPARSE kernel. I think its something with the floting point precision. Because in the first iteration step spmv_jd and cusparseDcsrmv has the the same solution. And with then they begin to differ with each iteration.

You are still launching a 1D grid, but something else is wrong. 640001 rows should only require 2501 blocks of 256 threads, which will work with a 1D grid.

Are you really sure that dot product formula is correct? Is your intention really to calculate y <- y + data * x?

Hm… my I think so. I take all elments of a row and multiply them with the correspond elemts of x and then I add all this results to one result in y. Do you see a mistake there? I am still confused :-)

I tested with a small 7x7 matrix and there is the result correct…

Thanks for the tipp, but i know that the result of my CG using CUSPARSE is fine only when i change from CUSPARSE smpy to my spmv the result is wrong. Crazy is that after the first iteration(spmy has called 2x times one with the null vektor and then with a non zero vector) the result of my spmv is equal to the CUSPARSE spmv. Here is my implementation

//compute Ax
//cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.0, descr, d_val, d_row, d_col, d_x, 0.0, d_Ax);
spmv_jd<<< dimGrid, dimBlock >>>(N, d_row, d_col, d_val, d_x, d_Ax);
//compute -Ax+b
cublasDaxpy(N, -1.0, d_Ax, 1, d_b, 1);
//r0 = (b-Ax)
cublasDcopy(N,d_b,1,d_r,1);
// p0 = (-)r0
cublasDcopy(N,d_r,1,d_p,1);
k = 0;
res = cublasDdot(N,d_r,1,d_r,1);
printf("res %Lg\n",res);
while((res > tol*tol) && (k < max_iter))
{
//compute Ap
//cusparseDcsrmv(handle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, 1.0, descr, d_val, d_row, d_col, d_p, 0.0, d_Ap); //here can i change between my spmv and CUSPARSE
spmv_jd<<< dimGrid, dimBlock >>>(N, d_row, d_col, d_val, d_p, d_Ap);
cudaMemcpy(h_test, d_Ap, N*sizeof(double), cudaMemcpyDeviceToHost); //here i copy back the result of spmv and compare it with the result of cusparseDcsrmv
writeResult (h_test, "Solution_test.txt", N);
// compute r dot r
res = cublasDdot(N,d_r,1,d_r,1);
//compute (r dot r) / (Ap dot p)
a = res / cublasDdot(N,d_Ap,1,d_p,1);
//compute x = x + a*Ap
cublasDaxpy(N,a,d_p,1,d_x,1);
//compute r = r - a*Ap
cublasDaxpy(N, -a,d_Ap,1,d_r,1);
//compute (r1 dot r1) / (r0 dot r0)
b = cublasDdot(N,d_r,1,d_r,1)/ res;
//compute pb = b*p
cublasDscal(N,b,d_p,1);
//compute p = r + pb
cublasDaxpy(N,1,d_r,1,d_p,1);
printf("iteration = %3d, residual = %e\n", k, sqrt(res));
k++;
}

My matrix has float values and has zero indexing for all 3 arrays. I have to use double precision because without, CG does not converge.

Thanks a lot, your are so right. I change it and now it gives the correct result. Now I hope that I can improve the kernel to be faster as CUSPARSE. Perhaps there is a opportunity to getting faster with an approximate inverse as a preconditioner for my CG.

Parallel calculation and application of preconditioners in CUDA is still an open research topic for the most part. A Jacobi preconditioner is normally about all that can be easily done, and that tends to be a rather poor option for anything other than very diagonally dominant matrices.